Hybrid Collaborative Filtering with Autoencoders
Abstract
Collaborative Filtering aims at exploiting the feedback of users to provide personalised recommendations. Such algorithms look for latent variables in a large sparse matrix of ratings. They can be enhanced by adding side information to tackle the well-known cold start problem. While Neural Networks have had tremendous success in image and speech recognition, they have received less attention in Collaborative Filtering. This is all the more surprising as Neural Networks are able to discover latent variables in large and heterogeneous datasets. In this paper, we introduce a Collaborative Filtering Neural network architecture, dubbed CFN, which computes a nonlinear Matrix Factorization from sparse rating inputs and side information. We show experimentally on the MovieLens and Douban datasets that CFN outperforms the state of the art and benefits from side information. We provide an implementation of the algorithm as a reusable plugin for Torch, a popular Neural Network framework.
1 Introduction
Recommendation systems advise users on which items (movies, music, books, etc.) they are more likely to be interested in. A good recommendation system may dramatically increase the sales of a firm or help retain customers. For instance, 80% of movies watched on Netflix come from the company's recommender system [Netflix2015]. One efficient way to design such an algorithm is to predict how a user would rate a given item. Two key methods coexist to tackle this issue: Content-Based Filtering (CBF) and Collaborative Filtering (CF).
CBF uses user/item knowledge to estimate a new rating. For instance, user information can be the age, the gender, or a graph of friends; item information can be the movie genre, a short description, or tags. On the other hand, CF uses the rating history of users and items. The feedback of one user on some items is combined with the feedback of all other users on all items to predict a new rating. For instance, if someone rated a few books, Collaborative Filtering aims at estimating the ratings they would have given to thousands of other books by using the ratings of all the other readers. CF is often preferred to CBF because it wins the agnostic-vs-studied contest: CF only relies on the ratings of the users, while CBF requires advanced engineering on items to perform well [Lops2011].
The most successful approach in CF is to retrieve potential latent factors from the sparse matrix of ratings. Book latent factors are likely to encapsulate the book genre (spy novel, fantasy, etc.) or some writing styles. Common latent factor techniques compute a low-rank rating matrix by applying Singular Value Decomposition through gradient descent [Koren2009] or the Regularized Alternating Least Squares algorithm [Zhou2008]. However, these methods are linear and cannot capture subtle factors. Newer algorithms were explored to overcome those constraints, such as Factorization Machines [Rendle2010]. More recent works combine several low-rank matrices, such as Local Low-Rank Matrix Approximation [Lee2013] or WEMAREC [Chen2015], to enhance the recommendation.
Another limitation of CF is known as the cold start problem: how to recommend an item to a user when no rating exists for either the user or the item? To overcome this issue, one idea is to build a hybrid model mixing CF and CBF where side information is integrated into the training process. The goal is to supplant the lack of ratings with side information. A successful approach [Adams2010, Porteous2010] extends the Bayesian Probabilistic Matrix Factorization framework [Salakhutdinov2008] to integrate side information. However, recent algorithms outperform it in the general case [Lee2012].
In this paper we introduce a CF approach based on Stacked Denoising Autoencoders [Vincent2010] which tackles both challenges: learning a nonlinear representation of users and items, and alleviating the cold start problem by integrating side information. Compared to previous attempts in that direction [Salakhutdinov2007, Sedhain2015, Strub2015, Dziugaite2015, Wu2016], our framework integrates the sparse matrix of ratings and the side information in a unique Network. This joint model leads to a scalable and robust approach which beats state-of-the-art results in CF. Reusable source code is provided in Torch to reproduce the results. Last but not least, we show that CF approaches based on Matrix Factorization have a strong link with our approach.
The paper is organized as follows. First, Sec. 2 summarizes the state of the art in CF and Neural Networks. Then, our model is described in Sec. 3 and 4, and its relation to Matrix Factorization is characterized in Sec. 3.2. Finally, experimental results are given and discussed in Sec. 5, and Sec. 6 discusses algorithmic aspects.
2 Preliminaries
2.1 Denoising Autoencoders
The proposed approach builds upon Autoencoders, which are feedforward Neural Networks popularized by Kramer [Kramer1991]. They are unsupervised Networks: the output of the Network aims at reconstructing the initial input. The Network is constrained to use narrow hidden layers, forcing a dimensionality reduction of the data. The Network is trained by backpropagating the squared reconstruction error. Such Networks are divided into two parts:

the encoder: $\mathrm{enc}(x) = \sigma(W_1 x + b_1)$,

the decoder: $\mathrm{dec}(z) = \sigma(W_2 z + b_2)$,
with $x \in \mathbb{R}^N$ the input, $\mathrm{dec}(\mathrm{enc}(x))$ the output, $k$ the size of the Autoencoder's bottleneck ($k < N$), $W_1 \in \mathbb{R}^{k \times N}$ and $W_2 \in \mathbb{R}^{N \times k}$ the weight matrices, $b_1 \in \mathbb{R}^{k}$ and $b_2 \in \mathbb{R}^{N}$ the bias vectors, and $\sigma$ a nonlinear transfer function. The full Autoencoder will be denoted $\mathrm{nn}(x) = \mathrm{dec}(\mathrm{enc}(x))$.
Recent work in Deep Learning advocates stacking pretrained encoders to initialize Deep Neural Networks [Glorot2010]. This process enables the lowest layers of the Network to find low-dimensional representations and experimentally increases the quality of the whole Network. Yet, classic Autoencoders may degenerate into identity Networks and fail to learn the latent relationship between data. [Vincent2010] tackles this issue by corrupting inputs, pushing the Network to denoise the final outputs. One method is to add Gaussian noise to a random fraction of the input. Another method is to mask a random fraction of the input by replacing entries with zero. In this case, the Denoising AutoEncoder (DAE) loss function is modified to emphasize the denoising aspect of the Network. The loss is based on two main hyperparameters, $\alpha$ and $\beta$, which balance whether the Network focuses on denoising the input ($\alpha$) or reconstructing the input ($\beta$):
$$\mathcal{L}_{\alpha,\beta}(x, \tilde{x}) = \alpha \sum_{j \in \mathcal{C}(\tilde{x})} \left[\mathrm{nn}(\tilde{x})_j - x_j\right]^2 + \beta \sum_{j \notin \mathcal{C}(\tilde{x})} \left[\mathrm{nn}(\tilde{x})_j - x_j\right]^2,$$

where $\tilde{x}$ is a corrupted version of the input $x$, $\mathcal{C}(\tilde{x})$ is the set of corrupted elements of $\tilde{x}$, and $\mathrm{nn}(\tilde{x})$ is the output of the Network while fed with $\tilde{x}$.
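To make the masking-noise corruption and the $\alpha$/$\beta$-weighted loss concrete, here is a minimal NumPy sketch; all names and values are ours for illustration, not taken from the paper's Torch implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, mask_ratio=0.25):
    """Mask a random fraction of the input by replacing entries with zero."""
    mask = rng.random(x.shape) < mask_ratio   # True on corrupted entries C(x~)
    return np.where(mask, 0.0, x), mask

def dae_loss(x, x_hat, mask, alpha=1.0, beta=0.5):
    """alpha weights the denoised (corrupted) entries, beta the others."""
    err = (x_hat - x) ** 2
    return alpha * err[mask].sum() + beta * err[~mask].sum()

x = np.array([0.5, -0.2, 0.8, 0.1])
x_tilde, mask = corrupt(x)
loss = dae_loss(x, np.zeros_like(x), mask)    # e.g. against a zero output
```

Setting $\alpha > \beta$ tilts the objective towards predicting the masked entries, which is exactly the behaviour exploited later for rating prediction.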
2.2 Matrix Factorization
One of the most successful approaches in Collaborative Filtering is Matrix Factorization [Koren2009]. This method retrieves latent factors from the ratings given by the users. The underlying idea is that key features are hidden in the ratings themselves. Given $N$ users and $M$ items, the rating $r_{ij}$ is the rating given by user $i$ for item $j$. It entails a sparse matrix of ratings $R \in \mathbb{R}^{N \times M}$. In Collaborative Filtering, sparsity is originally produced by missing values rather than zero values. The goal of Matrix Factorization is to find a low-rank matrix $\hat{R} = UV^\top$ with $U \in \mathbb{R}^{N \times k}$ and $V \in \mathbb{R}^{M \times k}$ two matrices of rank $k$ encoding a dense representation of the users/items. In its simplest form, $(U, V)$ is the solution of
$$(U, V) = \operatorname*{argmin}_{U, V} \sum_{(i,j) \in \mathcal{K}(R)} \left(r_{ij} - u_i^\top v_j\right)^2 + \lambda\left(\|U\|_F^2 + \|V\|_F^2\right),$$

where $\mathcal{K}(R)$ is the set of indices of known ratings of $R$, $(u_i, v_j)$ are the dense and low-rank rows of $(U, V)$ and $\|\cdot\|_F$ is the Frobenius norm. Vectors $u_i$ and $v_j$ are treated as column vectors.
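As an illustration of this objective, a few lines of NumPy fit $(U, V)$ by stochastic gradient descent over the known entries only; this is a toy sketch with synthetic data, not the ALS solver used in the benchmarks:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, k = 8, 6, 3

# Toy low-rank ratings; only about half of the entries are observed.
R = rng.normal(size=(N, k)) @ rng.normal(size=(k, M))
known = rng.random((N, M)) < 0.5

U = 0.1 * rng.normal(size=(N, k))
V = 0.1 * rng.normal(size=(M, k))
lr, lam = 0.05, 1e-3

for _ in range(1000):
    for i, j in zip(*np.nonzero(known)):      # sum only over known ratings
        e = R[i, j] - U[i] @ V[j]             # residual r_ij - u_i^T v_j
        U[i] += lr * (e * V[j] - lam * U[i])
        V[j] += lr * (e * U[i] - lam * V[j])

train_mse = np.mean([(R[i, j] - U[i] @ V[j]) ** 2
                     for i, j in zip(*np.nonzero(known))])
```

The per-entry gradient steps are the standard SGD updates for this loss; ALS instead solves a closed-form least-squares problem alternately in $U$ and $V$.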
2.3 Related Work
Neural Networks have attracted little attention in the CF community. In a preliminary work, [Salakhutdinov2007] tackled the Netflix challenge using Restricted Boltzmann Machines, but little published work has followed [Phung2009]. While Deep Learning has had tremendous success in image and speech recognition [Lecun2015], sparse data has received less attention and remains a challenging problem for Neural Networks.
Nevertheless, Neural Networks are able to discover nonlinear latent variables with heterogeneous data [Lecun2015], which makes them a promising tool for CF. [Sedhain2015, Strub2015, Dziugaite2015] directly train Autoencoders to provide the best predicted ratings. Those methods report excellent results in the general case. However, the cold start initialization problem is ignored. For instance, AutoRec [Sedhain2015] replaces unpredictable ratings by an arbitrarily selected score. In our case, we apply a training loss designed for sparse rating inputs and we integrate side information to lessen the cold start effect.
Other contributions deal with this cold start problem by using Neural Network properties for CBF: Neural Networks are first trained to learn a feature representation from the items, which is then processed by a CF approach such as Probabilistic Matrix Factorization [Mnih2007] to provide the final rating. For instance, [Glorot2011, Wang2014] respectively autoencode bags-of-words from restaurant reviews and movie plots, and [Li2015] autoencode heterogeneous side information from users and items. Finally, [Van2013, Wang2014b] use Convolutional Networks on music samples. In our case, side information and ratings are used together without any unsupervised pretreatment.
2.4 Notation
In the rest of the paper, we will use the following notations:

$u_i$, $v_j$ are the sparse rows/columns of $R$;

$\tilde{u}_i$, $\tilde{v}_j$ are corrupted versions of $u_i$, $v_j$;

$\hat{u}_i$, $\hat{v}_j$ are dense estimates of $u_i$, $v_j$;

$\hat{U}$, $\hat{V}$ are dense low-rank representations of users, items.
3 Autoencoders and CF
User preferences are encoded as a sparse matrix of ratings $R$. A user $i$ is represented by a sparse row $u_i$ and an item $j$ is represented by a sparse column $v_j$. The Collaborative Filtering objective can be formulated as: turn the sparse vectors $u_i$/$v_j$ into dense vectors $\hat{u}_i$/$\hat{v}_j$.
We propose to perform this conversion with Autoencoders. To do so, we need to define two types of Autoencoders:

U-CFN is defined as $\hat{u}_i = \mathrm{nn}(u_i)$,

V-CFN is defined as $\hat{v}_j = \mathrm{nn}(v_j)$.
The encoding part of these Autoencoders aims at building a low-rank dense representation of the sparse input of ratings. The decoding part aims at predicting a dense vector of ratings from the low-rank dense representation of the encoder. This new approach differs from classic Autoencoders, which only aim at reconstructing/denoising the input. As we will see later, the training loss will therefore differ from the evaluation loss.
3.1 Sparse Inputs
There is no standard approach for using sparse vectors as inputs of Neural Networks. Most papers dealing with sparse inputs get around the problem by precomputing an estimate of the missing values [Tresp1994, Bishop1995]. In our case, we want the Autoencoder to handle this prediction by itself. Such problems have already been studied in industry [Miranda2012], where 5% of the values are missing. However, in Collaborative Filtering we often face datasets with more than 95% missing values. Furthermore, missing values are not known during training in Collaborative Filtering, which makes the task even more difficult.
Our approach includes three ingredients to handle the training of sparse Autoencoders:

inhibit the edges of the input layers by zeroing out values in the input,

inhibit the edges of the output layers by zeroing out backpropagated values,

use a denoising loss to emphasize rating prediction over rating reconstruction.
One way to inhibit the input edges is to turn missing values to zero. To keep the Autoencoder from always returning zero, we also use an empirical loss that disregards the loss of unknown values: no error is backpropagated for missing values. Therefore, the error is backpropagated for actual zero values while it is discarded for missing values. In other words, missing values do not bring information to the Network. This operation is equivalent to removing the neurons with missing values described in [Salakhutdinov2007, Sedhain2015]. However, our method has an important computational advantage: only one Neural Network is trained, whereas the other techniques have to share weights among thousands of Networks.
Finally, we take advantage of the masking noise from the Denoising AutoEncoder (DAE) empirical loss. By simulating missing values in the training process, Autoencoders are trained to predict them. In Collaborative Filtering, this prediction aspect is actually the final target. Thus, emphasizing the prediction criterion turns the classic unsupervised training of Autoencoders into a simulated supervised learning. By mixing both the reconstruction and prediction criteria, the training can be thought of as pseudo-semi-supervised learning. This makes the DAE loss a promising objective function. After regularization, the final training loss is:
$$\mathcal{L}_{\alpha,\beta}(x, \tilde{x}) = \alpha \Big(\sum_{j \in \mathcal{K}(x) \cap \mathcal{C}(\tilde{x})} \left[\mathrm{nn}(\tilde{x})_j - x_j\right]^2\Big) + \beta \Big(\sum_{j \in \mathcal{K}(x) \setminus \mathcal{C}(\tilde{x})} \left[\mathrm{nn}(\tilde{x})_j - x_j\right]^2\Big) + \lambda \|\mathbf{W}\|_2^2,$$

where $\mathcal{K}(x)$ are the indices of known values of $x$, $\mathbf{W}$ is the flattened vector of weights of the Network and $\lambda$ is the regularization hyperparameter. The full forward/backward process is explained in Figure 1. Importantly, Autoencoders with sparse inputs differ from sparse Autoencoders [Lee2006] or Dropout regularization [Srivastava2014]: sparse Autoencoders and Dropout inhibit the hidden neurons for regularization purposes, and every input/output is known there.
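The three ingredients can be sketched in a few lines of NumPy for a one-hidden-layer network; variable names are ours for illustration (the paper's actual implementation is the Torch plugin):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, k = 12, 4
W1 = 0.1 * rng.normal(size=(k, n_items)); b1 = np.zeros(k)
W2 = 0.1 * rng.normal(size=(n_items, k)); b2 = np.zeros(n_items)

u = np.full(n_items, np.nan)               # sparse rating row, NaN = missing
u[[1, 3, 4, 7, 9]] = [0.5, -0.5, 1.0, 0.0, 0.5]
known = ~np.isnan(u)

x = np.where(known, u, 0.0)                # (1) inhibit input edges
masked = known & (rng.random(n_items) < 0.3)
x_tilde = np.where(masked, 0.0, x)         # (3) masking noise on known entries

z = np.tanh(W1 @ x_tilde + b1)             # forward pass
x_hat = W2 @ z + b2

alpha, beta = 1.0, 0.5
w = np.where(masked, alpha, beta)          # (3) denoising loss weights
delta = np.where(known, 2.0 * w * (x_hat - x), 0.0)  # (2) inhibit output edges
grad_W2 = np.outer(delta, z)               # gradient for the decoder weights
```

Note that `delta` is exactly zero on missing entries, so those output edges contribute nothing to the weight gradients, which mirrors the "no error is backpropagated for missing values" rule above.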
3.2 Low Rank Matrix Factorization
Autoencoders are actually strongly linked with Matrix Factorization. For an Autoencoder with only one hidden layer and no output transfer function, the response of the network is $\mathrm{nn}(x) = W_2\,\sigma(W_1 x + b_1) + b_2$, where $W_1, W_2$ are the weight matrices and $b_1, b_2$ the bias terms. Let $z_i = \sigma(W_1 u_i + b_1)$ be the representation of the user $i$; then we recover a predicted vector of the form:

$$\hat{u}_i = W_2 z_i + b_2.$$

Symmetrically, $\hat{v}_j$ has the form:

$$\hat{v}_j = W_2' z_j + b_2' \quad \text{with} \quad z_j = \sigma(W_1' v_j + b_1').$$

The difference with standard Low-Rank Matrix Factorization stands in the definition of $U$/$V$. For Matrix Factorization by ALS, $\hat{R}$ is iteratively built by solving, for each row of $R$ (resp. column of $R$), a linear least-squares regression using the known values of that row (resp. column) as observations of a scalar product in dimension $k$ between $u_i$ and the corresponding columns of $V^\top$ (resp. $v_j$ and the corresponding rows of $U$). An Autoencoder aims at a projection in dimension $k$ composed with the nonlinearity $\sigma$. This process corresponds to a nonlinear matrix factorization.
Note that CFN also breaks the symmetry between users and items. For example, while Matrix Factorization approaches learn both $U$ and $V$, U-CFN learns an item representation (through $W_2$) and only indirectly learns the user representations: U-CFN targets the function $u_i \mapsto \hat{u}_i$ that builds $\hat{u}_i$ whatever the row $u_i$. A nice benefit is that the learned Autoencoder is able to fill in every vector $u_i$, even if that vector was not in the training data.
Both nonlinear decompositions on rows and columns are done independently, which means that the matrix learned by U-CFN from rows can differ from the concatenation of the vectors predicted by V-CFN from columns.
Finally, it is very important to differentiate CFN from Restricted Boltzmann Machines (RBM) for Collaborative Filtering [Salakhutdinov2007]. By construction, RBM only handles binary input. Thus, one has to discretize the ratings of users/items for both the input and output layers. First, it strictly limits the use of RBM on databases with real-valued ratings. Secondly, the resulting weight architecture clearly differs from CFN: in RBM, each input/output rating is encoded by $d$ weights, where $d$ is the number of discretized rating values, while CFN only requires a single weight. Thus, no direct link can be drawn between Matrix Factorization and RBM. Besides, this architecture also prevents RBM from being used to initialize the input/output layers of CFN.
4 Integrating side information
Collaborative Filtering only relies on the feedback of users regarding a set of items. When additional information is available for the users and the items, this may sound restrictive. One would think that adding more information could help in several ways: increasing the prediction accuracy, speeding up the training, increasing the robustness of the model, etc. Furthermore, pure Collaborative Filtering suffers from the cold start problem: when very little information is available on an item, Collaborative Filtering will have difficulties recommending it. When bad recommendations are provided, the probability of receiving valuable feedback is lowered, leading to a vicious circle for new items. A common way to tackle this problem is to add some side information to ensure a better initialization of the system. This is known in the recommendation community as hybridization.
The simplest approach to integrate side information is to append additional user/item biases to the rating prediction [Koren2009]:

$$\hat{r}_{ij} = u_i^\top v_j + b_i + b_j + b,$$

where $b_i$, $b_j$, $b$ are respectively the user, item, and global bias of the Matrix Factorization. Computing these biases can be done through handcrafted engineering or Collaborative Filtering techniques. For instance, one method is to extend the dense feature vectors of rank $k$ by directly appending the side information to them [Porteous2010]. Therefore, the estimated rating is computed by:
$$\hat{r}_{ij} = [u_i ; x_i]^\top [v_j ; y_j],$$

where $x_i$ and $y_j$ respectively are a vector representation of the side information for the user $i$ and for the item $j$, and $[\cdot\,;\cdot]$ denotes concatenation. Unfortunately, those methods cannot be directly applied to Neural Networks because Autoencoders optimize $U$ and $V$ independently. New strategies must be designed to incorporate side information. One notable example was recently given by [Ammar2014] for bitext word alignment.
In our case, the first idea would be to append the side information to the sparse input vector. For simplicity, the next equations will only focus on shallow U-Autoencoders with no output transfer function. Yet, this can be extended to more complex Networks and V-Autoencoders. Therefore, we get:

$$\hat{u}_i = W_2\,\sigma\big(W_1 u_i + P x_i + b_1\big) + b_2,$$

where $P$ is a weight matrix applied to the side information $x_i$.
When no previous ratings exist, this enables the Neural Network to still have an input from which to predict new ratings. In this scenario, side information is assimilated to pseudo-ratings that always exist for every item. However, when the dimension of the Neural Network input is far greater than the dimension of the side information, the Autoencoder may have difficulty using it efficiently.
Yet, common Matrix Factorization approaches would append side information to dense feature representations rather than to the sparse representation, as we just proposed. A solution to reproduce this idea is to inject the side information into the input of every layer of the Network:

$$\hat{u}_i = W_2 \big[\sigma\big(W_1 [u_i ; x_i] + b_1\big) ; x_i\big] + b_2 = W_2'\,\sigma\big(W_1 [u_i ; x_i] + b_1\big) + W_2'' x_i + b_2,$$

where $W_2'$, $W_2''$ are respectively the submatrices of $W_2$ that contain the columns from $1$ to $k$ and from $k{+}1$ to $k{+}d$, with $d$ the dimension of the side information.
By injecting the side information into every layer, the Autoencoder representation is forced to integrate this new data. However, to avoid the side information overstepping the dense rating representation, we enforce the following constraint: the dimension of the sparse input must be greater than the dimension of the Autoencoder bottleneck, which must itself be greater than the dimension of the side information (when side information is sparse, its dimension can be assimilated to the number of nonzero parameters). Therefore, we get $N > k > d$.
We finally obtain an Autoencoder which can incorporate side information and be trained through backpropagation. See Figure 2 for a graphical representation of the corresponding network.
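A minimal NumPy sketch of this forward pass (one hidden layer, our own variable names; the actual model is the deeper Torch network of the experiments) shows how the side information reaches every layer, including for a cold-start input with no rating at all:

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, k, d = 20, 8, 3          # must satisfy n_items > k > d

W1 = 0.1 * rng.normal(size=(k, n_items + d)); b1 = np.zeros(k)
W2 = 0.1 * rng.normal(size=(n_items, k + d)); b2 = np.zeros(n_items)

def forward(u_sparse, side):
    """Append the side information to the input of every layer."""
    z = np.tanh(W1 @ np.concatenate([u_sparse, side]) + b1)
    return W2 @ np.concatenate([z, side]) + b2

cold = np.zeros(n_items)          # cold-start user: no rating at all
u_hat = forward(cold, rng.normal(size=d))
```

Even with an all-zero rating vector, the prediction depends on the side information through both layers, which is the mechanism that alleviates the cold start problem.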
5 Experiments
5.1 Benchmark Models
We benchmark CFN with five matrix completion algorithms:

ALS-WR (Alternating Least Squares with Weighted Regularization) [Zhou2008] solves the low-rank matrix factorization problem by alternately fixing $U$ and $V$ and solving the resulting linear regression problems. Experiments are run with Apache Mahout (http://mahout.apache.org/). We use a rank of 200;

SVDFeature [Chen2012] learns a feature-based matrix factorization: side information is used to predict the bias terms and to reweight the matrix factorization. We use a rank of 64 and tune the other hyperparameters by random search;

BPMF (Bayesian Probabilistic Matrix Factorization) [Salakhutdinov2008] infers the matrix decomposition from a statistical model. We use a rank of 10;

LLORMA [Lee2013] estimates the rating matrix as a weighted sum of low-rank matrices. Experiments are run with the Prea API (http://prea.gatech.edu/). We use a rank of 20 and 30 anchor points, which entails a global pseudo-rank of 600. Other hyperparameters are picked as recommended in [Lee2013];

I-Autorec [Sedhain2015] trains one Autoencoder per item, sharing the weights between the different Autoencoders. We use 600 hidden neurons with the training hyperparameters recommended by the authors.
In every scenario, we selected the highest possible rank which does not lead to overfitting despite strong regularization. For instance, increasing the rank of BPMF does not significantly improve the final RMSE; idem for SVDFeature. Furthermore, we constrained the algorithms to run in less than two days. Similar benchmarks can be found in the literature [Li2016, Chen2015, Lee2013].
5.2 Data
Experiments are conducted on the MovieLens and Douban datasets. The MovieLens-1M, MovieLens-10M and MovieLens-20M datasets respectively provide 1/10/20 million discrete ratings from 6/72/138 thousand users on 4/10/27 thousand movies. Side information for MovieLens-1M is the age, gender and occupation of the user and the movie category (action, thriller, etc.). Side information for MovieLens-10/20M is a matrix of tags $T$, where $t_{jk}$ is the occurrence of tag $k$ for movie $j$, and the movie category. No side information is provided for users.
The Douban dataset [Hao2011] provides 17 million discrete ratings from 129 thousand users on 58 thousand movies. Side information is the bidirectional user/friend relations for users. The user/friend relations are treated like the matrix of tags from MovieLens. No side information is provided for items.
Pre/postprocessing
For each dataset, the full dataset is considered and the ratings are normalized from −1 to 1. We split the dataset into random 90%–10% train–test sets and inputs are unbiased before the training process: denoting the global mean over the training set, the mean of the user and the mean of the item, U-CFN and V-CFN respectively learn from ratings unbiased by the user mean and by the item mean. The bias computed on the training set is added back while evaluating the learned matrix.
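The scaling and unbiasing steps can be sketched as follows in NumPy; the exact bias scheme here (per-user mean, as used for U-CFN) and the toy ratings are illustrative:

```python
import numpy as np

R = np.full((5, 4), np.nan)                # sparse ratings, NaN = missing
R[[0, 0, 1, 2, 3, 4], [0, 2, 1, 3, 0, 2]] = [4, 3, 5, 2, 1, 4]

# Scale discrete ratings (1..5) to [-1, 1].
R_scaled = (R - 3.0) / 2.0

# Per-user mean over known training ratings (U-CFN input unbiasing).
user_mean = np.nanmean(R_scaled, axis=1, keepdims=True)
R_unbiased = R_scaled - user_mean          # NaNs stay NaN

# At evaluation time, the bias is added back to the prediction.
pred = np.zeros_like(R_scaled) + user_mean
```

V-CFN would symmetrically subtract the per-item mean along axis 0.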
Side Information
In order to enforce the side information constraint $N > k > d$, Principal Component Analysis is performed on the matrix of tags. We keep the eigenvectors associated with the 50 largest eigenvalues (the number of eigenvalues is arbitrarily selected; we do not focus on optimizing the quality of this representation) and normalize them by the square root of their respective eigenvalues: given the eigendecomposition $T^\top T = Q \Lambda Q^\top$, with $\Lambda$ the diagonal matrix of eigenvalues sorted in descending order, the movie tags are represented by the first $n$ columns of $T Q \Lambda^{-1/2}$, with $n$ the number of kept eigenvectors. Binary representations such as the movie category are then concatenated to this representation.
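One plausible NumPy rendering of this preprocessing uses the SVD of the (centered) tag matrix, which is equivalent to PCA; the tag data, dimensions and the binary genre block are ours for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.poisson(0.3, size=(40, 120)).astype(float)   # movies x tags (toy data)
n = 10                                               # number of kept eigenvectors

Tc = T - T.mean(axis=0)                   # center before PCA
U_svd, s, Qt = np.linalg.svd(Tc, full_matrices=False)
# (Tc @ Qt.T)[:, :n] has column norms equal to the square roots of the
# eigenvalues of Tc^T Tc; dividing by them normalizes each component.
X = (Tc @ Qt.T)[:, :n] / s[:n]

genre = rng.integers(0, 2, size=(40, 5)).astype(float)  # binary movie category
side = np.concatenate([X, genre], axis=1)               # final item side info
```

The resulting `side` matrix has one short dense row per movie, small enough to satisfy the $k > d$ constraint.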
Table I: RMSE on the MovieLens and Douban datasets (± 95% confidence intervals; CFN++ denotes CFN with side information).

Algorithms | MovieLens-1M | MovieLens-10M | MovieLens-20M | Douban
BPMF | 0.8705 ± 4.3e-3 | 0.8213 ± 6.5e-4 | 0.8123 ± 3.5e-4 | 0.7133 ± 3.0e-4
ALS-WR | 0.8433 ± 1.8e-3 | 0.7830 ± 1.9e-4 | 0.7746 ± 2.7e-4 | 0.7010 ± 3.2e-4
SVDFeature | 0.8631 ± 2.5e-3 | 0.7907 ± 8.4e-4 | 0.7852 ± 5.4e-4 | *
LLORMA | 0.8371 ± 2.4e-3 | 0.7949 ± 2.3e-4 | 0.7843 ± 3.2e-4 | 0.6968 ± 2.7e-4
I-Autorec | 0.8305 ± 2.8e-3 | 0.7831 ± 2.4e-4 | 0.7742 ± 4.4e-4 | 0.6945 ± 3.1e-4
U-CFN | 0.8574 ± 2.4e-3 | 0.7954 ± 7.4e-4 | 0.7856 ± 1.4e-4 | 0.7049 ± 2.2e-4
U-CFN++ | 0.8572 ± 1.6e-3 | N/A | N/A | 0.7050 ± 1.2e-4
V-CFN | 0.8388 ± 2.5e-3 | 0.7767 ± 5.4e-4 | 0.7663 ± 2.9e-4 | 0.6911 ± 3.2e-4
V-CFN++ | 0.8377 ± 1.8e-3 | 0.7754 ± 6.3e-4 | 0.7652 ± 2.3e-4 | N/A
5.3 Error Function
We measure the prediction accuracy by means of the Root Mean Square Error (RMSE). Denoting $R^{\text{test}}$ the matrix of test ratings and $\hat{R}$ the full matrix returned by the learning algorithm, the RMSE is:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{(i,j) \in \mathcal{K}(R^{\text{test}})} \left(r^{\text{test}}_{ij} - \hat{r}_{ij}\right)^2},$$

where $n$ is the number of ratings in the testing dataset. Note that, in the case of Autoencoders, $\hat{R}$ is computed by feeding the network with training data. As such, $\hat{r}_{ij}$ stands for $\mathrm{nn}(u_i)_j$ for U-CFN, and for $\mathrm{nn}(v_j)_i$ for V-CFN.
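Restricted to the test indices only, the metric reads as follows in NumPy (toy data of our choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
R_hat = rng.uniform(-1, 1, size=(6, 5))     # dense matrix returned by the model
R_test = np.full((6, 5), np.nan)            # sparse test ratings, NaN = missing
R_test[[0, 2, 3], [1, 4, 0]] = [0.5, -0.5, 1.0]

known = ~np.isnan(R_test)                   # test indices K(R_test)
n = known.sum()                             # number of test ratings
rmse = np.sqrt(np.sum((R_test[known] - R_hat[known]) ** 2) / n)
```

Only the observed test entries contribute; the rest of the dense prediction matrix is ignored by the metric.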
5.4 Training Settings
We train 2-layer Autoencoders for the MovieLens-1/10/20M and Douban datasets. The sizes of the hidden layers are selected by the hyperparameter search described in the Appendix (bottleneck sizes between 500 and 700 neurons). Weights are initialized using the fan-in rule [LeCun1998]. Transfer functions are hyperbolic tangents. The Neural Network is optimized with stochastic backpropagation with minibatches of size 30, and a weight decay is added for regularization. Hyperparameters (provided with the source code) are tuned by a genetic algorithm already used by [Mary2007] in a different context.




5.5 Results
Comparison to the state of the art. Table I displays the RMSE on the MovieLens and Douban datasets. Reported results are computed through cross-validation and confidence intervals correspond to a 95% range. Except for the smallest dataset, V-CFN leads to the best results; V-CFN is competitive compared to state-of-the-art Collaborative Filtering approaches. To the best of our knowledge, the best published results on MovieLens-10M (training ratio of 90%/10% and no side information) are reported by [Li2016] and [Chen2015]. However, those two methods require recomputing the full matrix for every new rating. CFN has the key advantage of providing similar performance while being able to refine its prediction on the fly for new ratings. More generally, we are not aware of recent works that both manage to reach state-of-the-art results and successfully integrate side information. For instance, [Kim2014, Kumar2014] reported noticeably higher global RMSE on MovieLens-10M.
Note that V-CFN outperforms U-CFN. It suggests that the structure on the items is stronger than the one on the users, i.e. it is easier to guess tastes based on the movies you liked than to find users similar to you. Of course, the behaviour could be different on other datasets. The training evolution is shown in Figure 4.
Impact of side information. At first sight of Table I, the use of side information has a limited impact on the RMSE. This statement has to be mitigated: as the repartition of known entries in the dataset is not uniform, the estimates are biased towards users and items with a lot of ratings. For these users and movies, the dataset already contains a lot of information, thus extra information has only a marginal effect. Users and items with few ratings should benefit more from side information, but the estimation bias hides them.
In order to exhibit the utility of side information, we report in Table II the RMSE conditionally on the number of missing values for items. As expected, the fewer the ratings for an item, the more important the side information. This is very desirable for a real system: the effective use of side information for new items is crucial to deal with the flow of new products. A more careful analysis of the RMSE improvement in this setting shows that the improvement is uniformly distributed over the users, whatever their number of ratings. This corresponds to the fact that the available side information is only about items. To complete the picture, we train V-CFN on MovieLens-10M with either the movie genre or the matrix of tags, with a training ratio of 90%/10%. Each type of side information improves the global RMSE by 0.10%, while concatenating them improves the final score by 0.14%. Therefore, V-CFN handles heterogeneous side information well.
Impact of the loss. The impact of the denoising loss is highlighted in Table III: the bigger the dataset, the more useful the denoising loss. On the other hand, a network dealing with a smaller dataset such as MovieLens-1M may suffer from masked entries.
Impact of the nonlinearity. We train CFN with the nonlinearity removed to study its impact on the training. For fairness, we keep $\alpha$, $\beta$, the masking ratio and the number of hidden neurons constant. Furthermore, we search for the best learning rates and L2 regularization through the genetic algorithm. For MovieLens-10M, we obtain a final RMSE of 0.8151 ± 1.4e-3, which is far worse than classic CFN.
Impact of the training ratio. Last but not least, CFN remains very robust to a variation of data density, as shown in Figure 3. This is all the more impressive as the hyperparameters are first optimized for a training/testing ratio of 90%/10%. Cold-start and warm-start scenarios are also far better handled by Neural Networks than by more classic CF algorithms. These are highly valuable properties in an industrial context.
6 Remarks
6.1 Source code
Torch is a powerful framework written in Lua to quickly prototype Neural Networks. It is a widely used (Facebook, DeepMind) industry standard. However, Torch lacks some important basic tools to deal with sparse inputs. Thus, we developed several new modules implementing the DAE loss, the sparse DAE loss and sparse inputs on both CPU and GPU. They can easily be plugged into existing code. An out-of-the-box tutorial is available to run the experiments. The code is freely available on GitHub (https://github.com/fstrub95/Autoencoders_cf) and Luarocks (luarocks install nnsparse).
6.2 Scalability
One major problem that most Collaborative Filtering algorithms have to solve is scalability, since datasets often have hundreds of thousands of users and items. An efficient algorithm must be trained in a reasonable amount of time and provide quick feedback at evaluation time.
Recent advances in GPU computation have reduced the training time of Neural Networks by several orders of magnitude. However, Collaborative Filtering deals with sparse data, and GPUs are designed to perform well on dense data. [Salakhutdinov2007, Sedhain2015] face this sparsity constraint by building small dense Networks with shared weights. Yet, this approach may lead to important synchronisation latencies. In our case, we tackle the issue by selectively densifying the inputs just before sending them to the GPU cores, without modifying the result of the computation. It introduces an overhead on the computational complexity, but this implementation allows the GPUs to work at full strength. In practice, the gain on vector operations outweighs the extra cost. Such an approach is an efficient strategy to handle sparse data, achieving a balance between memory footprint and computational time. We are able to train large Neural Networks within a few minutes, as shown in Table IV. For purposes of comparison, on MovieLens-20M with a 16-thread 2.7GHz Core processor, ALS-WR (r=20) computes the final matrix within half an hour with close results, SVDFeature (r=64) requires a few hours, BPMF (r=10) and I-Autorec (r=600) require half a day, ALS-WR (r=200) a full day, and LLORMA (r=20*30) needs several days with the Prea library. At the time of writing, alternative strategies to train networks with sparse inputs on GPUs are under development. Although one may complain that CFN benefits from GPUs, no other algorithm (except ALS-WR) can easily be parallelized on such devices. We believe that algorithms that natively work on GPU are auspicious in light of the progress achieved on GPUs.
Table IV: Training time and memory footprint of CFN.

Dataset | CFN | #Param | Time | Memory
MLens-1M | V | 8M | 2m03s | 250MiB
MLens-10M | V | 100M | 18m34s | 1,532MiB
MLens-20M | V | 194M | 34m45s | 2,905MiB
MLens-1M | U | 5M | 7m17s | 262MiB
MLens-10M | U | 15M | 34m51s | 543MiB
MLens-20M | U | 38M | 59m35s | 1,044MiB
6.3 Future Works
Implicit feedback may greatly enhance the quality of Collaborative Filtering algorithms [Koren2009, Rendle2010]. For instance, implicit feedback could be incorporated into CFN by feeding the Network with an additional binary input. By doing so, [Salakhutdinov2007] enhance the quality of prediction of Restricted Boltzmann Machines on the Netflix dataset. Additionally, Content-Based techniques with Deep Learning such as [Van2013, Wang2014b] could be plugged into CFN. The idea is to train a joint Network that directly links raw item features, such as music, pictures or word representations, to the ratings. On a different topic, V-CFN and U-CFN sometimes report different errors; this is more likely to happen when they are fed with side information. One interesting direction would be a suitable Network that mixes both of them. Finally, other metrics exist to estimate the quality of Collaborative Filtering under other real-world constraints: Normalized Discounted Cumulative Gain [Jarvelin2002] or F-score are sometimes preferred to RMSE and should be benchmarked.
7 Conclusion
In this paper, we have introduced a Neural Network architecture, dubbed CFN, to perform Collaborative Filtering with side information. Contrary to other attempts with Neural Networks, this joint Network integrates side information and learns a nonlinear representation of users or items in a unique Neural Network. This approach beats state-of-the-art results in CF on both the MovieLens and Douban datasets, and performs excellently in both cold-start and warm-start scenarios. CFN also has valuable assets for industry: it is scalable, robust, and successfully deals with large datasets. Finally, reusable source code is provided in Torch, along with the hyperparameters needed to reproduce the results.
Acknowledgements
The authors would like to acknowledge the stimulating environment provided by the SequeL research group, Inria and CRIStAL. This work was supported by the French Ministry of Higher Education and Research, by CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020, the CHIST-ERA project IGLU and by FUI Hermès. Experiments were carried out using the Grid'5000 testbed, supported by Inria, CNRS, RENATER and several universities as well as other organizations.
References
8 Appendix
8.1 Genetic Algorithm
We use the following genetic algorithm [Mary2007] to find the hyperparameters of our model. The crossover of two individuals gives birth to two new individuals. The mutation of one individual is obtained by using an isotropic Gaussian law with its mean centred on the current values of the parameters and a standard deviation depending on the number of individuals and the dimension of the space. Let the sizes of the four groups below sum to the size of the population. Once an initial population of individuals is created, we proceed as follows at each iteration:

We copy the best individuals;

We apply the crossover rule to the next best individuals, paired with individuals randomly picked among the copied best;

We mutate individuals randomly picked among the copied best;

We generate new random individuals.
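A compact sketch of such a search loop is given below; the fitness function, the group sizes and the crossover/mutation rules are of our own choosing for illustration, not the authors' exact operators:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(h):              # toy objective standing in for validation RMSE
    return -np.sum((h - 0.3) ** 2)

P, D = 20, 4                 # population size, number of hyperparameters
n_keep, n_cross = 4, 6       # group sizes; the four groups must sum to P
pop = rng.random((P, D))
f0 = max(fitness(h) for h in pop)

for _ in range(30):
    pop = pop[np.argsort([fitness(h) for h in pop])[::-1]]  # sort, best first
    elites = pop[:n_keep].copy()                            # 1. copy the best
    mates = elites[rng.integers(0, n_keep, n_cross)]
    cross = 0.5 * (pop[n_keep:n_keep + n_cross] + mates)    # 2. crossover
    mutated = elites[rng.integers(0, n_keep, n_cross)] \
        + rng.normal(0.0, 1.0 / np.sqrt(P * D), (n_cross, D))  # 3. mutation
    fresh = rng.random((P - n_keep - 2 * n_cross, D))       # 4. new individuals
    pop = np.vstack([elites, cross, mutated, fresh])

best = max(pop, key=fitness)
```

Because the elites are carried over unchanged, the best fitness found never decreases from one generation to the next.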
CFN hyperparameters | Probabilistic law
α | U[0.8,1.2]
β | U[0,1]
masking ratio | U[0,1]
bottleneck size | U[500,700]
learning rate | U[0,0.5]
learning rate decay | U[0,0.5]
weight decay | U[0,0.5]