Collaborative Distillation for Top-N Recommendation
Abstract
Knowledge distillation (KD) is a well-known method for reducing inference latency by compressing a cumbersome teacher model into a small student model. Despite the success of KD in classification tasks, applying KD to recommender models is challenging due to the sparsity of positive feedback, the ambiguity of missing feedback, and the ranking problem associated with top-N recommendation. To address these issues, we propose a new KD model for the collaborative filtering approach, namely collaborative distillation (CD). Specifically, (1) we reformulate the loss function to deal with the ambiguity of missing feedback; (2) we exploit probabilistic rank-aware sampling for top-N recommendation; and (3) to train the proposed model effectively, we develop two training strategies for the student model, called the teacher-guided and student-guided training methods, which select the most useful feedback from the teacher model. Experimental results demonstrate that the proposed model outperforms the state-of-the-art method by 2.7–33.2% and 2.7–29.1% in hit rate (HR) and normalized discounted cumulative gain (NDCG), respectively. Moreover, the proposed model achieves performance comparable to that of the teacher model.
I Introduction
Neural recommender models [18, 24, 11, 6, 29, 5, 22, 27, 13] have achieved better performance than conventional latent factor models either by capturing nonlinear and complex correlation patterns among users/items, or by leveraging the hidden features extracted from auxiliary information such as texts and images. However, the number of parameters in neural models is greater than that of conventional models by one or more orders of magnitude. This implies a trade-off between accuracy and efficiency: neural recommender models usually suffer from higher latency during the inference phase.
Our primary goal is to develop a recommender model that achieves a balance between effectiveness and efficiency. In this paper, we employ knowledge distillation (KD) [8], a network compression technique that transfers the distilled knowledge of a large model (a.k.a. a teacher model) to a small model (a.k.a. a student model). Because the student model is small, it is computationally efficient and uses little memory, while the knowledge transferred from the teacher model helps it retain accuracy. Therefore, it is capable of achieving a balance between effectiveness and efficiency.
Specifically, the training procedure for KD consists of two steps. In the offline training phase, the teacher model is first trained on a labeled training dataset. Then, the student model is trained to optimize two objectives: matching its prediction to the label of each training sample (i.e., a hard target) and matching its label distribution to that of the teacher model (i.e., a soft target). In the inference phase, only the student model is used. Because the teacher model possesses greater modeling power than the student model, the soft target serves as useful additional information for training the student model. Consequently, a student model trained with KD can perform better than the same model trained only on the training set.
Despite the significant success of KD in classification tasks, incorporating it into recommender models is non-trivial. More concretely, applying KD to recommender models involves several challenges: (1) Implicit user feedback is extremely sparse. (2) As users only provide positive feedback in implicit datasets, there is inherent ambiguity regarding unknown (or missing) feedback. That is, unknown feedback can be either unlabeled positive or negative feedback. Such characteristics naturally require us to distinguish positive/negative feedback from unknown feedback. (3) Because only a few top-ranked items are of interest in top-N recommendation, we should weight items by importance according to their rankings.
Recently, Tang and Wang [23] proposed a KD model to address the ranking problem, called rank distillation (RD). RD uses only a few items with the highest rankings in the label distribution learned from the teacher model and converts them into positive feedback. In this sense, RD regards the knowledge transferred from the teacher model as augmented positive feedback, which helps alleviate the data sparsity problem associated with top-N recommendation.
Although RD improves the prediction accuracy of the student model, it is suboptimal because some vital information in the soft target is ignored. First, RD manipulates the soft target only to generate additional positive feedback from the highest-ranked items. The key intuition of KD is that various correlations among items can provide additional information. In this regard, manipulating the soft target can distort the meaningful correlation patterns among items. Second, RD simply discards negative feedback with low rankings in the soft target. Removing low-ranked items from the soft target can make the process blind to negative user feedback. Therefore, both strategies in RD run counter to the original idea of KD, as they do not maintain the correlations among the items revealed in the soft target.
In this paper, we propose a new knowledge distillation model for collaborative filtering (CF), namely collaborative distillation (CD). Our model enjoys the advantages of both KD and RD. Specifically, the novelty of our model comes from the following aspects.
Reformulating a loss function for CF. We design the CF model by revisiting the ambiguity of data representation. To resolve this issue, we propose a simple but improved CF loss function that only accounts for positive feedback. That is, unknown feedback is explicitly removed from the CF loss function. We claim that the presence of unknown feedback in the CF loss function can bias the prediction of ratings. As common implicit data representations treat unknown feedback as zero at all times, their predictions lean toward zero. By excluding unknown feedback from the CF loss function, we can prevent the prediction bias, thereby improving the overall performance of the student model.
Probabilistic rank-aware sampling. Our model is influenced by the idea of RD, treating items differently based on their rankings. In the ranking problem, the higher-ranked items are more important because they are potential inclusions in top-N recommendation. Therefore, we sample items in the soft target according to their rankings; the higher the ranking, the more often the item is sampled. Because we sample both high-ranked and low-ranked items in a probabilistic manner, our model can learn both positive and negative correlations among items. Thus, we retain the advantage of RD, which considers the ranking order of items, while overcoming its disadvantage of ignoring negative feedback among items. Besides, we rigorously preserve the idea of KD in the proposed model, because the probabilities in the soft target are used without manipulation. This enables us to fully exploit the correlations of items in the soft target. We believe that understanding the hidden correlations of items is crucial to overcoming the data sparsity and ambiguity problems.
Two training tactics in the student model. Lastly, we develop two training tactics for the student model, called the teacher-guided and student-guided methods. The teacher-guided method simply provides the soft target to the student model, as in conventional KD. In contrast, the student-guided method actively requests the useful items in the soft target from the teacher model by considering the training status of the student model. In other words, the student-guided method trains the student model by dynamically reflecting its status.
We conduct extensive experiments on four benchmark datasets: Amazon Music (AMusic), MovieLens 100K (ML100K), Yelp, and Gowalla. The experimental results demonstrate that the proposed model significantly outperforms the state-of-the-art model (i.e., RD). Furthermore, the performance of the proposed model is comparable to that of the teacher model.
II Preliminaries
In this section, we first introduce the basic notations and formulate the top-N recommendation problem. Then, we explain the concept of knowledge distillation (KD) [8] and present rank distillation (RD) [23], which applies knowledge distillation to recommender models.
Problem statement. For a set of $m$ users $U = \{u_1, \ldots, u_m\}$ and a set of $n$ items $I = \{i_1, \ldots, i_n\}$, we are given a user-item matrix $\mathbf{R} \in \{0, 1\}^{m \times n}$, where $r_{ui}$ is the implicit user feedback represented by a binary (i.e., positive/negative) value assigned by user $u$ to item $i$. If $r_{ui} = 1$, it indicates known (or observed) feedback, implying positive user experience. Otherwise (i.e., $r_{ui} = 0$), it indicates missing (or unobserved) feedback, implying a mixture of unlabeled positive/negative preferences. Such ambiguity has been explicitly discussed in one-class collaborative filtering (OCCF) [9, 19, 15, 16, 14, 12, 34, 30]. Given user $u$, $I_u^+$ is the set of items with known positive feedback, and $I_u^-$ is the set of items with missing feedback.
Our goal is to find a ranked list of the top-N items from implicit user feedback. Given user $u$, we need to rank the items (i.e., $I_u^-$) according to their unknown preference scores. To achieve this goal, we define a ranking model $M$ with a set of model parameters $\theta$ and compute a predicted preference score $\hat{r}_{ui}$ for each user $u$ and for each item $i$.
The ranking loss function can be categorized into three cases: pointwise, pairwise, and listwise. In this paper, we focus on the pointwise loss, which is usually defined by the negative log likelihood of binary preference scores.
$$ \mathcal{L}(\theta) = -\sum_{u \in U} \sum_{i \in I} \left( r_{ui} \log \hat{p}_{ui} + (1 - r_{ui}) \log (1 - \hat{p}_{ui}) \right) \tag{1} $$
where $\hat{p}_{ui}$ represents the probability of item $i$ being preferred by user $u$.
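As a concrete illustration of the pointwise negative log likelihood, the following minimal sketch evaluates it for a single user. The function name `pointwise_nll`, the clamping constant `eps`, and the toy values are assumptions made for illustration and are not taken from the paper.

```python
import math


def pointwise_nll(labels, probs, eps=1e-12):
    """Negative log-likelihood of binary feedback under predicted probabilities.

    labels: 0/1 feedback values (1 = observed, 0 = missing treated as negative).
    probs:  predicted probabilities that each item is preferred.
    eps:    hypothetical clamping constant that keeps log() finite.
    """
    total = 0.0
    for r, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= r * math.log(p) + (1 - r) * math.log(1.0 - p)
    return total


# One user, four items: the first two are observed, the last two are missing.
labels = [1, 1, 0, 0]
probs = [0.9, 0.6, 0.2, 0.1]
loss = pointwise_nll(labels, probs)
```

Note that every missing entry contributes a term that pushes its prediction toward zero, which is the behavior the later sections revisit.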
Knowledge distillation (KD). This is a model-independent knowledge transfer framework designed to deliver the knowledge extracted from a complex teacher model to a simple student model. Many existing studies [32, 25, 28, 26, 21, 17, 31, 7, 33] have utilized KD to compress deep neural networks as well as to achieve stable performance by distilling the knowledge from a teacher model. In this process, KD makes use of a logit, called a soft target, which encodes the result values at the last layer before passing to the final activation layer. The success of KD can be attributed to the exploitation of hidden inter-class correlations in the soft target. Because the soft target reveals rich information, i.e., positive/negative correlations among items, the student model influenced by soft targets performs better than the same model trained only with ground-truth labels, called a hard target.
The original KD and its variants are mainly developed in the context of the classification problem. They do not carry over directly to the top-N recommendation problem for two reasons. First, many recommender models focus on solving the ranking problem; they aim to find the top-N items that the user most prefers. Although both recommender and classification models represent labels as a binary vector, the significance of each value is entirely different. In the classification model, both 0 and 1 are treated equally. Meanwhile, because the top-N recommendation model determines high-ranked items, it needs to place more weight on the responses labeled 1.
Second, the recommender model needs to handle the ambiguity of missing feedback. Unlike the classification problem, missing feedback in the top-N recommendation problem can be either truly negative or hidden positive responses (i.e., preferred but unknown). When the teacher model regards all missing feedback as negative labels, the soft targets may be contaminated and fail to provide informative correlations between items. Consequently, the student model using the soft target may have worse performance than the original student model.
Rank distillation (RD). Tang and Wang [23] proposed ranking distillation (RD), which applies KD to ranking models. As depicted in Fig. 1, RD minimizes two losses: a ranking loss with respect to the ground-truth ranking in the training dataset and a distillation loss with respect to the top-K ranking of unlabeled items.
(2) 
Here, a hyperparameter balances the two losses, and the teacher and student models each have their own set of model parameters. One term is the ranking loss function for CF models, and the other is the distillation loss function that guides the student model. For both ranking and distillation losses, RD employs the cross-entropy function using the negative log likelihood.
(3) 
(4) 
Here, the soft target gives the preference probability of user $u$ for item $i$, the distillation loss ranges over the top-K items with missing feedback predicted by the teacher model, and each such item is assigned an importance weight.
Although RD addresses the ranking problem, it still suffers from several limitations. First, RD regards all missing feedback as negative feedback in the CF loss function in Equation (3). Second, RD only makes use of the top-K ranked items in a deterministic manner and simply ignores the other items with missing feedback. As a result, the KD loss function in Equation (4) includes no negative feedback. Lastly, RD converts real-valued soft targets into positive feedback. The real-valued scores lie between 0 and 1 and encode relative user preferences for items. Unfortunately, because RD quantizes all top-K ranked items to positive feedback, it loses the finer degrees of positive correlations among items learned from the teacher model.
III Proposed Model
In this section, we propose a new knowledge distillation model for CF, namely collaborative distillation (CD) (Section III-A). Specifically, we first formulate a new CF loss function to address the ambiguity of missing feedback (Section III-B). Second, we design a probabilistic rank-aware sampling method to choose the items with unknown feedback (Section III-C). Lastly, we develop two training strategies for the student model, called the teacher-guided and student-guided training strategies, to select the most beneficial label distributions from the teacher model (Section III-D).
III-A Overview
Fig. 2 describes the overall procedure of our proposed model. Similar to the original KD, the training procedure of our model consists of two steps. First, we train a large teacher model using a training dataset and its hard labels. The teacher model can be either a single CF model or an ensemble model combining multiple CF models. Once the teacher model is trained, we obtain its predicted preference scores for all unrated items. Then, we utilize both hard and soft targets to train a small student model.
In this process, we face the following challenges. In implicit datasets, user preferences are expressed in the form of positive/negative feedback. However, due to the nature of implicit data, we only observe sparse positive feedback. Besides, unlike the classification problem, top-N recommendation is closely associated with the ranking problem. Therefore, the presence of a few top-N items among a block of unknown feedback should be further investigated.
To overcome these challenges, we first suggest an improved CF loss function, which considers the uncertainty in the data representation of hard labels. Moreover, we utilize KD as a robust framework to address the ambiguity of unknown feedback. To this end, we extract two valuable pieces of information from the soft target of the teacher model: the positive/negative correlations among items and the candidates for high-ranked items. Our competitor, RD, manipulates the soft target to extract the rankings of items and uses them to define its KD loss function. Because the ranking should be reflected in the formulation of CF, we place greater weight on higher-ranked items and thus introduce a probabilistic rank-aware sampling method. Note that RD has the drawback of sacrificing the precision of positive correlations and ignoring negative correlations between items. Unlike RD, we allow the ranking of items to be reflected in KD while utilizing both positive and negative correlations for the supervision of the student model.
Based on this idea, we reformulate the loss function for our model as a combination of the CF loss function and the KD loss function.
(5) 
In the following sections, we explain the details of each loss function along with their design principle.
III-B Collaborative Filtering Loss
We design an improved collaborative filtering loss function to overcome the uncertainty of implicit data representation in CF. Both traditional KD and RD compute the loss from hard labels using the cross-entropy between the predicted label distribution and the hard label distribution. This method treats all values of the label distribution as equally important. However, positive feedback (i.e., 1) indicates a preference, whereas negative feedback (i.e., 0) can be either 0 or 1. We argue that, in terms of confidence, the weight for 1 should be much greater than the weight for 0. Besides, treating all values equally in the cross-entropy loss function biases the model prediction toward 0. To prevent this bias in prediction, we devise a selective cross-entropy loss function, which only matches items corresponding to 1.
$$ \mathcal{L}_{CF} = -\sum_{u \in U} \sum_{i \in I_u^+} \log \hat{p}_{ui} \tag{6} $$
where $I_u^+$ is the set of items with known positive feedback from user $u$ and $\hat{p}_{ui}$ is the student model's predicted probability that user $u$ prefers item $i$.
One might ask whether the method of predicting only 1 can be applied to the training of the teacher model. Unfortunately, it always results in the prediction being 1, which leads to a new bias. Thus, it is undesirable to apply the same strategy to the teacher model. Meanwhile, since the KD loss function provides positive/negative feedback for the training of the student model, the prediction is no longer biased under the proposed CF loss function. As a result, without concern about training instability, the proposed CF loss function can provide more accurate feedback and eliminate the ambiguity of unknown feedback.
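The contrast between a cross-entropy over all entries and one restricted to observed positives can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `selective_cf_loss` and `full_cf_loss`, the clamping constant `eps`, and the toy numbers are hypothetical and not taken from the paper.

```python
import math


def selective_cf_loss(labels, probs, eps=1e-12):
    """Cross-entropy restricted to observed positive feedback (label 1)."""
    return -sum(math.log(max(p, eps)) for r, p in zip(labels, probs) if r == 1)


def full_cf_loss(labels, probs, eps=1e-12):
    """Cross-entropy over all entries, treating missing feedback as label 0."""
    total = 0.0
    for r, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= r * math.log(p) + (1 - r) * math.log(1.0 - p)
    return total


labels = [1, 0, 0]  # one observed item, two items with missing feedback
confident = [0.8, 0.9, 0.1]  # the model believes the second item is preferred
cautious = [0.8, 0.1, 0.1]   # the model pushes the second item toward zero

# The selective loss ignores predictions on missing entries ...
same = selective_cf_loss(labels, confident) == selective_cf_loss(labels, cautious)
# ... whereas the full loss penalizes a high score on a missing entry.
penalized = full_cf_loss(labels, confident) > full_cf_loss(labels, cautious)
```

In this toy case the full loss rewards pushing the unobserved item toward zero even if it is a hidden positive, while the selective loss leaves that decision to the distillation term.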
III-C Knowledge Distillation Loss
We devise a sampling-based KD loss function that not only distinguishes positive/negative feedback from missing feedback but also captures a user’s relative preferences between items.
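$$ \mathcal{L}_{KD} = -\sum_{u \in U} \sum_{i \in S_u} \left( q_{ui} \log \hat{p}_{ui} + (1 - q_{ui}) \log (1 - \hat{p}_{ui}) \right) \tag{7} $$
where $q_{ui}$ is a probability converted from the teacher model's logit, $\hat{p}_{ui}$ is the student model's predicted probability, and $S_u$ is an item set sampled from the items with missing feedback for user $u$. (We will explain how to compute $q_{ui}$ and how to identify $S_u$ later.) The proposed KD loss function differs from that of RD in Equation (4) in two aspects. First, it utilizes the original soft target just like the original KD. That is, it reflects both positive and negative correlations among items in the KD loss function. Second, the proposed loss function is computed over items drawn in a probabilistic manner.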
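A sampling-based distillation loss of this kind can be sketched as a cross-entropy between the teacher's soft probabilities and the student's predictions, evaluated only on the sampled items. The sketch below is illustrative: the name `kd_loss`, the dictionary-based data layout, and the toy values are assumptions, not the authors' implementation.

```python
import math


def kd_loss(teacher_probs, student_probs, sampled_items, eps=1e-12):
    """Cross-entropy between teacher soft targets and student predictions.

    teacher_probs / student_probs: dicts mapping item -> probability.
    sampled_items: items drawn from the user's unrated items.
    """
    loss = 0.0
    for item in sampled_items:
        q = teacher_probs[item]  # soft target, used without manipulation
        p = min(max(student_probs[item], eps), 1.0 - eps)
        loss -= q * math.log(p) + (1.0 - q) * math.log(1.0 - p)
    return loss


teacher = {"a": 0.9, "b": 0.4, "c": 0.05}
aligned = {"a": 0.9, "b": 0.4, "c": 0.05}   # student agrees with the teacher
uninformed = {"a": 0.5, "b": 0.5, "c": 0.5}  # student has learned nothing yet
sampled = ["a", "c"]  # contains a likely positive and a likely negative item

loss_aligned = kd_loss(teacher, aligned, sampled)
loss_uninformed = kd_loss(teacher, uninformed, sampled)
```

Because the soft target for item "c" is close to zero, the loss also pulls the student toward a low score there, which is how negative correlations enter the training signal.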
In this process, the sampling method is critical for the KD loss function. One baseline method is random sampling, which learns the soft target regardless of the target values. Random sampling helps to reflect a user’s relative preferences among different items. However, because it does not highlight the items with the highest rankings, it is inappropriate for top-N recommendation.
To explain our intuition, we first present several considerations: (1) Most of the items corresponding to unknown feedback represent negative preferences. Therefore, the randomly sampled items are likely to be biased toward negative preferences. (2) Although we can distinguish items with positive preferences from those with missing feedback, they should not be ranked higher than ones with known positive responses. Whereas the items with known positive feedback provide true positive experiences, the inferred positive feedback from the soft target might be incorrect or uncertain. (3) Items with positive scores in the soft target are likely to have positive correlations with the item for that soft target. Although the items with missing feedback might be uncertain, we believe that the soft target is still useful to capture relative preferences among items.
Based on these considerations, we develop a probabilistic rank-aware sampling method. The probability of sampling an item from the set of unrated items is determined by its ranking order among all unrated items, normalized by the total number of unrated items; this quantity defines the importance of the item.
(8) 
where is the relative ranking position of item in . That is, denotes the highest ranking position and denotes the lowest ranking position.
To draw a sample from unrated items, we investigate a rank-aware sampling function. First, we compute the sampling probability using a linear function of the relative ranking position. To implement rank-aware linear sampling efficiently, we employ a rejection-based sampling method using rankings.
(9) 
We can extend it to a nonlinear sampling function using an exponential function. With this adaptation, a few top-ranked items have a much higher probability of being sampled than the others, and the probabilities of the remaining items drop rapidly. This nonlinear sampling function is formulated by:
(10) 
where the hyperparameter controls the slope of the exponential function: the larger its value, the wider the gap between the top-ranked items and the remaining ones.
Algorithm 1 describes the pseudocode for the rank-aware linear sampling method. To support exponential sampling, we can modify the uniform sampling step (line 4). This sampling method is used in our proposed training strategies.
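A rejection-based rank-aware sampler can be sketched as follows. This is not a reproduction of Algorithm 1: the acceptance functions (`1 - rel` for the linear case and `exp(-rel / temperature)` for the exponential case), the parameter name `temperature`, and the function name `rank_aware_sample` are assumptions chosen to match the verbal description, in which higher-ranked items are accepted more often.

```python
import math
import random


def rank_aware_sample(scores, k, mode="linear", temperature=0.2, rng=None):
    """Draw k distinct unrated items, favoring higher-ranked ones.

    scores: dict mapping item -> predicted score for the unrated items.
    mode:   "linear" or "exp" (both acceptance rules are assumptions).
    """
    if mode not in ("linear", "exp"):
        raise ValueError("mode must be 'linear' or 'exp'")
    rng = rng or random.Random()
    ranked = sorted(scores, key=scores.get, reverse=True)  # position 0 = top rank
    n = len(ranked)
    if k > n:
        raise ValueError("cannot sample more items than are available")
    chosen = set()
    while len(chosen) < k:
        pos = rng.randrange(n)  # uniform proposal over ranking positions
        rel = pos / n           # relative ranking position in [0, 1)
        if mode == "linear":
            accept = 1.0 - rel                     # decreases linearly with rank
        else:
            accept = math.exp(-rel / temperature)  # decreases exponentially
        if rng.random() < accept:                  # rejection step
            chosen.add(ranked[pos])
    return chosen


# Toy example: 100 unrated items whose scores decrease with the index.
toy_scores = {f"item{j}": 100 - j for j in range(100)}
sample = rank_aware_sample(toy_scores, 10, rng=random.Random(0))
```

Every position keeps a non-zero acceptance probability, so low-ranked items are still sampled occasionally, which is what allows negative correlations to be learned.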
Temperature in the KD loss. One key factor of the original KD [8] is finding a proper balance between the soft targets and hard labels. To tackle this issue, [8] introduces the notion of a temperature $T$. Although the soft target is a useful resource for educating the student model, its distribution is often too sharp. In this case, because the relative correlation among items is not highlighted as much, the impact of KD is less significant.
When a softmax layer is the final output activation layer, it converts the logit $z_i$ to a probability $q_i$ weighted by the temperature.
$$ q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \tag{11} $$
where a higher temperature $T$ yields a smoother output probability distribution.
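The smoothing effect of the temperature on a softmax output can be checked numerically with a short sketch. The function name `softmax_with_temperature` and the example logits are illustrative assumptions.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = smoother)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp((z - peak) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
smooth = softmax_with_temperature(logits, temperature=5.0)
```

With the higher temperature, the largest probability shrinks and the smallest grows, while the ordering of the items is preserved.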
In this paper, we choose the pointwise approach for defining the user preference of items. Pointwise preferences are computed by a logistic function, which is a particular case of the softmax function. The logistic function maps a real-valued score to the probability that an item is preferred ($r_{ui} = 1$) as follows.
$$ p(r_{ui} = 1 \mid z_{ui}) = \sigma(z_{ui}) = \frac{1}{1 + \exp(-z_{ui})} \tag{12} $$
where $z_{ui}$ is the real-valued logit of user $u$ for item $i$.
Whereas the classification algorithm produces class probabilities using the softmax output layer, our problem can be regarded as a binary classification problem for each item. Then, the temperature for the logistic function is adopted using logits.
(13) 
where and are the parameters for the temperature. controls the scale and controls the shift of . Although a more advanced function for temperature can be employed, we choose a relatively simple form of and . We leave the formulation of various temperature functions as our future work.
III-D Interactive Training Tactics
We investigate two training tactics for the student model: teacher-guided training and student-guided training. First, teacher-guided training is the process of learning from the teacher model, analogous to conventional KD training. The teacher model delivers the soft target to the student model without considering the training status of the student model. Then, the student model passively learns from those soft targets.
In contrast, student-guided training takes into account requests from the student model during training. That is, the student model examines its own soft targets during training and draws several items according to the probabilistic rank-aware sampling method. Then, the student model asks the teacher model for the predictions (i.e., preference scores) of those items. In this way, the student model can instantly update the feedback of the high-ranked items from the teacher model. This idea is inspired by an interactive Q&A session in the classroom, where students learn more efficiently by asking their teachers questions. One may argue that the student model might not create meaningful questions during the early stage of its training. However, even naive questions are still better than random ones. Besides, student-guided training is advantageous because it helps to find an effective path for the training of the student model. We believe that this interactive training tactic helps model training converge faster and reach an improved solution via a better update path.
Note that the proposed training tactics are used together with our sampling method described in Section III-C. Student-guided training remains applicable with other sampling methods as well: the student model can draw items by any sampling method and then request feedback on those items from the teacher model. The process of each training tactic is as follows:
Teacher-guided training. We choose the soft target from the ranking order in the teacher model. Therefore, the teacher model selects the sampled items without the intervention of the student model. Without any interaction between the two models, the student model learns the items selected by the teacher model.
Student-guided training. We choose the items from the soft target of the student model using the probabilistic rank-aware sampling method, as depicted in Fig. 3. The student model dynamically selects the sampled items. Specifically, during the training of the student model, the soft target of the student model is analyzed. At each training step, some items are drawn to represent the soft target of the student model. Then, the student model asks the teacher model for predictions corresponding to those selected items. Based on this interaction, the KD loss function is updated. In Algorithm 1, we can replace the teacher model with the student model (line 1) to consider the interactive results between the teacher model and the student model at each training step.
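The difference between the two tactics comes down to whose ranking drives the sampling, while the soft targets always come from the teacher. The sketch below illustrates this under stated assumptions: `sample_by_rank` is a simple stand-in sampler (linear rank weights, without replacement) rather than Algorithm 1, and the names `soft_targets_for_step` and `guided_by` are hypothetical.

```python
import random


def sample_by_rank(scores, k, rng):
    """Draw k distinct items with probability decreasing linearly in rank."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    chosen = []
    while len(chosen) < k and ranked:
        weights = [len(ranked) - pos for pos in range(len(ranked))]
        item = rng.choices(ranked, weights=weights, k=1)[0]
        ranked.remove(item)  # sample without replacement
        chosen.append(item)
    return chosen


def soft_targets_for_step(teacher_scores, student_scores, k, guided_by, rng):
    """Pick items by the guiding model's ranking; return the teacher's scores."""
    if guided_by == "teacher":
        ranking_source = teacher_scores  # teacher-guided: fixed teacher ranking
    elif guided_by == "student":
        ranking_source = student_scores  # student-guided: current student ranking
    else:
        raise ValueError("guided_by must be 'teacher' or 'student'")
    items = sample_by_rank(ranking_source, k, rng)
    return {item: teacher_scores[item] for item in items}


teacher_scores = {"a": 0.9, "b": 0.7, "c": 0.2, "d": 0.1}
student_scores = {"a": 0.3, "b": 0.8, "c": 0.6, "d": 0.2}
rng = random.Random(1)
tg_targets = soft_targets_for_step(teacher_scores, student_scores, 2, "teacher", rng)
sg_targets = soft_targets_for_step(teacher_scores, student_scores, 2, "student", rng)
```

In a training loop, the student-guided call would be repeated at every step with the student's latest scores, so the sampled items track the student's current state.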
IV Experiments
IV-A Experimental Setup
Datasets. We performed extensive experiments on four public benchmark datasets: Amazon Music (AMusic; http://jmcauley.ucsd.edu/data/amazon/), MovieLens 100K (ML100K; https://grouplens.org/datasets/movielens/), Yelp (https://github.com/hexiangnan/sigir16eals), and Gowalla (http://dawenl.github.io/data/gowalla_pro.zip). We converted all ratings to a binary representation: either a user has experienced an item positively or has not. These four datasets were selected to span various degrees of data sparsity. Treating all observed feedback as positive feedback, our goal was to identify the top-N recommendation for implicit datasets. As preprocessing, we filtered out the users who had fewer than 10 ratings and the items that were rated by fewer than 5 users. Table I reports the detailed statistics of these datasets.
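The binarization and filtering steps can be sketched as below. The text does not state the order of the two filters or whether they are repeated, so this sketch assumes a single pass that filters items first and users second; the function name `binarize_and_filter` and the tuple layout are likewise assumptions.

```python
from collections import Counter


def binarize_and_filter(ratings, min_user=10, min_item=5):
    """Turn (user, item, rating) triples into filtered positive (user, item) pairs.

    Assumption: one pass, items filtered before users.
    """
    # Any observed rating counts as positive implicit feedback.
    pairs = {(user, item) for user, item, _rating in ratings}
    item_counts = Counter(item for _user, item in pairs)
    pairs = {(u, i) for u, i in pairs if item_counts[i] >= min_item}
    user_counts = Counter(user for user, _item in pairs)
    return {(u, i) for u, i in pairs if user_counts[u] >= min_user}


# Toy example with small thresholds so the effect is visible.
toy_ratings = [
    ("a", "x", 5), ("a", "y", 3),
    ("b", "x", 4), ("b", "y", 1),
    ("c", "x", 2), ("c", "z", 5),
]
kept = binarize_and_filter(toy_ratings, min_user=2, min_item=2)
```

In the toy data, item "z" is dropped for having a single rating, after which user "c" falls below the user threshold and is dropped as well.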
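Table I: Statistics of the datasets.

| Dataset | # of users | # of items | # of interactions | Sparsity | Interactions per user (min/max/avg.) | Interactions per item (min/max/avg.) |
| --- | --- | --- | --- | --- | --- | --- |
| AMusic | 2,831 | 13,410 | 63,054 | 99.83% | 10/714/22.27 | 1/155/4.70 |
| ML100K | 943 | 1,682 | 100,000 | 93.70% | 20/737/106.04 | 1/583/59.45 |
| Yelp | 9,788 | 25,373 | 489,820 | 99.80% | 20/1024/50.04 | 1/674/19.30 |
| Gowalla | 13,149 | 14,009 | 535,650 | 99.71% | 15/764/40.73 | 15/1743/38.24 |

Table II: HR@50 and NDCG@50 of the teacher, student, and KD models for each base CF model. Gain (%) is the improvement of the proposed model over RD.

| Base model | Model | AMusic HR@50 | AMusic NDCG@50 | ML100K HR@50 | ML100K NDCG@50 | Yelp HR@50 | Yelp NDCG@50 | Gowalla HR@50 | Gowalla NDCG@50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CDAE | Teacher | 0.1727 | 0.0547 | 0.3917 | 0.1288 | 0.1150 | 0.0340 | 0.3057 | 0.1269 |
| CDAE | Student | 0.1217 | 0.0370 | 0.3565 | 0.1107 | 0.0956 | 0.0278 | 0.2632 | 0.1088 |
| CDAE | RD | 0.1275 | 0.0402 | 0.3578 | 0.1112 | 0.0949 | 0.0272 | 0.2638 | 0.1092 |
| CDAE | RD-Rank | 0.1238 | 0.0366 | 0.3580 | 0.1110 | 0.0915 | 0.0258 | 0.2602 | 0.1034 |
| CDAE | CD-Base | 0.1613 | 0.0498 | 0.3707 | 0.1124 | 0.1042 | 0.0304 | 0.2613 | 0.1093 |
| CDAE | CD-TG | 0.1653 | 0.0513 | 0.3805 | 0.1175 | 0.1060 | 0.0309 | 0.2682 | 0.1113 |
| CDAE | CD-SG | 0.1681 | 0.0519 | 0.3741 | 0.1182 | 0.1067 | 0.0313 | 0.2710 | 0.1122 |
| CDAE | Gain (%) | 31.8 | 29.1 | 6.3 | 6.3 | 12.4 | 15.1 | 2.7 | 2.7 |
| Caser | Teacher | 0.1366 | 0.0392 | 0.3145 | 0.0868 | 0.0947 | 0.0266 | 0.3005 | 0.1109 |
| Caser | Student | 0.0919 | 0.0276 | 0.2717 | 0.0730 | 0.0789 | 0.0220 | 0.2033 | 0.0768 |
| Caser | RD | 0.0936 | 0.0276 | 0.2732 | 0.0758 | 0.0814 | 0.0230 | 0.2358 | 0.0877 |
| Caser | RD-Rank | 0.0909 | 0.0271 | 0.2787 | 0.0774 | 0.0813 | 0.0232 | 0.2362 | 0.0880 |
| Caser | CD-Base | 0.1211 | 0.0355 | 0.3147 | 0.0872 | 0.0874 | 0.0244 | 0.2557 | 0.0943 |
| Caser | CD-TG | 0.1135 | 0.0336 | 0.3203 | 0.0879 | 0.0899 | 0.0249 | 0.2525 | 0.0904 |
| Caser | CD-SG | 0.1247 | 0.0351 | 0.3196 | 0.0891 | 0.0965 | 0.0269 | 0.2570 | 0.0925 |
| Caser | Gain (%) | 33.2 | 28.6 | 17.2 | 17.5 | 18.6 | 17.0 | 9.0 | 7.5 |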
Competitive models. Since RD [23] is the state-of-the-art KD model for top-N recommendation, we compare the proposed model with the original RD. Besides, we evaluated two baseline models, RD-Rank and CD-Base, which modify the sampling method of RD and CD, respectively. We also validate the effect of our rank-aware sampling in RD and CD.

CD-TG: This is our proposed model using teacher-guided model training.

CD-SG: This is our proposed model using student-guided model training.
To validate the proposed model, we chose two state-of-the-art recommender models: CDAE [27] and Caser [22]. (This paper focuses on top-N recommender models with pointwise preferences. We leave the evaluation of other models with pairwise preferences, e.g., NPR [13], to future work.) Although the teacher model can be an ensemble model combining multiple models, we focused on verifying our model in the simple single-model case. Finally, the same recommender model was used for both the teacher and the student, with high complexity for the teacher and low complexity for the student. Note that this setting is consistent with existing KD studies.
Evaluation protocol. We adopted leave-one-out evaluation [6, 29, 5]. Specifically, for each user, we held out the user-item interaction with the latest timestamp as the test data and used the remaining user-item interactions as training data. Unlike sampling-based evaluation [6, 29, 5], which randomly chooses 100 items from the set of unrated items, we used all unrated items as the candidate items. This evaluation protocol is more time-consuming but more thorough.
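The leave-one-out split can be sketched in a few lines. The function name `leave_one_out` and the `(user, item, timestamp)` tuple layout are assumptions made for illustration; ties in timestamps are not addressed in the text and are resolved here by keeping the first latest interaction encountered.

```python
def leave_one_out(interactions):
    """Hold out each user's latest interaction as the test item.

    interactions: list of (user, item, timestamp) tuples.
    Returns (train_pairs, test_items) where test_items maps user -> item.
    """
    latest = {}
    for user, item, ts in interactions:
        if user not in latest or ts > latest[user][1]:
            latest[user] = (item, ts)
    test_items = {user: item for user, (item, _ts) in latest.items()}
    train_pairs = [
        (user, item)
        for user, item, ts in interactions
        if (item, ts) != latest[user]
    ]
    return train_pairs, test_items


toy_log = [
    ("u1", "a", 1), ("u1", "b", 3), ("u1", "c", 2),
    ("u2", "a", 5), ("u2", "d", 4),
]
train_pairs, test_items = leave_one_out(toy_log)
```

At evaluation time, the held-out item is ranked against all items the user has not interacted with in the training data, rather than against a random subset.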
Evaluation metrics. We measured the accuracy of top-N recommendation with two metrics, hit rate (HR) and normalized discounted cumulative gain (NDCG), as done in existing studies [6, 29, 5]. The size of the ranked list N was set to 50 for both HR@N and NDCG@N. HR@N examines whether or not the test item is present in the top-N list, and NDCG@N places more weight on higher-ranked items than on others in the top-N list. For both metrics, a higher value indicates a more accurate result. Both metrics are averaged across all users.
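With a single held-out item per user, both metrics reduce to simple expressions. The sketch below uses the formulation common in leave-one-out evaluation, where the ideal DCG equals 1; that this is exactly the variant used in the experiments is an assumption, as are the function names.

```python
import math


def hr_at_n(ranked_items, test_item, n):
    """1.0 if the held-out item appears in the top-n list, else 0.0."""
    return 1.0 if test_item in ranked_items[:n] else 0.0


def ndcg_at_n(ranked_items, test_item, n):
    """Discounted gain of the held-out item; the ideal DCG is 1 for one item."""
    top = ranked_items[:n]
    if test_item not in top:
        return 0.0
    position = top.index(test_item)  # 0-based position in the ranked list
    return 1.0 / math.log2(position + 2)


ranking = ["b", "a", "d", "c"]
hit = hr_at_n(ranking, "a", n=2)
gain = ndcg_at_n(ranking, "a", n=2)
miss = hr_at_n(ranking, "c", n=2)
```

Averaging these per-user values over all users gives the reported HR@N and NDCG@N.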
Reproducibility. To ensure a fair evaluation, each hyperparameter and regularization term was fine-tuned and shared among all KD models. We randomly initialized model parameters using a Gaussian distribution. Specifically, each baseline CF model had the following hyperparameters.

CDAE [27]: The latent dimensions for the teacher and the student model were 100 and 10, respectively. We set the number of negative sampling items to be 0.5* and the denoising ratio as 0.1. We used the Adagrad optimizer with learning rate = 0.2, l2regularizer = 0.001, and batch size = 256.

Caser [22]: The latent dimensions for the teacher and the student model were 50 and 5, respectively. We set sequence length to be 5, target length to be 1, and the number of negative sampling items to be . We used the Adam optimizer with learning rate = 0.001, l2regularizer = 0.000001, dropout ratio = 0.5, and batch size = 512.
[Three figures, each with four panels: (a) AMusic (CDAE), (b) AMusic (Caser), (c) Yelp (CDAE), (d) Yelp (Caser)]
For all KD models, the hyperparameters and were controlled to properly reflect both CF and KD loss functions. For RD and RD-Rank, we used the public implementation of RD (https://github.com/graytowne/rank_distill). Also, most hyperparameters were set to their default values in the public implementation. Note that there is a difference between that appears in RD and CD. Specifically, we used the following parameter settings.

RD [23] and RDRank: We set to be 0.5. For CDAE, the number of items in the soft target was 15. For Caser, the number of items in the soft target was 10.

CDBase, CDTG and CDSG: We set to be 0.5. For CDAE, we set to be 2 and to be 1. For ML100k, we set sampling size to be 0.8*. For other datasets, we set K to be 0.5*. For Caser, we set to be 1, to be 0, and sampling size K to be 50.
Environments. We implemented our model and baseline models using TensorFlow 1.9.0 (CDAE) and PyTorch 1.0.0 (Caser). For Caser, we used the public PyTorch implementation (https://github.com/graytowne/caser_pytorch) provided in [23]. All experiments were conducted on a desktop with 128 GB memory and 2 Intel Xeon E5-2630 v4 processors (2.20 GHz, 25M cache), and all models were trained using 4 Nvidia GeForce GTX 1080Ti GPUs.
IV-B Experimental Results
Overall results. Table II reports the performance of several variants of our model (i.e., CDBase, CDTG and CDSG) and baseline models (i.e., RD and RDRank). On the four benchmark datasets, we compare the KD models with the two baseline CF models. Teacher and student models denote the baseline CF model with different numbers of parameters, trained without KD. The gain indicates the additional accuracy achieved by the proposed model over that of RD [23].
Based on this evaluation, we make several observations. Firstly, both CDTG and CDSG significantly outperform RD on all datasets. Note that the improvement gap for RD differs somewhat from that in [23] because we used leave-one-out evaluation, whereas [23] used cross-validation. Our models are consistently better than RD by 2.7–33.2% and 2.7–29.1% in HR and NDCG, respectively. CDBase also achieves better accuracy than RD. In this sense, our solution improves the CF loss function and helps boost the performance of top-N recommendation.
Secondly, CDTG and CDSG mostly achieve better accuracy than CDBase. This implies that the rank-aware sampling method is more appropriate for addressing the ranking problem. However, RDRank tends to be comparable to or slightly worse than RD. We conjecture that this is because RD possesses a prediction bias toward negative feedback in the CF loss function. Since selecting top items in RD inherently induces a bias toward positive feedback, the bias in the original CF loss function helps to mitigate the sampling bias in RD. For this reason, RDRank is not as effective as our models. Unlike in RD, the CF loss function in CD is not influenced by missing feedback, so our models do not need to compensate for a negative bias by introducing a positive bias. As a result, our sampling strategy in the KD loss function is useful for boosting the prediction accuracy.
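The contrast between selecting only the top items and sampling in a rank-aware manner can be illustrated with a short sketch. The exponential decay over teacher ranks, the `temperature` parameter, and the function names below are assumptions made for illustration; they are not taken from this section and should not be read as the exact sampling distribution of the proposed model.

```python
import math
import random


def top_k_items(teacher_scores, k):
    """Deterministically keep the k items the teacher scores highest."""
    ranked = sorted(teacher_scores, key=teacher_scores.get, reverse=True)
    return ranked[:k]


def rank_aware_sample(teacher_scores, k, temperature=10.0, rng=None):
    """Sample k distinct items, favoring items ranked highly by the teacher.

    Assumed form: an item at 0-based rank r is drawn with weight exp(-r / temperature).
    Sampling without replacement uses the Gumbel-top-k trick: add Gumbel noise
    to each log-weight and keep the k largest perturbed values.
    """
    rng = rng or random.Random()
    ranked = sorted(teacher_scores, key=teacher_scores.get, reverse=True)
    keyed = []
    for rank, item in enumerate(ranked):
        u = max(rng.random(), 1e-300)          # keep u strictly inside (0, 1)
        gumbel = -math.log(-math.log(u))       # standard Gumbel noise
        keyed.append((-rank / temperature + gumbel, item))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in keyed[:k]]
```

With a very small temperature the sampler collapses to top-k selection, while a larger temperature spreads the sampled items over lower ranks as well.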
Lastly, our models consistently show improvements for two CF models with different architectures: CDAE is based on an autoencoder for offline recommendation, whereas Caser is based on convolutional neural networks for session-based recommendation. In particular, our models are most effective on AMusic. This dataset is sparser than the other datasets, implying that our models effectively mitigate the data sparsity problem. These results suggest that our models can be applied to various CF models as model-agnostic solutions.
Effects of model size. We evaluate the effect of the size of the student model. Fig. 4 depicts the relationship between model size and efficiency. The accuracy of our model increases with the model size, as also observed in [23]. The same tendency holds consistently across the CF models. For both CF models, a small student whose size is about 20% of the teacher model performs comparably to the teacher model.
We further investigate the tradeoff between model size and inference time. As depicted in Fig. 5, the inference time grows as the model size increases. This is reasonable because more model parameters incur a higher computational cost. Compared to the teacher model, our models require less than about 50% of the inference time while achieving performance comparable to the teacher model. We conclude that our models strike a favorable balance between effectiveness and efficiency.
Lastly, Fig. 6 depicts the effects of CDTG and CDSG at each training step. Our models consistently outperform the student model at each step. The performance gap between our models and the teacher model narrows as the number of training steps increases. As depicted in Fig. 6(d), CDSG even outperforms the teacher model after 70 epochs.
Effect of model hyperparameters. Fig. 7 depicts NDCG@50 over varying hyperparameter for the KD loss function. In the case of CDSG, the best performance is achieved when is around 0.1 in both datasets. Fig. 8 depicts the NDCG for various sampling ratios. For CDAE, we sample items from those with missing feedback. In both datasets, the best performance is achieved when is around 0.5 for CDSG. For CDTG, we also observed a similar tendency for .
For the sampling ratios , CDTG is worse than CDSG when is approximately 0.1–0.5. This implies that CDSG is more effective than CDTG when the sampling ratio is low. In this sense, CDSG is advantageous because it leads to a better update path at low sampling ratios.
[Figure panels, repeated for two figures: (a) AMusic (CDSG), (b) Yelp (CDSG)]
V Related Work
Model compression techniques. Balancing the effectiveness and efficiency of computational models is a fundamental issue for real-world applications. To address this problem, various techniques [2] have been developed to compress cumbersome models into smaller ones. In general, existing work falls into three categories: (1) binarization and discretization, (2) pruning and sharing of model parameters, and (3) knowledge distillation (KD).
First, [3, 10] proposed the binary encoding of model parameters. Under this method, real-valued model parameters are discretized into a binary representation. Although discretizing the model parameters incurs a loss of accuracy, it reduces memory usage and improves efficiency. Second, the pruning and sharing methods presented in [20, 4] remove or tie model parameters that are redundant or have minimal impact on the loss function. In principle, these research directions focus on designing an efficient inference process using various computational acceleration techniques with low memory usage, and thus mostly rely on model-dependent techniques.
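The first two categories can be illustrated with a toy sketch on a flat list of weights. It shows deterministic sign binarization and magnitude-based pruning in their simplest form; the function names are hypothetical, and the code is a simplification rather than the exact procedures of the cited works.

```python
def binarize(weights):
    """Map each real-valued weight to +1.0 or -1.0 according to its sign."""
    return [1.0 if w >= 0 else -1.0 for w in weights]


def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude.

    sparsity: fraction in [0, 1] of the weights to remove (assumed parameter).
    """
    n_prune = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned
```

Both operations shrink the stored model, but they act on the parameters of one specific network, which is why such techniques tend to be model-dependent.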
In contrast, KD is a model-independent learning framework that compresses a model by transferring the distilled knowledge of a large teacher model to a small student model. Various KD techniques have been proposed to improve the original KD in two directions: (1) incorporating more information in addition to soft targets and (2) analyzing the loss function for KD.
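For reference, the soft-target idea of the original KD in the classification setting can be sketched as below. This is a minimal illustration under stated assumptions: the mixing weight `alpha` and the default temperature are hypothetical, and the squared-temperature scaling of the soft term that is often applied in practice is omitted for brevity. It is not the loss function proposed in this paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature yields a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))


def kd_loss(student_logits, teacher_logits, hard_label, temperature=2.0, alpha=0.5):
    """Weighted sum of the hard-label loss and the soft-target (teacher) loss."""
    hard_target = [1.0 if i == hard_label else 0.0 for i in range(len(student_logits))]
    hard_loss = cross_entropy(hard_target, softmax(student_logits))
    soft_loss = cross_entropy(
        softmax(teacher_logits, temperature),  # teacher's softened prediction
        softmax(student_logits, temperature),  # student's softened prediction
    )
    return (1.0 - alpha) * hard_loss + alpha * soft_loss
```

Because the loss only compares output distributions, the teacher and the student may have entirely different architectures, which is what makes the framework model-independent.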
The first trend is based on the intuition that utilizing soft targets alone is not sufficient because meaningful intermediate information may be ignored during student training. FitNet [17] first pointed out this limitation and suggested using the output of intermediate layers as additional matching criteria. Similarly, [31] utilized the Gram matrix of the channel responses from the teacher model as additional information to educate the student model. Net2Net [1] employed the model parameters of the teacher model directly to initialize those of the student model. Recently, [32] used the attention map as an additional matching constraint. That is, in addition to the original loss term for matching the soft target, the attention map of the student model should match that of the teacher model. Most recently, [21] further improved the attention-based method by matching the gradients (i.e., Jacobians) of the output activations with respect to the input.
Along an alternative direction, several algorithms focused on analyzing the choice of loss functions for KD. [7] observed that the distance-based loss is inappropriate for transferring activation boundaries, and thus suggested a hinge loss. [28] and [26] incorporated adversarial learning into the KD framework. Recently, KDGAN [25] bypassed the convergence step of adversarial learning by employing a triple-player game [33]. In this study, we develop an improved loss function for KD. Unlike existing models, we mainly focus on modifying the loss function of KD for top-N recommendation.
One-class collaborative filtering (OCCF). For implicit datasets, handling missing feedback, which is intrinsically a mixture of positive and negative feedback, is a nontrivial issue. To address this challenge, existing studies can be categorized into weight-based, sampling-based, and imputation-based methods. First, the weight-based method [9, 14] regards all missing feedback as negative. For instance, Hu et al. [9] and Pan et al. [14] controlled weights as the confidence of negative values with various schemes, such as uniform, user-oriented, and item-oriented methods. Second, Paquet and Koenigstein [16] proposed a sampling-based method by considering the degree distributions of users/items in the graph. Lastly, Sindhwani et al. [19] regarded unobserved feedback as optimization variables and imputed missing feedback via optimization. Besides, Li et al. [12] leveraged side information to construct user-item similarity, and Zheng et al. [34] employed multiple similarity matrices between users and items to predict drug-target interactions. Moreover, Yao et al. [30] proposed dual regularization by combining the weight-based and imputation-based methods.
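As a concrete illustration of the weight-based idea, the sketch below treats every missing entry as a negative example but down-weights it with a uniform confidence. The squared-error form, the `missing_weight` parameter, and the function name are assumptions chosen for brevity; the cited works use their own weighting schemes and objectives.

```python
def weighted_occf_loss(predictions, observed, missing_weight=0.1):
    """Per-user weighted squared loss for one-class feedback.

    predictions:    dict mapping item -> predicted score in [0, 1]
    observed:       set of items with positive feedback (target 1, weight 1)
    missing_weight: uniform confidence for missing items (target 0)
    """
    loss = 0.0
    for item, score in predictions.items():
        if item in observed:
            loss += (1.0 - score) ** 2
        else:
            loss += missing_weight * score ** 2
    return loss
```

A user-oriented or item-oriented scheme would replace the single `missing_weight` with a value that depends on the user or the item.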
The proposed model is similar to the imputation-based method because the student model utilizes inferred values for missing feedback. While the imputation-based method mainly focuses on substituting missing feedback to improve the accuracy of top-N recommendation using auxiliary information, our model aims to balance the effectiveness and efficiency of top-N recommendation in the paradigm of KD.
VI Conclusion
In this study, we propose a new knowledge distillation model, namely collaborative distillation (CD), with implicit user feedback for top-N recommendation. Specifically, we address several challenges raised in top-N recommendation: data sparsity, the ambiguity of missing feedback, and the ranking problem. To overcome these challenges, we first introduce a new loss function for CF. Then, we deal with the ranking problem using the rank-aware sampling method. In this process, our model utilizes the soft target without manipulation to manage data sparsity. Furthermore, we devise teacher- and student-guided training strategies to validate how the active/passive interactions between teacher and student models affect KD performance. Through extensive experiments, we demonstrate that the proposed model significantly outperforms the state-of-the-art model and several baseline models.
Acknowledgment
This work was supported by the National Research Foundation of Korea (NRF) grant (No. NRF2018R1A2B6009135 and NRF2019R1A2C2006123) and the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019000421, AI Graduate School Support Program).
References
[1] (2016) Net2Net: accelerating learning via knowledge transfer. In International Conference on Learning Representations (ICLR).
[2] (2017) A survey of model compression and acceleration for deep neural networks. CoRR abs/1710.09282.
[3] (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems (NIPS), pp. 3123–3131.
[4] (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), pp. 1135–1143.
[5] (2018) Outer product-based neural collaborative filtering. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 2227–2233.
[6] (2017) Neural collaborative filtering. In International Conference on World Wide Web (WWW), pp. 173–182.
[7] (2019) Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In AAAI Conference on Artificial Intelligence (AAAI).
[8] (2015) Distilling the knowledge in a neural network. CoRR abs/1503.02531.
[9] (2008) Collaborative filtering for implicit feedback datasets. In IEEE International Conference on Data Mining (ICDM), pp. 263–272.
[10] (2016) Binarized neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 4107–4115.
[11] (2016) Convolutional matrix factorization for document context-aware recommendation. In ACM Conference on Recommender Systems (RecSys), pp. 233–240.
[12] (2010) Improving one-class collaborative filtering by incorporating rich user information. In ACM International Conference on Information and Knowledge Management (CIKM), pp. 959–968.
[13] (2018) Neural personalized ranking for image recommendation. In ACM International Conference on Web Search and Data Mining (WSDM), pp. 423–431.
[14] (2009) Mind the gaps: weighting the unknown in large-scale one-class collaborative filtering. In ACM International Conference on Knowledge Discovery and Data Mining (KDD), pp. 667–676.
[15] (2008) One-class collaborative filtering. In IEEE International Conference on Data Mining (ICDM), pp. 502–511.
[16] (2013) One-class collaborative filtering with random graphs. In International World Wide Web Conference (WWW), pp. 999–1008.
[17] (2014) FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550.
[18] (2015) AutoRec: autoencoders meet collaborative filtering. In International Conference on World Wide Web Companion (WWW), pp. 111–112.
[19] (2010) One-class matrix completion with low-density factorizations. In IEEE International Conference on Data Mining (ICDM), pp. 1055–1060.
[20] (2015) Data-free parameter pruning for deep neural networks. In British Machine Vision Conference (BMVC), pp. 31.1–31.12.
[21] (2018) Knowledge transfer with Jacobian matching. arXiv preprint arXiv:1803.00443.
[22] (2018) Personalized top-n sequential recommendation via convolutional sequence embedding. In ACM International Conference on Web Search and Data Mining (WSDM), pp. 565–573.
[23] (2018) Ranking distillation: learning compact ranking models with high performance for recommender system. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pp. 2289–2298.
[24] (2015) Collaborative deep learning for recommender systems. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1235–1244.
[25] (2018) KDGAN: knowledge distillation with generative adversarial networks. In Advances in Neural Information Processing Systems (NIPS), pp. 783–794.
[26] (2018) Adversarial learning of portable student networks. In AAAI Conference on Artificial Intelligence (AAAI).
[27] (2016) Collaborative denoising auto-encoders for top-n recommender systems. In ACM International Conference on Web Search and Data Mining (WSDM), pp. 153–162.
[28] (2018) Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks. In International Conference on Learning Representations Workshop.
[29] (2017) Deep matrix factorization models for recommender systems. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 3203–3209.
[30] (2014) Dual-regularized one-class collaborative filtering. In ACM International Conference on Information and Knowledge Management (CIKM), pp. 759–768.
[31] (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7130–7138.
[32] (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928.
[33] (2017) Triple generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS).
[34] (2013) Collaborative matrix factorization with multiple similarities for predicting drug-target interactions. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1025–1033.