NISER: Normalized Item and Session Representations with Graph Neural Networks
Abstract.
The goal of session-based recommendation (SR) models is to utilize the information from past actions (e.g., item/product clicks) in a session to recommend items that a user is likely to click next. Recently it has been shown that the sequence of item interactions in a session can be modeled as graph-structured data to better account for complex item transitions. Graph neural networks (GNNs) can learn useful representations for such session-graphs, and have been shown to improve over sequential models such as recurrent neural networks (Wu et al., 2019). However, we note that these GNN-based recommendation models suffer from popularity bias: the models are biased towards recommending popular items, and fail to recommend relevant long-tail items (less popular or less frequent items). Therefore, these models perform poorly for the less popular new items arriving daily in a practical online setting. We demonstrate that this issue is, in part, related to the magnitude or norm of the learned item and session-graph representations (embedding vectors). We propose a training procedure that mitigates this issue by using normalized representations. The models using normalized item and session-graph representations perform significantly better: i. for the less popular long-tail items in the offline setting, and ii. for the less popular newly introduced items in the online setting. Furthermore, our approach significantly improves upon the existing state-of-the-art on three benchmark datasets.
1. Introduction
The goal of session-based recommendation (SR) models is to recommend top-K items to a user based on the sequence of items clicked so far. Recently, several effective models for SR based on deep neural network architectures have been proposed (Wu et al., 2019; Liu et al., 2018; Li et al., 2017; Wang et al., 2019). These approaches consider SR as a multi-class classification problem where the input is a sequence of items clicked in the past in a session and the target classes correspond to the available items in the catalog. Many of these approaches use sequential models like recurrent neural networks, considering a session as a sequence of item-click events (Hidasi et al., 2016; Jannach and Ludewig, 2017; Wang et al., 2019). On the other hand, approaches like STAMP (Liu et al., 2018) consider a session to be a set of items, and use attention models while learning to weigh (attend to) items as per their relevance to predict the next item. Approaches like NARM (Li et al., 2017) and CSRM (Wang et al., 2019) use a combination of sequential and attention models.
An important building block in most of these deep learning approaches is their ability to learn representations or embeddings for items and sessions. Recently, an alternative approach, namely SR-GNN (Wu et al., 2019), has been proposed to model sessions as graph-structured data using GNNs (Li et al., 2015) rather than as sequences or sets, noting that users tend to make complex to-and-fro transitions across items within a session. For example, consider a session of item clicks $s = (i_1, i_2, i_1)$: the user clicks on item $i_1$, then clicks on item $i_2$, and then re-clicks on item $i_1$. This sequence of clicks induces a graph where nodes and edges correspond to items and transitions across items, respectively. For session $s$ in the above example, $i_1$ and $i_2$ are neighbors of each other in the induced session-graph, so the representation of each can be updated using the representation of the other, thus obtaining more context-aware and informative representations. It is worth noting that this way of capturing neighborhood information has also been found to be effective in neighborhood-based SR methods such as SKNN (Jannach and Ludewig, 2017) and STAN (Garg et al., 2019).
It is well-known that more popular items are presented and interacted with more often on online platforms. This results in a skewed distribution of items clicked by users (Steck, 2011; Abdollahpouri et al., 2017; Yang et al., 2018), as illustrated in Fig. 1. Models trained on the resulting data tend to have popularity bias, i.e., they tend to recommend more popular items over rarely clicked items.
We note that SR-GNN (referred to as GNN hereafter) also suffers from popularity bias. This problem is even more severe in a practical online setting where new items are frequently added to the catalog and are inherently less popular during their initial days. To mitigate this problem, we study GNN through the lens of an item and session-graph representation learning mechanism, where the goal is to obtain a session-graph representation that is similar to the representation of the item likely to be clicked next. We motivate the advantage of restricting the item and session-graph representations to lie on a unit hypersphere both during training and inference, and propose NISER: a Normalized Item and Session Representations model for SR. We demonstrate the enhanced ability of NISER to deal with popularity bias in comparison to a vanilla GNN model in the offline as well as online settings. We also extend NISER to incorporate the sequential nature of a session via position embeddings (Vaswani et al., 2017), thereby leveraging the benefits of both sequence-aware models (like RNNs) and graph-aware models.
2. Related Work
Recent results in the computer vision literature, e.g., (Wang et al., 2017; Zheng et al., 2018), indicate the effectiveness of normalizing the final image features during training, and argue in favor of cosine similarity over the inner product for learning and comparing feature vectors. (Zheng et al., 2018) introduces the ring loss for soft feature normalization, which eventually learns to constrain the feature vectors to a unit hypersphere. Normalizing word embeddings is also popular in NLP applications, e.g., (Peng et al., 2015) proposes penalizing the L$_2$ norm of word embeddings for regularization. However, to the best of our knowledge, the idea of normalizing item and session-graph embeddings or representations has not been explored.
In this work, we study the effect of normalizing the embeddings on popularity bias, which has not been established and studied so far. Several approaches to deal with popularity bias exist in the collaborative filtering setting, e.g., (Abdollahpouri et al., 2017; Steck, 2011; Yang et al., 2018). To deal with popularity bias, (Abdollahpouri et al., 2017) introduces the notion of flexible regularization in a learning-to-rank algorithm. Similarly, (Steck, 2011; Yang et al., 2018) use the power-law of popularity, where the probability of recommending an item is a smooth function of the item's popularity, controlled by an exponent factor. However, to the best of our knowledge, this is the first attempt to study and address popularity bias in DNN-based SR, using SR-GNN (Wu et al., 2019) as a working example. Furthermore, SR-GNN does not incorporate sequential information explicitly to obtain the session-graph representation. We study the effect of incorporating position embeddings (Vaswani et al., 2017) and show that it leads to minor but consistent improvements in recommendation performance.
3. Problem Definition
Let $\mathcal{S}$ denote all past sessions, and $\mathcal{I}$ denote the set of $m$ items observed in $\mathcal{S}$. Any session $s \in \mathcal{S}$ is a chronologically ordered tuple of item-click events: $s = (i_{s,1}, i_{s,2}, \ldots, i_{s,l})$, where each of the $l$ item-click events $i_{s,p}$ ($p = 1, \ldots, l$) corresponds to an item in $\mathcal{I}$, and $p$ denotes the position of the item in the session $s$. A session $s$ can be modeled as a graph $\mathcal{G}_s = (\mathcal{V}_s, \mathcal{E}_s)$, where each $i_{s,p} \in \mathcal{V}_s$ is a node in the graph. Further, $(i_{s,p}, i_{s,p+1}) \in \mathcal{E}_s$ is a directed edge from $i_{s,p}$ to $i_{s,p+1}$. Given $s$, the goal of SR is to predict the next item $i_{s,l+1}$ by estimating the $m$-dimensional probability vector $\hat{\mathbf{y}}$ corresponding to the relevance scores for the $m$ items. The $K$ items with the highest scores constitute the top-K recommendation list.
4. Learning Item and Session Representations
Each item $i_k$ is mapped to a $d$-dimensional vector from the trainable embedding lookup table or matrix $\mathbf{I} \in \mathbb{R}^{m \times d}$, such that each row $\mathbf{i}_k \in \mathbb{R}^d$ is the $d$-dimensional representation or embedding^{1} vector corresponding to item $i_k$ ($k = 1, \ldots, m$). (^{1}We use the terms representation and embedding interchangeably.) Consider any function $G$ (e.g., a neural network as in (Liu et al., 2018; Wu et al., 2019)), parameterized by $\theta$, that maps the items in a session $s$ to a session embedding $\mathbf{s} = G(\{\mathbf{i}_{s,p}\}_{p=1}^{l}; \theta)$.^{2} (^{2}To ensure the same input dimensions across sessions, we can pad the input with a dummy vector as many times as needed.) Along with this input, which treats $s$ as a sequence, we also introduce an adjacency matrix $\mathbf{A}$ to incorporate the graph structure. We discuss this in more detail later in Section 5.
The goal is to obtain a session embedding $\mathbf{s}$ that is close to the embedding of the target item $i_{s,l+1}$, such that the estimated index/class for the target item is $\hat{k} = \arg\max_{k} \mathbf{i}_k^{\top}\mathbf{s}$. In a DNN-based model $G$, this is approximated via a differentiable softmax function such that the probability of the next item being $i_k$ is given by:
$$\hat{y}_k = p(i_k \mid s) = \frac{\exp(\mathbf{i}_k^{\top}\mathbf{s})}{\sum_{j=1}^{m}\exp(\mathbf{i}_j^{\top}\mathbf{s})} \quad (1)$$
For this $m$-way classification task, the softmax (cross-entropy) loss $\mathcal{L}(\hat{\mathbf{y}}) = -\sum_{k=1}^{m} y_k \log(\hat{y}_k)$ is used during training; the parameters $\mathbf{I}$ and $\theta$ are estimated by minimizing the sum of $\mathcal{L}$ over all training samples, where $\mathbf{y}$ is a 1-hot vector with $y_k = 1$ for the correct (target) class $k$.
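As a concrete sketch of this scoring step (toy sizes and numpy in place of the actual deep-learning framework; all names are ours, not from the published implementation), the next-item probabilities reduce to inner products followed by a softmax:

```python
import numpy as np

def next_item_probs(item_emb, session_emb):
    """Score every catalog item by its inner product with the
    session embedding, then convert scores to probabilities
    with a softmax."""
    logits = item_emb @ session_emb      # shape (m,)
    logits = logits - logits.max()       # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

rng = np.random.default_rng(0)
item_emb = rng.normal(size=(5, 4))   # m=5 items, d=4 (toy sizes)
session_emb = rng.normal(size=4)
probs = next_item_probs(item_emb, session_emb)
top_k = np.argsort(-probs)[:2]       # indices of top-K recommended items
```

The 1-hot cross-entropy loss then simply evaluates to `-np.log(probs[target])` for the target class.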
We next introduce the radial property (Wang et al., 2017; Zheng et al., 2018) of softmax loss, and then use it to motivate the need for normalizing the item and session representations in order to reduce popularity bias.
4.1. Radial Property of Softmax Loss
It is well-known that optimizing for the softmax loss leads to a radial distribution of features for a target class (Wang et al., 2017; Zheng et al., 2018): if the target class $k$ satisfies $z_k = \max_j z_j$, where $z_j = \mathbf{i}_j^{\top}\mathbf{s}$ are the logits, then it is easy to show that
$$\mathrm{softmax}(\alpha\mathbf{z})_k \geq \mathrm{softmax}(\mathbf{z})_k \quad (2)$$
for any $\alpha > 1$. This, in turn, implies that the softmax loss favors large feature norms for easily classifiable instances. We omit details for brevity and refer the reader to (Wang et al., 2017; Zheng et al., 2018) for a thorough analysis and proof. This means that a high value of $z_k = \mathbf{i}_k^{\top}\mathbf{s}$ can be attained by multiplying the vector $\mathbf{i}_k$ by a scalar $\alpha > 1$; or simply by ensuring a large norm for the item embedding vector.
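A tiny numeric check of the radial property (toy logits of our own choosing): scaling all logits by a factor greater than one strictly increases the softmax probability of the argmax class, which is exactly what a larger embedding norm achieves.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy logits z_j = i_j^T s where class 0 is the argmax.
z = np.array([1.2, 1.0, 0.5])
p_before = softmax(z)[0]
p_after = softmax(2.0 * z)[0]   # alpha = 2 > 1, e.g. by inflating ||i_0||
assert p_after > p_before       # probability of the argmax class grows
```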
4.2. Normalizing the Representations
We note that the radial property has an interesting implication in the context of popularity bias: target items that are easier to predict are likely to have a higher L$_2$ norm. We illustrate this with the help of an example: items that are popular are likely to be clicked more often, and hence the trained parameters $\mathbf{I}$ and $\theta$ should have values that ensure these items get recommended more often. The radial property implies that for a given input $s$ and a popular target item $i_k$, a correct classification decision can be obtained as follows: learn an embedding vector $\mathbf{i}_k$ with a high norm $\|\mathbf{i}_k\|_2$ such that the inner product $\mathbf{i}_k^{\top}\mathbf{s} = \|\mathbf{i}_k\|_2\|\mathbf{s}\|_2\cos\theta_k$ (where $\theta_k$ is the angle between the item and session embeddings) is likely to be high, ensuring a large value for $\hat{y}_k$ (even when $\theta_k$ is not small enough and $\cos\theta_k$ is not large enough). When analyzing the item embeddings from a GNN model (Wu et al., 2019), we observe that this is indeed the case: as shown in Fig. 3, items with high popularity have a high L$_2$ norm while less popular items have a significantly lower L$_2$ norm. Further, the performance of GNN (depicted in terms of Recall@20) degrades as the popularity of the target item decreases.
4.3. NISER
Based on the above observations, we seek to minimize the influence of embedding norms on the final classification and recommendation decision. We propose optimizing for cosine similarity as the measure of similarity between item and session embeddings instead of the above-stated inner product. Therefore, during training as well as inference, we normalize the item embeddings as $\tilde{\mathbf{i}}_k = \mathbf{i}_k / \|\mathbf{i}_k\|_2$, and use them to get the session embedding $\mathbf{s} = G(\{\tilde{\mathbf{i}}_{s,p}\}_{p=1}^{l}; \theta)$. The session embedding is similarly normalized to $\tilde{\mathbf{s}} = \mathbf{s} / \|\mathbf{s}\|_2$ to enforce a unit norm. The normalized item and session embeddings are then used to obtain the relevance score for the next clicked item, computed as
$$\hat{y}_k = \frac{\exp(\sigma\,\tilde{\mathbf{i}}_k^{\top}\tilde{\mathbf{s}})}{\sum_{j=1}^{m}\exp(\sigma\,\tilde{\mathbf{i}}_j^{\top}\tilde{\mathbf{s}})} \quad (3)$$
Note that the cosine similarity $\tilde{\mathbf{i}}_k^{\top}\tilde{\mathbf{s}}$ is restricted to $[-1, 1]$. As shown in (Wang et al., 2017), this implies that the softmax loss is likely to saturate at high values on the training set; a scaling factor $\sigma > 1$ is useful in practice to allow for better convergence.
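A minimal numpy sketch of the normalized scoring (the scale value 16.0 is illustrative, not the paper's tuned value): note how inflating the norm of one item embedding no longer changes the scores, since normalization removes the norm from the decision.

```python
import numpy as np

def niser_probs(item_emb, session_emb, scale=16.0):
    """Softmax over scaled cosine similarities between L2-normalized
    item embeddings and the L2-normalized session embedding."""
    items = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    sess = session_emb / np.linalg.norm(session_emb)
    cos = items @ sess                # each entry lies in [-1, 1]
    logits = scale * cos              # sigma counters softmax saturation
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
item_emb = rng.normal(size=(5, 4))
session_emb = rng.normal(size=4)

# Inflating the norm of a "popular" item's embedding has no effect.
boosted = item_emb.copy()
boosted[0] *= 100.0
p1 = niser_probs(item_emb, session_emb)
p2 = niser_probs(boosted, session_emb)
```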
5. Leveraging NISER with GNN
We consider learning the representations of items and session-graphs with GNNs, where the session-graph is represented by $\mathcal{G}_s$ as introduced in Section 3. Consider two adjacency matrices $\mathbf{A}^{in}_s$ and $\mathbf{A}^{out}_s$ corresponding to the incoming and outgoing edges in graph $\mathcal{G}_s$, as illustrated in Fig. 2. Each edge has a normalized weight, calculated as the number of occurrences of the edge divided by the out-degree of the edge's start node.
The GNN takes the adjacency matrices $\mathbf{A}^{in}_s$ and $\mathbf{A}^{out}_s$ and the normalized item embeddings $\tilde{\mathbf{i}}_{s,p}$ as input, and returns an updated set of embeddings after $\tau$ iterations of message propagation across vertices in the graph using gated recurrent units (Li et al., 2015): $[\mathbf{v}_1^{\tau}, \ldots, \mathbf{v}_n^{\tau}] = GNN(\mathbf{A}^{in}_s, \mathbf{A}^{out}_s, \{\tilde{\mathbf{i}}_{s,p}\}_{p=1}^{l}; \theta_g)$, where $\theta_g$ represents the parameters of the GNN function. For any node in the graph, the current representation of the node and the representations of its neighboring nodes are used to iteratively update the representation of the node $\tau$ times. More specifically, the representation of node $j$ in the $t$-th message propagation step is updated as follows:
$$\mathbf{a}_j^{t} = \left[\mathbf{A}^{in}_{j:}\,;\,\mathbf{A}^{out}_{j:}\right]\left[\mathbf{v}_1^{t-1}, \ldots, \mathbf{v}_n^{t-1}\right]^{\top} + \mathbf{b} \quad (4)$$
$$\mathbf{z}_j^{t} = \sigma(\mathbf{W}_z\mathbf{a}_j^{t} + \mathbf{U}_z\mathbf{v}_j^{t-1}) \quad (5)$$
$$\mathbf{r}_j^{t} = \sigma(\mathbf{W}_r\mathbf{a}_j^{t} + \mathbf{U}_r\mathbf{v}_j^{t-1}) \quad (6)$$
$$\tilde{\mathbf{v}}_j^{t} = \tanh(\mathbf{W}_o\mathbf{a}_j^{t} + \mathbf{U}_o(\mathbf{r}_j^{t} \odot \mathbf{v}_j^{t-1})) \quad (7)$$
$$\mathbf{v}_j^{t} = (1 - \mathbf{z}_j^{t}) \odot \mathbf{v}_j^{t-1} + \mathbf{z}_j^{t} \odot \tilde{\mathbf{v}}_j^{t} \quad (8)$$
where $\mathbf{A}^{in}_{j:}$ and $\mathbf{A}^{out}_{j:}$ denote the $j$-th rows of $\mathbf{A}^{in}_s$ and $\mathbf{A}^{out}_s$ respectively; $\mathbf{W}_z$, $\mathbf{W}_r$, $\mathbf{W}_o$, $\mathbf{U}_z$, $\mathbf{U}_r$, $\mathbf{U}_o$, and $\mathbf{b}$ are trainable parameters; $\sigma(\cdot)$ is the sigmoid function; and $\odot$ is the element-wise multiplication operator.
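One propagation step can be sketched in numpy as follows (toy dimensions and a hand-built session-graph; parameter names mirror the equations above, but the exact parameterization of the published implementation may differ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(V, A_in, A_out, Wz, Uz, Wr, Ur, Wo, Uo, b):
    """One gated message-propagation step over all n nodes at once.
    V: (n, d) node embeddings from the previous step."""
    # Aggregate messages over incoming and outgoing edges.
    a = np.concatenate([A_in @ V, A_out @ V], axis=1) + b   # (n, 2d)
    z = sigmoid(a @ Wz + V @ Uz)               # update gate
    r = sigmoid(a @ Wr + V @ Ur)               # reset gate
    v_cand = np.tanh(a @ Wo + (r * V) @ Uo)    # candidate embedding
    return (1.0 - z) * V + z * v_cand          # GRU-style interpolation

n, d = 3, 4
rng = np.random.default_rng(1)
V = rng.normal(size=(n, d))
# Session (i1, i2, i3): edges i1->i2 and i2->i3.
A_out = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
A_in = A_out.T
Wz, Wr, Wo = (rng.normal(size=(2 * d, d)) for _ in range(3))
Uz, Ur, Uo = (rng.normal(size=(d, d)) for _ in range(3))
b = np.zeros(2 * d)
V_next = ggnn_step(V, A_in, A_out, Wz, Uz, Wr, Ur, Wo, Uo, b)
```

Running this step $\tau$ times (feeding `V_next` back in) yields the final node embeddings.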
To incorporate the sequential information of item interactions, we optionally learn position embeddings and add them to the item embeddings to effectively obtain position-aware item (and subsequently session) embeddings. The final embedding for the item at position $p$ in session $s$ is computed as $\tilde{\mathbf{i}}_{s,p} + \mathbf{p}_p$, where $\mathbf{p}_p$ is the embedding vector for position $p$ obtained via a lookup over the position embeddings matrix $\mathbf{P} \in \mathbb{R}^{L \times d}$, where $L$ denotes the maximum length of any input session such that position $p \leq L$.
The soft-attention weight of the $p$-th item in session $s$ is computed as $\alpha_p = \mathbf{q}^{\top}\sigma(\mathbf{W}_1\mathbf{v}_l + \mathbf{W}_2\mathbf{v}_p + \mathbf{c})$, where $\mathbf{q}, \mathbf{c} \in \mathbb{R}^{d}$ and $\mathbf{W}_1, \mathbf{W}_2 \in \mathbb{R}^{d \times d}$. The $\alpha_p$s are further normalized using softmax. An intermediate session embedding is computed as the attention-weighted sum $\mathbf{s}' = \sum_{p=1}^{l}\alpha_p\mathbf{v}_p$. The session embedding $\mathbf{s}$ is a linear transformation over the concatenation of the intermediate session embedding $\mathbf{s}'$ and the embedding $\mathbf{v}_l$ of the most recent item $i_{s,l}$, s.t. $\mathbf{s} = \mathbf{W}_3[\mathbf{s}'; \mathbf{v}_l]$, where $\mathbf{W}_3 \in \mathbb{R}^{d \times 2d}$.
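The attention readout above can be sketched as (toy dimensions; a vectorized numpy approximation of the description, not the authors' code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def session_readout(V, W1, W2, W3, q, c):
    """Soft-attention readout over node embeddings V (l, d), where
    the last row corresponds to the most recent item."""
    v_last = V[-1]
    # Unnormalized attention score for each item in the session.
    scores = sigmoid(v_last @ W1.T + V @ W2.T + c) @ q   # (l,)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()           # softmax-normalized weights
    s_int = alpha @ V                     # intermediate embedding s'
    return W3 @ np.concatenate([s_int, v_last])   # s = W3 [s'; v_l]

l, d = 4, 3
rng = np.random.default_rng(2)
V = rng.normal(size=(l, d))
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W3 = rng.normal(size=(d, 2 * d))
q, c = rng.normal(size=d), rng.normal(size=d)
s = session_readout(V, W1, W2, W3, q, c)
```

In NISER, the returned `s` would then be L2-normalized before scoring against the normalized item embeddings.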
The final recommendation scores for the $m$ items are computed as per Eq. 3. Note that while the session-graph embedding is obtained using item embeddings that are aware of the session-graph and sequence, the normalized item embeddings ($\tilde{\mathbf{i}}_k$) independent of a particular session are used to compute the recommendation scores.
6. Experimental Evaluation
Dataset Details:
We evaluate NISER on three publicly available benchmark datasets: i) Yoochoose (YC), ii) Diginetica (DN), and iii) RetailRocket (RR).
The YC^{3} (^{3}http://2015.recsyschallenge.com/challege.html) dataset is from the RecSys Challenge 2015; it contains a stream of user clicks on an e-commerce website over six months.
Given the large number of sessions in YC, the most recent 1/4 and 1/64 fractions of the training set are used to form two datasets, YC-1/4 and YC-1/64 respectively, as done in (Wu et al., 2019).
The DN^{4} (^{4}http://cikm2016.cs.iupui.edu/cikmcup) dataset is transactional data from the CIKM Cup 2016 challenge. The RR^{5} (^{5}https://www.dropbox.com/sh/dbzmtq4zhzbj5o9/AACldzQWbwigKjcPTBI6ZPAa?dl=0) dataset is from the e-commerce personalization company RetailRocket, which published a dataset with six months of user browsing activity.
Offline and Online setting:
We consider two evaluation settings: i. offline and ii. online.
For evaluation in the offline setting, we use the static train-test splits from (Wu et al., 2019) for YC and DN.
For RR, we use sessions from the last 14 days for testing and the remaining 166 days for training.
The statistics of the datasets are summarized in Table 1.
For evaluation in the online setting, we retrain the models every day for 2 weeks (for YC the number of sessions per day is much larger, so we evaluate for 1 week due to computational constraints) by appending the sessions from that day to the previous train set, and report the test results of the trained model on the sessions from the subsequent day.
NISER and its variants:
We apply our approach on top of GNN, adapting the code^{6} (^{6}https://github.com/CRIPAC-DIG/SR-GNN) from (Wu et al., 2019) with suitable modifications described below.
We found that for long sessions, considering only the 10 most recently clicked items to make recommendations worked consistently better across datasets.
We refer to this variant as GNN+ and use this additional preprocessing step in all our experiments.
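The GNN+ preprocessing step is a one-line truncation (function name ours):

```python
def truncate_session(session, max_items=10):
    """GNN+ preprocessing: keep only the most recently clicked
    max_items items of a session."""
    return session[-max_items:]

# Long sessions are truncated to their 10 most recent clicks;
# shorter sessions are left unchanged.
assert truncate_session(list(range(15))) == list(range(5, 15))
assert truncate_session([1, 2, 3]) == [1, 2, 3]
```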
Table 1. Statistics of the datasets.

Statistics        DN      RR      YC-1/64  YC-1/4
#train sessions   0.7 M   0.7 M   0.4 M    0.6 M
#test sessions    60,858  60,594  55,898   55,898
#items            43,097  48,759  16,766   29,618
Average length    5.12    3.55    6.16     5.17
We consider enhancements over GNN+, and propose the following variants of the embedding normalization approach:

- Normalized Item Representations (NIR): only the item embeddings are normalized, and the scale factor $\sigma$ is not used.^{7} (^{7}Since the session embedding is not normalized, the score $\tilde{\mathbf{i}}_k^{\top}\mathbf{s}$ is not restricted to $[-1, 1]$; in general $\|\mathbf{s}\|_2 \neq 1$.)
- Normalized Item and Session Representations (NISER): both item and session embeddings are normalized.
- NISER+: NISER with position embeddings and dropout applied to the input item embeddings.
Hyperparameter Setup:
Following (Wu et al., 2019), we use the same embedding dimension $d$ and a learning rate of 0.001 with the Adam optimizer.
We use the 10% of the train set that is closest to the test set in time as a holdout validation set for hyperparameter tuning, including the scale factor $\sigma$.
We found the same value of $\sigma$ to work best across most models trained on the respective holdout validation sets, and hence we use that value across datasets for consistency.
We use a dropout probability of 0.1 on the dimensions of the item embeddings in NISER+ across all models.
Since the validation set is closest to the test set in time, it is desirable to use it for training the models as well: after finding the best epoch via early stopping based on validation performance, we retrain the model for the same number of epochs on the combined train and validation set.
We train five models with random initialization for the best hyperparameters, and report the average and standard deviation of the various metrics for all datasets except YC-1/4, for which we train three models (as it is a large dataset and training one model takes around 15 hours).
Evaluation Metrics:
We use the same evaluation metrics, Recall@K and Mean Reciprocal Rank (MRR@K), as in (Wu et al., 2019), with $K = 20$.
Recall@K is the proportion of test instances that have the desired item among the top-K recommended items.
MRR@K is the average of the reciprocal ranks of the desired items in the recommendation lists, with the reciprocal rank set to 0 when the desired item is not in the top-K.
A large MRR value indicates that the desired item tends to appear near the top of the recommendation list.
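Both metrics can be computed in a few lines (a generic sketch with our own function name, not the evaluation script used in the paper):

```python
def recall_and_mrr_at_k(ranked_lists, targets, k=20):
    """ranked_lists: per-session item ids sorted by predicted score;
    targets: the actual next-clicked item per session."""
    hits, rr = 0, 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = ranked[:k]
        if target in top_k:
            hits += 1
            rr += 1.0 / (top_k.index(target) + 1)  # rank is 1-based
    n = len(targets)
    return hits / n, rr / n

# Two test sessions: target ranked 1st and 3rd respectively.
recall, mrr = recall_and_mrr_at_k([[7, 2, 5], [9, 4, 8]], [7, 8], k=3)
```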
For evaluating popularity bias, we consider the following metrics as used in (Abdollahpouri et al., 2019):
Average Recommendation Popularity (ARP): this measure calculates the average popularity of the recommended items in each list, given by:

$$ARP = \frac{1}{|S_t|}\sum_{s \in S_t}\frac{\sum_{i \in L_s}\phi(i)}{K} \quad (9)$$

where $\phi(i)$ is the popularity of item $i$, i.e., the number of times item $i$ appears in the training set, $L_s$ is the recommended list of $K$ items for session $s$, and $|S_t|$ is the number of sessions in the test set. An item belongs to the set of long-tail (less popular) items if its popularity $\phi$ is below a chosen threshold. We evaluate the performance in terms of Recall@20 and MRR@20 for the sessions whose target item is in this long-tail set, varying the threshold.
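A direct implementation of Eq. 9 (function and variable names are ours):

```python
def arp(recommended_lists, train_popularity):
    """Average Recommendation Popularity: mean over test sessions of
    the average training-set popularity of the recommended items."""
    per_session = [
        sum(train_popularity.get(i, 0) for i in rec) / len(rec)
        for rec in recommended_lists
    ]
    return sum(per_session) / len(per_session)

# Toy example: popularity counts from a hypothetical training set.
pop = {'a': 100, 'b': 10, 'c': 1}
value = arp([['a', 'b'], ['c', 'b']], pop)   # (55.0 + 5.5) / 2
```

A lower ARP means the model recommends less popular (long-tail) items more often.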
6.1. Results and Observations
Table 2. Average Recommendation Popularity (ARP).

Method   DN           RR           YC-1/64         YC-1/4
GNN+     495.25±2.52  453.39±8.97  4128.54±27.80   17898.10±126.93
NISER+   487.31±0.30  398.53±3.09  3972.40±41.04   16683.52±120.74
Table 3. Recall@20 and MRR@20 in the offline setting.

Method                             DN          RR          YC-1/64     YC-1/4
Recall@20
SKNN (Jannach and Ludewig, 2017)   48.06       56.42       63.77       62.13
STAN (Garg et al., 2019)           50.97       59.80       69.45       70.07
GRU4REC (Hidasi et al., 2016)      29.45       -           60.64       59.53
NARM (Li et al., 2017)             49.70       -           68.32       69.73
STAMP (Liu et al., 2018)           45.64       53.94       68.74       70.44
GNN (Wu et al., 2019)              51.39±0.38  57.63±0.15  70.54±0.14  70.95±0.04
GNN+                               51.81±0.11  58.59±0.10  70.85±0.08  71.10±0.07
NIR                                52.40±0.06  60.67±0.08  71.12±0.05  71.32±0.11
NISER                              52.63±0.09  60.85±0.06  70.86±0.15  71.69±0.03
NISER+                             53.39±0.06  61.41±0.09  71.27±0.05  71.80±0.09
MRR@20
SKNN (Jannach and Ludewig, 2017)   16.95       33.16       25.22       24.82
STAN (Garg et al., 2019)           18.48       35.32       28.74       28.89
GRU4REC (Hidasi et al., 2016)      8.33        -           22.89       22.60
NARM (Li et al., 2017)             16.17       -           28.63       29.23
STAMP (Liu et al., 2018)           14.32       28.49       29.67       30.00
GNN (Wu et al., 2019)              17.79±0.16  32.74±0.09  30.80±0.09  31.37±0.13
GNN+                               18.03±0.05  33.29±0.03  30.84±0.10  31.51±0.05
NIR                                18.52±0.06  35.57±0.05  30.99±0.10  31.73±0.11
NISER                              18.27±0.10  36.09±0.03  31.50±0.11  31.80±0.12
NISER+                             18.72±0.06  36.50±0.05  31.61±0.02  31.77±0.10
Table 4. Ablation study: removing one feature of NISER+ at a time.

Method       DN          RR          YC-1/64     YC-1/4
Recall@20
NISER+       53.39±0.06  61.41±0.09  71.27±0.05  71.80±0.09
w/o L2 norm  52.23±0.10  59.16±0.10  71.10±0.09  71.46±0.19
w/o Dropout  52.81±0.12  60.99±0.09  71.07±0.13  71.90±0.03
w/o PE       53.11       61.22±0.03  71.13±0.04  71.70±0.11
MRR@20
NISER+       18.72±0.06  36.50±0.05  31.61±0.02  31.77±0.10
w/o L2 norm  18.11±0.05  33.78±0.04  30.90±0.07  31.49±0.07
w/o Dropout  18.43±0.11  35.99±0.02  31.56±0.06  31.93±0.17
w/o PE       18.60±0.09  36.32±0.03  31.68±0.05  31.71±0.06
(1) NISER+ reduces the popularity bias in GNN+: Table 2 shows that the ARP for NISER+ is significantly lower than for GNN+, indicating that NISER+ recommends less popular items more often than GNN+, thus reducing popularity bias. Furthermore, from Fig. 4, we observe that NISER+ outperforms GNN+ for sessions with less popular items as targets (i.e., when the popularity threshold defining the long-tail set is small), with gains of 13%, 8%, 5%, and 2% for DN, RR, YC-1/64, and YC-1/4 respectively in terms of Recall@20 at the smallest threshold considered. Similarly, the gains are 28%, 18%, 6%, and 2% in terms of MRR@20. The gains for DN and RR are high compared to YC; this is due to the high value of the threshold used for YC. If we instead consider a smaller threshold, the gains are as high as 26% and 9% for YC-1/64 and YC-1/4 respectively in terms of Recall@20, and as high as 34% and 19% in terms of MRR@20. We also note that NISER+ is at least as good as GNN+ even for sessions with more popular items as targets (i.e., when the threshold is large).
(2) NISER+ improves upon GNN+ in the online setting for newly introduced items in the set of long-tail items. These items have only a small number of sessions available for training at the end of the day they are launched. From Fig. 5, we observe that for the less popular newly introduced items, NISER+ outperforms GNN+ on the subsequent day for sessions where these items are the target items. This demonstrates the ability of NISER+ to recommend new items on the very next day, owing to its ability to reduce popularity bias. Furthermore, for DN and RR, we observe that during the initial days, when training data is scarce, GNN+ performs poorly while the performance of NISER+ is relatively consistent across days, indicating a potential regularization effect of NISER on GNN models in low-data scenarios. As days pass and more data becomes available for training, the performance of GNN+ improves with time but still stays significantly below that of NISER+. For YC, as days pass, the popularity bias becomes more and more severe (as depicted by the very small fraction of sessions with less popular newly introduced items as targets), such that the performance of both GNN+ and NISER+ degrades with time. Importantly, however, NISER+ still performs consistently better than GNN+ on any given day, as it better handles the increasing popularity bias.
(3) NISER and NISER+ outperform GNN and GNN+ in the offline setting: from Table 3, we observe that NISER+ shows consistent and significant improvements over GNN in terms of Recall and MRR, establishing a new state-of-the-art in SR.
We also conduct an ablation study (removing one feature of NISER+ at a time) to understand the effect of each of the following features of NISER+: i. L$_2$ normalization of embeddings, ii. position embeddings, and iii. dropout on item embeddings. As shown in Table 4, we observe that L$_2$ normalization is the most important factor across datasets, while dropout and position embeddings contribute in varying degrees to the overall performance of NISER+.
7. Discussion
In this work, we highlighted that the typical long-tailed item-frequency distribution leads to popularity bias in state-of-the-art deep learning models such as GNNs (Wu et al., 2019) for session-based recommendation. We then argued that this is partially related to the 'radial' property of the softmax loss, which, in our setting, implies that the embedding norms of popular items will likely be larger than those of less popular items. We showed that learning the representations for items and session-graphs by optimizing for cosine similarity instead of the inner product can mitigate this issue to a large extent. Importantly, this ability to reduce popularity bias is found to be useful in the online setting, where newly introduced items tend to be less popular and are poorly modeled by existing approaches. We observed significant improvements in overall recommendation performance by normalizing the item and session-graph representations, improving upon the existing state-of-the-art results. In the future, it would be worth exploring NISER to improve other algorithms, such as STAMP (Liu et al., 2018), that rely on similarity between embeddings for items, sessions, users, etc.
References
 Abdollahpouri et al. (2017) Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Controlling popularity bias in learning-to-rank recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 42–46.
 Abdollahpouri et al. (2019) Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2019. Managing Popularity Bias in Recommender Systems with Personalized Reranking. arXiv preprint arXiv:1901.07555 (2019).
 Garg et al. (2019) Diksha Garg, Priyanka Gupta, Pankaj Malhotra, Lovekesh Vig, and Gautam Shroff. 2019. Sequence and Time Aware Neighborhood for Session-based Recommendations: STAN. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'19). ACM, 1069–1072.
 Hidasi et al. (2016) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based recommendations with recurrent neural networks. In Proceedings of the 4th International Conference on Learning Representations (ICLR 2016).
 Jannach and Ludewig (2017) Dietmar Jannach and Malte Ludewig. 2017. When recurrent neural networks meet the neighborhood for session-based recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 306–310.
 Li et al. (2017) Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 1419–1428.
 Li et al. (2015) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493 (2015).
 Liu et al. (2018) Qiao Liu, Yifu Zeng, Refuoe Mokhosi, and Haibin Zhang. 2018. STAMP: Short-term attention/memory priority model for session-based recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1831–1839.
 Peng et al. (2015) Hao Peng, Lili Mou, Ge Li, Yunchuan Chen, Yangyang Lu, and Zhi Jin. 2015. A comparative study on regularization strategies for embeddingbased neural networks. arXiv preprint arXiv:1508.03721 (2015).
 Steck (2011) Harald Steck. 2011. Item popularity and recommendation accuracy. In Proceedings of the fifth ACM conference on Recommender systems. ACM, 125–132.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
 Wang et al. (2017) Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. 2017. Normface: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM international conference on Multimedia. ACM, 1041–1049.
 Wang et al. (2019) Meirui Wang, Pengjie Ren, Lei Mei, Zhumin Chen, Jun Ma, and Maarten de Rijke. 2019. A Collaborative Session-based Recommendation Approach with Parallel Memory Modules. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'19). ACM, 345–354.
 Wu et al. (2019) Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. 2019. Session-based Recommendation with Graph Neural Networks. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence.
 Yang et al. (2018) Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Deborah Estrin. 2018. Unbiased offline recommender evaluation for missingnotatrandom implicit feedback. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 279–287.
 Zheng et al. (2018) Yutong Zheng, Dipan K Pal, and Marios Savvides. 2018. Ring loss: Convex feature normalization for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5089–5097.