CuratorNet: Visually-aware Recommendation of Art Images


Although there are several visually-aware recommendation models in domains like fashion or even movies, the art domain lacks the same level of research attention, despite the recent growth of the online artwork market. To reduce this gap, in this article we introduce CuratorNet, a neural network architecture for visually-aware recommendation of art images. CuratorNet is designed at the core with the goal of maximizing generalization: the network has a fixed set of parameters that only need to be trained once, and thereafter the model is able to generalize to new users or items never seen before, without further training. This is achieved by leveraging visual content: items are mapped to item vectors through visual embeddings, and users are mapped to user vectors by aggregating the visual content of items they have consumed. Besides the model architecture, we also introduce novel triplet sampling strategies to build a training set for rank learning in the art domain, resulting in more effective learning than naive random sampling. With an evaluation over a real-world dataset of physical paintings, we show that CuratorNet achieves the best performance among several baselines, including the state-of-the-art model VBPR. CuratorNet is motivated and evaluated in the art domain, but its architecture and training scheme could be adapted to recommend images in other areas.

recommender systems, neural networks, visual art

1. Introduction

The big revolution of deep convolutional neural networks (CNN) in the area of computer vision for tasks such as image classification (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; He et al., 2016a), object recognition (Akçay et al., 2016), image segmentation (Badrinarayanan et al., 2017) or scene identification (Sharif Razavian et al., 2014) has reached the area of image recommender systems in recent years (McAuley et al., 2015; He and McAuley, 2016; He et al., 2016b; Lei et al., 2016; Kang et al., 2017; Messina et al., 2018). These works use neural visual embeddings to improve the recommendation performance compared to previous approaches for image recommendation based on ratings and text (Aroyo et al., 2007), social tags (Semeraro et al., 2012), context (Benouaret and Lenne, 2015) and manually crafted visual features (van den Broek et al., 2006). Regarding application domains of recent image recommendation methods using neural visual embeddings, to the best of our knowledge most of them focus on fashion recommendation (McAuley et al., 2015; He and McAuley, 2016; Kang et al., 2017), a few on art recommendation (He et al., 2016b; Messina et al., 2018) and photo recommendation (Lei et al., 2016). He et al. (He et al., 2016b) proposed Vista, a model combining neural visual embeddings, collaborative filtering as well as temporal and social signals for digital art recommendation.

However, digital art projects can differ significantly from physical art (paintings and photographs). Messina et al. (Messina et al., 2018) study recommendation of paintings in an online art store using a simple k-NN model based on neural visual features and metadata. Although memory-based models perform fairly well, model-based methods using neural visual features report better performance (He and McAuley, 2016; He et al., 2016b) in the fashion domain, indicating room for improvement in this area, considering the growing sales in the global online artwork market.

The most popular model-based method for image recommendation using neural visual embeddings is VBPR (He and McAuley, 2016), a state-of-the-art model that integrates implicit feedback collaborative filtering with neural visual embeddings into a Bayesian Personalized Ranking (BPR) learning framework (Rendle et al., 2009). VBPR performs well, but it has some drawbacks. VBPR learns a latent embedding for each user and for each item, so new users cannot receive suggestions and new items cannot be recommended until re-training is carried out. An alternative is training a model such as YouTube’s Deep Neural Recommender (Covington et al., 2016), which can recommend to new users with little preference feedback and without additional model training. However, YouTube’s model was trained on millions of user transactions and with large amounts of profile and contextual data, so it does not transfer easily to datasets that are small, or that have little user feedback or little contextual and profile data.

In this work, we introduce a neural network for visually-aware recommendation of images focused on visual art named CuratorNet, whose general structure can be seen in Figure 1. CuratorNet leverages neural image embeddings such as those obtained from CNNs (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; He et al., 2016a) pre-trained on the ImageNet dataset (ILSVRC (Russakovsky et al., 2015)). We train CuratorNet for ranking with triplets ($P_u$, $i^+$, $i^-$), where $P_u$ is the history of image preferences of a user $u$, whereas $i^+$ and $i^-$ are a pair of images with higher and lower preference respectively. CuratorNet draws inspiration from VBPR (He and McAuley, 2016) and YouTube’s Recommender System (Covington et al., 2016). VBPR (He and McAuley, 2016) inspired us to leverage pre-trained image embeddings as well as to optimize the model for ranking as in BPR (Rendle et al., 2009). From the work of Covington et al. (Covington et al., 2016) we took the idea of designing a deep neural network that can generalize to new users without introducing new parameters or further training (unlike VBPR which needs to learn a latent user vector for each new user). As a result, CuratorNet can recommend to new users with very little feedback, and without additional training. CuratorNet’s deep neural network is trained for personalized ranking using triplets and the architecture contains a set of layers with shared weights, inspired by models using triplet loss for non-personalized image ranking (Schroff et al., 2015; Wang et al., 2014). In these works, a single image represents the input query, but in our case, the input query is a set of images representing a user preference history, $P_u$. In summary, compared to previous works, our main contributions are:

  • a novel neural-based visually-aware architecture for image recommendation,

  • a set of sampling guidelines for the creation of the training dataset (triplets), which improve the performance of CuratorNet as well as VBPR with respect to random negative sampling, and

  • a thorough evaluation of the method against competitive state-of-the-art methods (VisRank (Kang et al., 2017; Messina et al., 2018) and VBPR (He and McAuley, 2016)) on a dataset of purchases of physical art (paintings and photographs).

We also share the dataset of user transactions (with hashed user and item IDs due to privacy requirements) as well as visual embeddings of the painting image files. One aspect of this research worth highlighting is that, although the triplet sampling guidelines to build the BPR training set apply specifically to visual art, the architecture of CuratorNet can be used in other visual domains for image recommendation.

2. Related Work

In this section we provide an overview of relevant related work, considering Artwork Recommender Systems (2.1) and Visually-aware Image Recommender Systems (2.2), as well as highlights of what differentiates our work from the existing literature (2.3).

2.1. Artwork Recommender Systems

With respect to artwork recommender systems, one of the first contributions was the CHIP Project (Aroyo et al., 2007). The aim of the project was to build a recommendation system for the Rijksmuseum. The project used traditional techniques such as content-based filtering based on metadata provided by experts, as well as collaborative filtering based on users’ ratings. Another similar system but non-personalized was by Van den Broek et al. (van den Broek et al., 2006), who used color histograms to retrieve similar art images given a painting as input query.

Another important contribution is the work by Semeraro et al. (Semeraro et al., 2012), who introduced an artwork recommender system called FIRSt (Folksonomy-based Item Recommender syStem) which utilizes social tags given by experts and non-experts of over 65 paintings of the Vatican picture gallery. They did not employ visual features among their methods. Benouaret et al. (Benouaret and Lenne, 2015) improved the state-of-the-art in artwork recommender systems using context obtained through a mobile application, with the aim of making museum tour recommendations more useful. Their content-based approach used ratings given by the users during the tour and metadata from the artworks rated, e.g. title or artist names.

Finally, the most recent works use neural image embeddings (He et al., 2016b; Messina et al., 2018). He et al. (He et al., 2016b) propose the system Vista, which addresses digital artwork recommendations based on pre-trained deep neural visual features, as well as temporal and social data. On the other hand, Messina et al. (Messina et al., 2018) address the recommendation of one-of-a-kind physical paintings, comparing the performance of metadata, manually-curated visual features, and neural visual embeddings. Messina et al. (Messina et al., 2018) recommend to users by computing a simple K-NN based similarity score among users’ purchased paintings and the paintings in the dataset, a method that Kang et al. (Kang et al., 2017) call VisRank.

2.2. Visually-aware Image Recommender Systems

In this section we survey works using visual features to recommend images. We also cite a few works using visual information to recommend non-image items, but these are not too relevant for the present research.

Manually-engineered visual features extracted from images (texture, sharpness, brightness, etc.) have been used in several tasks for information filtering, such as retrieval (Rui et al., 1998; La Cascia et al., 1998; van den Broek et al., 2006) and ranking (San Pedro and Siersdorfer, 2009). More recently, interesting results have been shown for the use of low-level handcrafted stylistic visual features automatically extracted from video frames for content-based video recommendation (Deldjoo et al., 2016). Even better results are obtained when both stylistic visual features and annotated metadata are combined in a hybrid recommender, as shown in the work of Elahi et al. (Elahi et al., 2017). In a visually-aware setting not related to recommending images, Elsweiler et al. (Elsweiler et al., 2017) used manually-crafted attractiveness visual features (San Pedro and Siersdorfer, 2009) in order to recommend healthy food recipes to users.

Another branch of visually-aware image recommender systems focuses on using neural embeddings to represent images (He and McAuley, 2016; He et al., 2016b; Lei et al., 2016; Kang et al., 2017; Messina et al., 2018). The computer vision community has a long track record of successful systems based on neural networks for several tasks (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; He et al., 2016a; Akçay et al., 2016; Badrinarayanan et al., 2017; Sharif Razavian et al., 2014). This trend started with the outstanding performance of AlexNet (Krizhevsky et al., 2012) in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC (Russakovsky et al., 2015)), but the most notable implication is that neural image embeddings have shown impressive performance for transfer learning, i.e., for tasks different from the original one (Kornblith et al., 2018; del Rio et al., 2018). Usually these neural image embeddings are obtained from CNN models such as AlexNet (Krizhevsky et al., 2012), VGG (Simonyan and Zisserman, 2015) and ResNet (He et al., 2016a), among others. Motivated by these results, McAuley et al. (McAuley et al., 2015) introduced an image-based recommendation system based on styles and substitutes for clothing using visual embeddings pre-trained on a large-scale dataset obtained from Amazon.com. Later, He et al. (He and McAuley, 2016) went further in this line of research and introduced a visually-aware matrix factorization approach that incorporates visual signals (from a pre-trained CNN) into predictors of people’s opinions, called VBPR. Their training model is based on Bayesian Personalized Ranking (BPR), a model previously introduced by Rendle et al. (Rendle et al., 2009).

The next work by He et al. (He et al., 2016b) deals with visually-aware digital art recommendation, building a model called Vista which combines ratings, temporal and social signals and visual features.

Another relevant work is the research by Lei et al. (Lei et al., 2016), who introduced comparative deep learning for hybrid image recommendation. In this work, they use a siamese neural network architecture for making recommendations of images using user information (such as demographics and social tags) as well as images in pairs (one liked, one disliked) in order to build a ranking model. The approach is interesting, but they work with Flickr photos, not artwork images, and use social tags, which are not present in our problem setting. The work by Kang et al. (Kang et al., 2017) expands VBPR, but they focus on generating images using generative adversarial networks (Goodfellow et al., 2014) rather than recommending, with an application in the fashion domain. Finally, the work of Messina et al. (Messina et al., 2018) was already mentioned, but we can add that their neural image embeddings outperformed other visual (manually-extracted) and metadata features for ranking, with the exception of the metadata given by the user’s favorite artist, which predicted even better than neural embeddings for top-k recommendation.

2.3. Differences to Previous Research

Almost all the surveyed articles on artwork recommendation have in common that they used standard techniques such as collaborative filtering and content-based filtering, as well as manually-curated visual image features, but only the most recent works have exploited visual features extracted from CNNs (He et al., 2016b; Messina et al., 2018). In comparison to these works, we introduce a model-based approach (unlike the memory-based VisRank method by Messina et al. (Messina et al., 2018)) which recommends to cold-start items and users without additional model training (unlike (He et al., 2016b)). With regard to more general work on visually-aware image recommender systems, almost all of the surveyed articles have focused on tasks different from art recommendation, such as fashion recommendation (McAuley et al., 2015; He and McAuley, 2016; Kang et al., 2017), photo (Lei et al., 2016) and video recommendation (Elahi et al., 2017). Only Vista, the work by He et al. (He et al., 2016b), resembles ours in terms of the topic (art recommendation) and the use of visual features. Unlike them, we evaluate our proposed method, CuratorNet, on a dataset of physical paintings and photographs, not only digital art. Moreover, Vista uses social and temporal metadata which we do not have and which many other datasets might not have either. Compared to all this previous research, and to the best of our knowledge, CuratorNet is the first architecture for image recommendation that takes advantage of shared weights in a triplet loss setting, an idea inspired by the results of Wang et al. (Wang et al., 2014) and Schroff et al. (Schroff et al., 2015), but here adapted to the personalized image recommendation domain.

Symbol    Description
$U$, $I$    user set, item set
$u$    a specific user
$i$, $j$    a specific item (resp.)
$i^+$, $i^-$    a positive item and negative item (resp.)
$I_u^+$    set of all items for which the user $u$ has expressed a positive preference (full history)
$I_{u,\leq k}^+$    set of all items for which the user $u$ has expressed a positive preference up to their $k$-th purchase basket (inclusive)
$I_{u,k}^+$    set of all items for which the user $u$ has expressed a positive preference in their $k$-th purchase basket

Table 1. Notation for CuratorNet.

3. CuratorNet

3.1. Problem Formulation

We approach the problem of recommending art images from user positive-only feedback (e.g., purchase history, likes, etc.) upon visual items (paintings, photographs, etc.). Let $U$ and $I$ be the set of users and items in a dataset, respectively. We assume only one image per item $i \in I$. Considering either user purchases or likes, the set of items for which a user $u$ has expressed positive preference is defined as $I_u^+$. In this work, we considered purchases to be positive feedback from the user. Our goal is to generate, for each user $u \in U$, a personalized ranked list of the items for which the user has not yet expressed preference, i.e., for $I \setminus I_u^+$.

3.2. Preference Predictor

The preference predictor in CuratorNet is inspired by VBPR (He and McAuley, 2016), a state-of-the-art visual recommender model.

However, CuratorNet has some important differences. First, we do not use non-visual latent factors, so we remove the traditional user and item non-visual latent embeddings. Second, we do not learn a specific embedding per user as VBPR does; instead, we learn a joint model that, given a user’s purchase/like history, outputs a single embedding which can be used to rank unobserved artworks in the dataset, similar to YouTube’s deep learning network (Covington et al., 2016). Another important difference is that VBPR has a single matrix to project a visual item embedding into the user latent space. In CuratorNet, we instead learn a neural network to perform that projection, which receives as input either a single image embedding $e_i$ or a set of image embeddings representing a user’s purchase/like history $P_u$. Given all these aspects, the preference predictor of CuratorNet is given by:

$$x_{u,i} = \alpha + \beta_u + \Phi(e_i)^\top \Phi(P_u) \qquad (1)$$

where $\alpha$ is an offset, $\beta_u$ represents a user bias, $\Phi(\cdot)$ represents the CuratorNet neural network and $P_u$ represents the set of visual embeddings of the images in the history of user $u$. In our experiments we found no difference between using and not using an item bias term, so we dropped it in order to decrease the number of parameters (Occam’s razor).

Finally, since we calculate the model parameters using BPR (Rendle et al., 2009), the parameters $\alpha$ and $\beta_u$ cancel out (details in the coming subsection) and our final preference predictor is simply

$$x_{u,i} = \Phi(e_i)^\top \Phi(P_u) \qquad (2)$$

Figure 1. Architecture of CuratorNet showing in detail the layers with shared weights for training.


3.3. Model Learning via BPR

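We use the Bayesian Personalized Ranking (BPR) framework (Rendle et al., 2009) to learn the model parameters. Our goal is to optimize ranking by training a model which orders triples of the form $(u, i^+, i^-) \in D_S$, where $u$ denotes a user, $i^+$ an item with positive feedback from $u$, and $i^-$ an item with non-observed feedback from $u$. The training set of triples $D_S$ is defined as:

$$D_S = \{(u, i^+, i^-) \mid u \in U \wedge i^+ \in I_u^+ \wedge i^- \in I \setminus I_u^+\}$$

Table 1 shows that $I_u^+$ denotes the set of all items with positive feedback from $u$, while $I \setminus I_u^+$ denotes those items without such positive feedback. Considering our previously defined preference predictor $x_{u,i}$, we would expect a larger preference score of $u$ over $i^+$ than over $i^-$, so BPR defines the difference between scores

$$x_{u,i^+,i^-} = x_{u,i^+} - x_{u,i^-}$$

Any offset or user bias term appears in both scores and therefore cancels in this difference. BPR then aims at finding the parameters $\Theta$ which optimize the objective function

$$\arg\max_{\Theta} \sum_{(u,i^+,i^-) \in D_S} \ln \sigma(x_{u,i^+,i^-}) - \lambda_\Theta \|\Theta\|^2$$

where $\sigma(\cdot)$ is the sigmoid function, $\Theta$ includes all model parameters, and $\lambda_\Theta$ is a regularization hyperparameter.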

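In CuratorNet, unlike BPR-MF (Rendle et al., 2009) and VBPR (He and McAuley, 2016), we use a sigmoid cross-entropy loss, considering that we can interpret the decision over triplets as a binary classification problem, where $x_{u,i^+,i^-} > 0$ represents class $c = 1$ (triplet well ranked, since $x_{u,i^+} > x_{u,i^-}$) and $x_{u,i^+,i^-} \leq 0$ signifies class $c = 0$ (triplet wrongly ranked, since $x_{u,i^+} \leq x_{u,i^-}$). Then, the CuratorNet loss can be expressed as:

$$L = -\sum_{(u,i^+,i^-) \in D_S} \left[ c \ln \sigma(x_{u,i^+,i^-}) + (1 - c) \ln\left(1 - \sigma(x_{u,i^+,i^-})\right) \right] + \lambda_\Theta \|\Theta\|^2$$

where $c \in \{0, 1\}$ is the class, $\Theta$ includes all model parameters, $\lambda_\Theta$ is a regularization hyperparameter, and $\sigma(x_{u,i^+,i^-})$ is the probability that a user $u$ really prefers $i^+$ over $i^-$, $P(i^+ >_u i^- \mid \Theta)$ (Rendle et al., 2009), calculated with the sigmoid function, i.e.,

$$\sigma(x_{u,i^+,i^-}) = \frac{1}{1 + e^{-x_{u,i^+,i^-}}}$$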
We perform the optimization to learn the parameters $\Theta$ which reduce the loss function $L$ by stochastic gradient descent with the Adam optimizer (Kingma and Ba, 2015), using the implementation in TensorFlow. During each iteration of stochastic gradient descent, we sample a user $u$, a positive item $i^+ \in I_u^+$ (i.e., removed from $P_u$), a negative item $i^- \in I \setminus I_u^+$, and the user purchase/like history with item $i^+$ removed, i.e., $P_u \setminus \{i^+\}$.
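As a minimal sketch of the loss described in this subsection, the plain-Python functions below compute the sigmoid cross-entropy for one triplet and for a small batch. The function names, the `eps` guard and the default `lam` value are illustrative assumptions and not the authors' TensorFlow implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def triplet_loss(score_pos, score_neg, c=1, eps=1e-12):
    """Sigmoid cross-entropy for one triplet.

    score_pos and score_neg are the predicted preferences for the positive
    and negative item; c=1 means the positive item should be ranked higher.
    eps (an assumption of this sketch) only guards against log(0).
    """
    p = sigmoid(score_pos - score_neg)
    return -(c * math.log(p + eps) + (1 - c) * math.log(1.0 - p + eps))


def batch_loss(triplets, params, lam=1e-4):
    """Sum of per-triplet losses plus an L2 penalty on a flat parameter list."""
    data_term = sum(triplet_loss(sp, sn, c) for sp, sn, c in triplets)
    return data_term + lam * sum(w * w for w in params)
```

With `c=1`, the per-triplet term reduces to the negative log-sigmoid of the score difference, which is the negated BPR criterion, so the two formulations agree on well-formed training triplets.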

3.4. Model Architecture

The architecture of the CuratorNet neural network is shown in detail in Figure 1. For training, each input instance is expected to be a triple ($P_u$, $i^+$, $i^-$), where $P_u$ is the set of images in the history of user $u$ (purchases, likes) with a single item removed from the set, $i^+$ is an item with positive preference, and $i^-$ is an item with assumed negative user preference. The negative user preference is assumed since the item $i^-$ is sampled from the list of images which $u$ has not interacted with yet. Each image ($i^+$, $i^-$ and all images in $P_u$) goes through a ResNet (He et al., 2016a) (pre-trained with ImageNet data), which outputs a visual image embedding in $\mathbb{R}^{2048}$. ResNet weights are fixed during CuratorNet’s training. Then, the network has two layers with scaled exponential linear units (hereinafter, SELU (Klambauer et al., 2017)), with 200 neurons each, which reduce the dimensionality of each image embedding. Notice that these two layers work similarly to a siamese (Chopra et al., 2005) or triplet loss architecture (Wang et al., 2014; Schroff et al., 2015), i.e., they have shared weights. Each image is represented at the output of this section of the network by a vector in $\mathbb{R}^{200}$. Then, for the images in $P_u$, their embeddings are both averaged (average pooling (Boureau et al., 2010)) and max-pooled per dimension (max pooling (Boureau et al., 2010)), and then concatenated into a resultant vector in $\mathbb{R}^{400}$. Finally, three consecutive SELU layers of 300, 200, and 200 neurons respectively produce an output representation for $P_u$ in $\mathbb{R}^{200}$. The final part of the network is a ranking layer which evaluates a loss such that $x_{u,i^+} > x_{u,i^-}$, where replacing in Equation (2), we have $\Phi(P_u)^\top \Phi(e_{i^+}) > \Phi(P_u)^\top \Phi(e_{i^-})$. There are several options for the loss function, but due to the good results of the cross-entropy loss in similar architectures with shared weights (Koch et al., 2015), and because alternatives such as the hinge loss require optimizing an additional margin parameter $m$, we chose the sigmoid cross-entropy for CuratorNet.

Notice that in this article we used a pre-trained ResNet (He et al., 2016a) to obtain the image visual features, but the model could use other CNNs such as AlexNet (Krizhevsky et al., 2012), VGG (Simonyan and Zisserman, 2015), etc. We chose ResNet since it has performed the best in transfer learning tasks (Kornblith et al., 2018; del Rio et al., 2018).
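The forward pass described in this subsection can be sketched in plain Python as follows. This is an illustration under stated assumptions: weights are random rather than trained, the helper names (`build_model`, `embed_item`, `embed_user`, `score`) are hypothetical, and the layer sizes are parameters so that the example runs quickly with small dimensions. The sizes stated in the text would correspond to `build_model(2048, 200, (300, 200, 200))`.

```python
import math
import random

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit (Klambauer et al., 2017)."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    std = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense_selu(layer, x):
    """One fully connected layer followed by SELU."""
    weights, biases = layer
    return [selu(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def apply_layers(layers, x):
    for layer in layers:
        x = dense_selu(layer, x)
    return x


def build_model(d_in, d_hidden, profile_sizes, seed=0):
    """Two shared item layers, then the layers applied to the pooled profile."""
    rng = random.Random(seed)
    shared = [init_layer(d_in, d_hidden, rng), init_layer(d_hidden, d_hidden, rng)]
    profile, n_in = [], 2 * d_hidden  # average and max pooling are concatenated
    for n_out in profile_sizes:
        profile.append(init_layer(n_in, n_out, rng))
        n_in = n_out
    return {"shared": shared, "profile": profile}


def embed_item(model, e_i):
    """Item vector: the shared layers applied to one visual embedding."""
    return apply_layers(model["shared"], e_i)


def embed_user(model, history):
    """User vector: shared layers per image, avg/max pooling, profile layers."""
    items = [embed_item(model, e) for e in history]
    avg = [sum(col) / len(items) for col in zip(*items)]
    mx = [max(col) for col in zip(*items)]
    return apply_layers(model["profile"], avg + mx)


def score(model, history, e_i):
    """Preference score: dot product of the user vector and the item vector."""
    user_vec = embed_user(model, history)
    item_vec = embed_item(model, e_i)
    return sum(a * b for a, b in zip(user_vec, item_vec))
```

Because the user vector is computed from the history alone, scoring a user who was never seen during training requires no new parameters, and the pooling step makes the score independent of the order of the history.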

3.5. Data Sampling for Training

The original BPR article (Rendle et al., 2009) suggests creating training triples simply by, given a user $u$, randomly sampling a positive element among those consumed, as well as sampling a negative feedback element among those not consumed. However, subsequent research has shown that there are more effective ways to create these training triples (Ding et al., 2018). In our case, we define some guidelines to sample triples for the training set based on analyses from previous studies indicating features which provide signals of user preference. For instance, Messina et al. (Messina et al., 2018) showed that people are very likely to buy several artworks with similar visual themes, as well as from the same artist, so we used visual clusters and the user’s favorite artist to set some of these sampling guidelines.

Creating Visual Clusters. Some of the sampling guidelines are based on visual similarity of the items, and although we have some metadata for the images in the dataset, there is a significant number of missing values: only 45% of the images have information about subject (e.g., architecture, nature, travel) and 53% about style (e.g., abstract, surrealism, pop art). For this reason, we cluster the images based on their visual representation, in such a way that items with visual embeddings that are too similar will not be used to sample positive/negative pairs ($i^+$, $i^-$). To obtain these visual clusters, we used the following procedure: (i) conduct a Principal Component Analysis to reduce the dimensionality of the image embedding vectors from $\mathbb{R}^{2048}$ to a lower-dimensional space, and (ii) perform k-means clustering with 100 clusters. We ran k-means clustering 20 times, calculating each time the Silhouette coefficient (Rousseeuw, 1987) (an intrinsic metric of clustering quality), and kept the clustering with the highest Silhouette value. Finally, (iii) we assign each image the label of its respective visual cluster. Samples of our clusters in a 2-dimensional projection map of images, built with the UMAP method (McInnes et al., 2018), can be seen in Figure 2.
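The restart-and-select part of this procedure can be sketched as below: run k-means several times and keep the labeling with the highest Silhouette coefficient. This is a dependency-free illustration that assumes the vectors have already been reduced by PCA (that step is omitted), uses a quadratic-time Silhouette that is only practical for small inputs, and introduces hypothetical function names. It is not the code used in the experiments.

```python
import math
import random


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean Silhouette coefficient (Rousseeuw, 1987)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist(p, points[j]) for j in idx) / len(idx)
                for lab, idx in clusters.items() if lab != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_clustering(points, k, restarts=20, seed=0):
    """Run k-means `restarts` times and keep the labeling with the best Silhouette."""
    rng = random.Random(seed)
    best_score, best_labels = -2.0, None
    for _ in range(restarts):
        labels = kmeans(points, k, rng)
        s = silhouette(points, labels)
        if s > best_score:
            best_score, best_labels = s, labels
    return best_labels, best_score
```

In the setting described above, `k` would be 100 and `restarts` 20; at that scale a library implementation of k-means and of the Silhouette coefficient would be the practical choice.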

Figure 2. Examples of visual clusters automatically generated to sample triples for the training set.

Guidelines for sampling triples. We generate the training set as the union of multiple disjoint training sets, each one generated with a different strategy in mind. These strategies and their corresponding training sets are:

  1. Removing an item from a purchase basket and predicting this missing item.

  2. Sorting items purchased sequentially and predicting the next purchase in the basket.

  3. Recommending visually similar artworks from the favorite artists of a user.

  4. Recommending profile items from the same user profile.

  5. Creating an artificial user profile from a single purchased item and recommending profile items given this artificially created user profile.

  6. Creating an artificial profile with a single item and recommending visually similar items from the same artist.

Finally, the training set $D_S$ is formally defined as:

$$D_S = \bigcup_{k=1}^{6} D_k$$

In practice, we uniformly sample about 10 million training triples, distributed uniformly among the six training sets $D_k$. Likewise, we sample about 300,000 validation triples. To avoid sampling identical triples, we hash them and compare the hashes to check for potential collisions. Before sampling the training and validation sets, we hide the last purchase basket of each user, using them later on for testing.
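The mechanics of this paragraph (hiding the last basket, then sampling triples while rejecting duplicates through their hashes) can be sketched as follows. The sketch is a simplification under stated assumptions: it implements only a basic "remove one item from the profile and predict it" strategy with uniformly random negatives, not the six guidelines, and the function names and data layout (a dict from user to a list of baskets) are hypothetical.

```python
import random


def split_last_basket(baskets_by_user):
    """Hide each user's last basket for testing; keep earlier baskets for training."""
    train, test = {}, {}
    for user, baskets in baskets_by_user.items():
        if len(baskets) >= 2:
            train[user], test[user] = baskets[:-1], baskets[-1]
        else:
            train[user] = baskets
    return train, test


def sample_triples(train, all_items, n_triples, seed=0, max_tries=100000):
    """Sample unique (profile, positive, negative) triples.

    The positive item is removed from the profile, and the negative item is
    drawn uniformly from items the user has not consumed in the training data.
    """
    rng = random.Random(seed)
    # A user needs at least two items so the profile is not empty.
    users = [u for u, b in train.items() if sum(len(x) for x in b) >= 2]
    seen, triples = set(), []
    for _ in range(max_tries):
        if len(triples) == n_triples:
            break
        u = rng.choice(users)
        consumed = sorted({i for basket in train[u] for i in basket})
        pos = rng.choice(consumed)
        neg = rng.choice([i for i in all_items if i not in consumed])
        profile = tuple(i for i in consumed if i != pos)
        key = hash((profile, pos, neg))  # compare hashes to skip repeated triples
        if key in seen:
            continue
        seen.add(key)
        triples.append((profile, pos, neg))
    return triples
```

Each guideline-specific training set could be produced by a sampler of this shape with its own rule for choosing the profile, the positive item and the negative item.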

4. Experiments

4.1. Datasets

For our experiments we used a dataset where the user preference is in the form of purchases of physical art (paintings and photographs). This private dataset was collected and shared by an online art store. The dataset consists of users, items (paintings and photographs) and purchases. On average, each user bought 2-3 items. One important aspect of this dataset is that paintings are one-of-a-kind, i.e., there is a single instance of each item and once it is purchased, it is removed from the inventory. Since most of the items in the dataset are one-of-a-kind paintings (78%) and most purchase transactions have been made over these items (81.7%), a method relying on collaborative filtering might suffer in performance, since user co-purchases are only possible on photographs. Another notable aspect of the dataset is that each item has a single creator (artist). In this dataset there are 573 artists, who have uploaded 10.54 items on average to the online art store.

The dataset with transaction tuples (user, item), as well as the tuples used for testing (the last purchase of each user with at least two purchases), are available for replicating our results as well as for training other models. Due to copyright restrictions we cannot share the original image files, but we share the embeddings of the images obtained with ResNet50 (He et al., 2016a).

4.2. Evaluation Methodology

In order to build and test the models, we split the data into train, validation and test sets. To make sure that we could make recommendations for all cases in the test set, and thus make a fair comparison among recommendation methods, we check that every user considered in the test set was also present in the training set. All baseline methods were trained on the training set with hyperparameters tuned with the validation set.

Method      $\lambda$ (L2 Reg.)  AUC     R@20    P@20    nDCG@20  R@100   P@100   nDCG@100
Oracle      -                    1.0000  1.0000  .0655   1.0000   1.0000  .0131   1.0000
CuratorNet  .0001                .7204   .1683   .0106   .0966    .3200   .0040   .1246
CuratorNet  .001                 .7177   .1566   .0094   .0895    .2937   .0037   .1160
VisRank     -                    .7151   .1521   .0093   .0956    .2765   .0034   .1195
CuratorNet  0                    .7131   .1689   .0100   .0977    .3048   .0038   .1239
CuratorNet  .01                  .7125   .1235   .0075   .0635    .2548   .0032   .0904
VBPR        .0001                .6641   .1368   .0081   .0728    .2399   .0030   .0923
VBPR        0                    .6543   .1287   .0078   .0670    .2077   .0026   .0829
VBPR        .001                 .6410   .0830   .0047   .0387    .1948   .0024   .0620
VBPR        .01                  .5489   .0101   .0005   .0039    .0506   .0006   .0118
Random      -                    .4973   .0103   .0006   .0041    .0322   .0005   .0098
Table 2. Results for all methods, sorted by AUC performance. The top five results are highlighted for each metric. For reference, the bottom row presents a random recommender, while the top row presents results of a perfect Oracle.

Next, the trained models are used to report performance over different metrics on the test set. The test set consists of the last transaction from every user that purchased at least twice; the rest of their previous purchases are used for training and validation.

Metrics. To measure the results we used several metrics: AUC (also used in (He and McAuley, 2016; He et al., 2016b; Kang et al., 2017)), normalized discounted cumulative gain (nDCG@k) (Järvelin and Kekäläinen, 2002), as well as Precision@k and Recall@k (Cremonesi et al., 2010). Although it might seem counter-intuitive, we calculate these metrics for a low value of k (k=20) as well as a high value of k (k=100). Most research on top-k recommendation systems focuses on the very top of the recommendation list (k=5, 10, 20). However, Valcarce et al. (Valcarce et al., 2018) showed that top-k ranking metrics measured at higher values of k (k=100, 200) are especially robust to biases such as sparsity and popularity biases. The sparsity bias refers to the lack of known relevance for all the user-item pairs, while the popularity bias is the tendency of popular items to receive more user feedback, so missing user-item pairs are not missing at random. We are especially interested in preventing popularity bias since we want to recommend not only from the artists that each user is commonly purchasing from. We aim at promoting novelty as well as discovery of relevant art from newcomer artists.
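For concreteness, the four metrics can be computed per user as in the sketch below, given a ranked list of candidate items and the set of relevant (held-out) items. These are common textbook definitions with binary relevance; the exact variants used in the experiments are not specified in the text, so the formulas and function names here are assumptions.

```python
import math


def auc(ranked, relevant):
    """Fraction of (relevant, non-relevant) pairs ranked in the right order."""
    relevant = set(relevant)
    n_rel = sum(1 for i in ranked if i in relevant)
    n_non = len(ranked) - n_rel
    if n_rel == 0 or n_non == 0:
        return float("nan")
    correct, rel_seen = 0, 0
    for item in ranked:
        if item in relevant:
            rel_seen += 1
        else:
            correct += rel_seen  # every relevant item above this one is a correct pair
    return correct / (n_rel * n_non)


def precision_at_k(ranked, relevant, k):
    relevant = set(relevant)
    return sum(1 for i in ranked[:k] if i in relevant) / k


def recall_at_k(ranked, relevant, k):
    relevant = set(relevant)
    return sum(1 for i in ranked[:k] if i in relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG with a log2 position discount."""
    relevant = set(relevant)
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, i in enumerate(ranked[:k]) if i in relevant)
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal
```

Under this definition precision is divided by k, so it stays small whenever a user has only a few relevant items. That would be consistent with the Oracle row of Table 2, where P@20 is .0655 even though the ranking is perfect.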

4.3. Baselines

The methods used in the evaluation are the following:

  1. CuratorNet: The method described in this paper. We test it with four regularization values for $\lambda$: 0, .0001, .001 and .01.

  2. VBPR (He and McAuley, 2016): The state-of-the-art. We used the same embedding size as in CuratorNet (200), we optimized it until convergence on the training set and we also tested the same four regularization values for $\lambda$.

  3. VisRank (Kang et al., 2017; Messina et al., 2018): This is a simple memory-based content filtering method that ranks a candidate painting $i$ for a user $u$ based on the maximum cosine similarity with some existing item in the user profile, i.e., $\mathrm{score}(u, i) = \max_{j \in I_u^+} \cos(e_i, e_j)$, where $e_i$ is the visual embedding of item $i$.
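The VisRank baseline is simple enough to sketch directly: score each candidate by its maximum cosine similarity to any item in the user profile, then sort. The function names and the dictionary-based catalog layout are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def visrank_score(profile_embeddings, candidate):
    """Maximum cosine similarity between the candidate and any profile item."""
    return max(cosine(candidate, e) for e in profile_embeddings)


def visrank_recommend(profile, catalog, k):
    """Rank unseen items for a user.

    profile: ids of items the user has consumed.
    catalog: dict mapping item id to its visual embedding.
    """
    profile_embs = [catalog[i] for i in profile]
    scored = [(visrank_score(profile_embs, emb), item)
              for item, emb in catalog.items() if item not in profile]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]
```

Since the method has no trainable parameters, it handles new users and items immediately, but every recommendation requires comparing each candidate against the whole profile.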

Figure 3. The sampling guidelines had a positive effect on AUC compared to random negative sampling for building the BPR training set.

5. Results and Discussion

In Table 2, we can see the results comparing all methods. As reference, in the top row we present an oracle (perfect ranking), and in the bottom row a random recommender. Notice that the AUC of a random recommender should theoretically be 0.5 (sorting pairs of items given a user), so the AUC of the random baseline serves as a sanity check. In terms of AUC, Recall@100, and Precision@100, CuratorNet with a small regularization ($\lambda = 0.0001$) is the top model. We highlight the following points from these results:

  • CuratorNet, with a small regularization ($\lambda = 0.0001$), outperforms the other methods in five metrics (AUC, Precision@20, Recall@100, Precision@100 and nDCG@100), while it stands second in Recall@20 and nDCG@20 behind the non-regularized version of CuratorNet. This implies that CuratorNet overall ranks very well at top positions, and is especially robust against sparsity and popularity bias (Valcarce et al., 2018). In addition, CuratorNet seems robust to changes in the regularization hyperparameter.

  • Compared to VBPR, CuratorNet is better in all seven metrics (AUC, Recall@20, Precision@20, nDCG@20, Recall@100, Precision@100 and nDCG@100). Notably, it is also more robust to the regularization hyperparameter than VBPR. We think that this is explained in part by the characteristics of the dataset: VBPR exploits non-visual co-occurrence patterns, but in our dataset this signal provides rather little preference information, since almost 80% of the items and transactions are one-of-a-kind.

  • VisRank presents very competitive results, especially in terms of AUC, nDCG@20 and nDCG@100, performing better than VBPR in this dataset dominated by one-of-a-kind items. However, CuratorNet performs better than VisRank in all metrics. This provides evidence that the model-based approach of CuratorNet, which aggregates user preferences into a single embedding, is a better approach than the heuristic-based scoring of VisRank.
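The sanity check on the random recommender mentioned above (AUC close to 0.5) can be reproduced with a small simulation. This is a sketch under assumptions: per-user AUC is taken to be the fraction of (held-out positive, non-consumed item) pairs that the scores order correctly, which is the usual definition in the BPR literature, and the catalog size, number of users and number of positives per user are arbitrary values chosen for illustration.

```python
import random


def user_auc(scores, positives, negatives):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    correct = sum(
        1 for p in positives for n in negatives if scores[p] > scores[n]
    )
    return correct / (len(positives) * len(negatives))


def mean_auc(per_user_aucs):
    """Average the per-user AUC values."""
    return sum(per_user_aucs) / len(per_user_aucs)


rng = random.Random(0)   # fixed seed so the simulation is repeatable
catalog = list(range(200))  # hypothetical catalog of 200 items
per_user = []
for _ in range(500):     # 500 simulated users
    # A random recommender assigns an arbitrary score to every item.
    scores = {item: rng.random() for item in catalog}
    positives = set(rng.sample(catalog, 5))
    negatives = [item for item in catalog if item not in positives]
    per_user.append(user_auc(scores, positives, negatives))

random_auc = mean_auc(per_user)  # expected to be close to 0.5
```

A model whose AUC on the same protocol is not clearly above this value is not ranking better than chance.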

5.1. Effect of Sampling Guidelines

We studied the effect of using our sampling guidelines for building the training set, compared to the traditional BPR setting in which negative samples are drawn uniformly at random from the set of items not observed by the user. In the case of CuratorNet we use all six sampling guidelines, while in VBPR we use only two of them, since VBPR has no notion of sessions or purchase baskets in its original formulation, and it has more parameters than CuratorNet to model collaborative non-visual latent preferences. We tested AUC for both CuratorNet and VBPR, each under its best-performing regularization parameter, with and without our sampling guidelines. Notice that all results in Table 2 use our sampling guidelines. Pairwise t-tests showed a significant improvement for both CuratorNet and VBPR, as shown in Figure 3: CuratorNet with sampling guidelines significantly improved in AUC over CuratorNet with random negative sampling, and likewise VBPR with guidelines significantly improved over VBPR with random sampling. With this result, we conclude that the proposed sampling guidelines help in selecting better triplets for more effective learning in our art image recommendation setting.
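For reference, the traditional BPR setting used as the comparison point (negatives drawn uniformly at random from the items a user has not consumed) can be sketched as below. This illustrates only the random baseline, not the six sampling guidelines proposed in the paper. The function name, the data layout and the use of a set to drop duplicate triples are assumptions made for this example.

```python
import random


def sample_random_triples(consumed, all_items, n_triples, seed=0):
    """Sample distinct (user, positive, negative) triples for BPR training.

    consumed: dict mapping each user to the set of items they consumed
              (assumed to be a subset of all_items).
    Negatives are drawn uniformly from the whole catalog and rejected
    if the user has consumed them.
    """
    rng = random.Random(seed)
    users = sorted(consumed)
    catalog = sorted(all_items)
    # Upper bound on distinct triples, to avoid looping forever.
    max_triples = sum(
        len(items) * len(set(catalog) - items) for items in consumed.values()
    )
    if n_triples > max_triples:
        raise ValueError("more triples requested than can exist")
    triples = set()  # a set silently drops duplicate triples
    while len(triples) < n_triples:
        user = rng.choice(users)
        positive = rng.choice(sorted(consumed[user]))
        negative = rng.choice(catalog)
        if negative in consumed[user]:
            continue  # reject: the item is not a negative for this user
        triples.add((user, positive, negative))
    return sorted(triples)


# Hypothetical toy data: two users and a five-item catalog.
purchases = {"u1": {"a", "b"}, "u2": {"c"}}
catalog_items = {"a", "b", "c", "d", "e"}
training_triples = sample_random_triples(purchases, catalog_items, n_triples=6)
```

Each triple states that the user is assumed to prefer the positive item over the negative one, which is the pairwise signal that BPR-style losses optimize.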

6. Conclusion

In this article we have introduced CuratorNet, an art image recommender system based on neural networks. The learning model of CuratorNet is inspired by VBPR (He and McAuley, 2016), but it incorporates additional aspects such as layers with shared weights, and it works especially well with one-of-a-kind items, i.e., items that disappear from the inventory once consumed, which makes it difficult to use traditional collaborative filtering. An important contribution of this article is the shared dataset, since we could not find online any other dataset of user transactions over physical paintings. We have anonymized the user and item IDs and provided ResNet visual embeddings to help other researchers build and validate models with these data.

Our model outperforms state-of-the-art VBPR as well as other simple but strong baselines such as VisRank (Kang et al., 2017; Messina et al., 2018). We also introduce a series of guidelines for sampling triples for the BPR training set, and we show significant improvements in performance of both CuratorNet and VBPR versus traditional random sampling for negative instances.

Future Work. Among our ideas for future work, we will test our neural architecture using end-to-end learning, in a similar fashion to Kang et al. (Kang et al., 2017), who used a light model called CNN-F to replace the pre-trained AlexNet visual embeddings. Another idea we will test is to create explanations for our recommendations based on low-level (textures) and high-level (objects) visual features, which recent research is able to identify from CNNs, such as the Network Dissection approach by Bau et al. (Bau et al., 2017). We will also explore ideas from the research on image style transfer (Ghiasi et al., 2017; Gatys et al., 2016), which might help us identify styles and then use this information as context to produce style-aware recommendations. Another interesting idea for future work is integrating multitask learning into our framework, as in the recently published paper on the newest YouTube recommender (Zhao et al., 2019). Finally, from a methodological point of view, we will test other datasets with likes rather than purchases, since we aim at understanding how the model behaves under a different type of user relevance feedback.

This work has been supported by the Millennium Institute for Foundational Research on Data (IMFD) and by the Chilean research agency ANID, FONDECYT grant 1191791.


  2. conference: Workshop on Recommendation in Complex Scenarios at the ACM RecSys Conference on Recommender Systems (RecSys 2020); 22 September 2020; Rio de Janeiro, Brazil
  6. ccs: Information systems Recommender systems
  7. ccs: Computing methodologies Machine learning approaches
  8. ccs: Applied computing Media arts
  11. A reference CuratorNet implementation may be found at
  12. Theoretically, these training sets are not perfectly disjoint, but in practice we hash all training triples and make sure no two training triples have the same hash. This prevents duplicates from being added to the final training set.


  1. Transfer learning using convolutional neural networks for object classification within x-ray baggage security imagery. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pp. 1057–1061. Cited by: §1, §2.2.
  2. Personalized museum experience: the rijksmuseum use case. In Proceedings of Museums and the Web, Cited by: §1, §2.1.
  3. Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39 (12), pp. 2481–2495. Cited by: §1, §2.2.
  4. Network dissection: quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6541–6549. Cited by: §6.
  5. Personalizing the museum experience through context-aware recommendations. In 2015 IEEE International Conference on Systems, Man, and Cybernetics, pp. 743–748. Cited by: §1, §2.1.
  6. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 111–118. Cited by: §3.4.
  7. Learning a similarity metric discriminatively, with application to face verification. In CVPR (1), pp. 539–546. Cited by: §3.4.
  8. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191–198. Cited by: §1, §1, §3.2.
  9. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems, pp. 39–46. Cited by: §4.2.
  10. Do better imagenet models transfer better… for image recommendation?. In 2nd workshop on Intelligent Recommender Systems by Knowledge Transfer and Learning, External Links: Link Cited by: §2.2, §3.4.
  11. Content-based video recommendation system based on stylistic visual features. Journal on Data Semantics 5 (2), pp. 99–113. Cited by: §2.2.
  12. An improved sampler for bayesian personalized ranking by leveraging view data. In Companion of the The Web Conference 2018 on The Web Conference 2018, pp. 13–14. Cited by: §3.5.
  13. Exploring the semantic gap for movie recommendations. In Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys ’17, pp. 326–330. Cited by: §2.2, §2.3.
  14. Exploiting food choice biases for healthier recipe recommendation. In Proceedings of the 40th international acm sigir conference on research and development in information retrieval, pp. 575–584. Cited by: §2.2.
  15. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2414–2423. Cited by: §6.
  16. Exploring the structure of a real-time, arbitrary neural artistic stylization network. arXiv preprint arXiv:1705.06830. Cited by: §6.
  17. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.2.
  18. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §1, §2.2, §3.4, §3.4, §4.1.
  19. Vista: a visually, socially, and temporally-aware model for artistic recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16, pp. 309–316. Cited by: §1, §1, §2.1, §2.2, §2.2, §2.3, §4.2.
  20. VBPR: Visual Bayesian Personalized Ranking from implicit feedback. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 144–150. Cited by: 3rd item, §1, §1, §1, §1, §2.2, §2.3, §3.2, §3.3, item 2, §4.2, §6.
  21. Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems (TOIS) 20 (4), pp. 422–446. Cited by: §4.2.
  22. Visually-aware fashion recommendation and design with generative image models. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 207–216. Cited by: 3rd item, §1, §2.1, §2.2, §2.2, §2.3, item 3, §4.2, §6, §6.
  23. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §3.3.
  24. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pp. 971–980. Cited by: §3.4.
  25. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. Cited by: §3.4.
  26. Do better imagenet models transfer better?. arXiv preprint arXiv:1805.08974. Cited by: §2.2, §3.4.
  27. Imagenet classification with deep convolutional neural networks. In Proceedings of Advances in neural information processing systems 25 (NIPS), pp. 1097–1105. Cited by: §1, §1, §2.2, §3.4.
  28. Combining textual and visual cues for content-based image retrieval on the world wide web. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries, pp. 24–28. Cited by: §2.2.
  29. Comparative deep learning of hybrid representations for image recommendations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2545–2553. Cited by: §1, §2.2, §2.2, §2.3.
  30. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52. Cited by: §1, §2.2, §2.3.
  31. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints. External Links: 1802.03426 Cited by: §3.5.
  32. Content-based artwork recommendation: integrating painting metadata with neural and manually-engineered visual features. User Modeling and User-Adapted Interaction, pp. 1–40. Cited by: 3rd item, §1, §1, §2.1, §2.2, §2.2, §2.3, §3.5, item 3, §6.
  33. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pp. 452–461. Cited by: §1, §1, §2.2, §3.2, §3.3, §3.3, §3.3, §3.5.
  34. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 20, pp. 53–65. Cited by: §3.5.
  35. Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Transactions on circuits and systems for video technology 8 (5), pp. 644–655. Cited by: §2.2.
  36. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §1, §2.2.
  37. Ranking and classifying attractiveness of photos in folksonomies. In Proceedings of the 18th International Conference on World Wide Web, WWW ’09, pp. 771–780. Cited by: §2.2.
  38. Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §1, §2.3, §3.4.
  39. A folksonomy-based recommender system for personalized access to digital artworks. Journal on Computing and Cultural Heritage (JOCCH) 5 (3), pp. 11. Cited by: §1, §2.1.
  40. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813. Cited by: §1, §2.2.
  41. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §1, §1, §2.2, §3.4.
  42. On the robustness and discriminative power of information retrieval metrics for top-n recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 260–268. Cited by: §4.2, 1st item.
  43. Multimedia for art retrieval (m4art). In Multimedia Content Analysis, Management, and Retrieval 2006, Vol. 6073, pp. 60730Z. Cited by: §1, §2.1, §2.2.
  44. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1386–1393. Cited by: §1, §2.3, §3.4.
  45. Recommending what video to watch next: a multitask ranking system. In Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19, New York, NY, USA, pp. 43–51. External Links: ISBN 978-1-4503-6243-6, Link, Document Cited by: §6.