Comparing Neural and Attractiveness-based Visual Features for Artwork Recommendation
Advances in image processing and computer vision in recent years have brought about the use of visual features in artwork recommendation. Recent works have shown that visual features obtained from pre-trained deep neural networks (DNNs) perform extremely well for recommending digital art. Other recent works have shown that explicit visual features (EVF) based on attractiveness can perform well in preference prediction tasks, but no previous work has compared DNN features against specific attractiveness-based visual features (e.g. brightness, texture) in terms of recommendation performance. In this work, we study and compare the performance of DNN and EVF features for the purpose of physical artwork recommendation, using transactional data from UGallery, an online store of physical paintings. In addition, we perform an exploratory analysis to understand whether DNN embedded features have some relation with certain EVF. Our results show that DNN features outperform EVF, that certain EVF are better suited for physical artwork recommendation and, finally, that certain neurons in the DNN might be partially encoding visual features such as brightness, providing an opportunity for explaining recommendations based on visual neural models.
1. Introduction
In the last five years, the area of computer vision has been revolutionized by deep neural networks (DNN): visual neural embeddings from pre-trained convolutional neural networks have improved performance by orders of magnitude on tasks such as image classification (Krizhevsky et al., 2012) and scene identification (Sharif Razavian et al., 2014). In the area of recommender systems, a few works have exploited neural visual embeddings for recommendation (McAuley et al., 2015; He et al., 2016; He and McAuley, 2016; Deldjoo et al., 2017). Among these, we are particularly interested in the use of visual features for recommending art (He et al., 2016). The online artwork market is booming due to the influence of social media and the new consumption behaviors of millennials (Weinswig, 2016), yet the main works on art recommendation date back more than 8 years (Aroyo et al., 2007) and did not utilize visual features for recommendation.
In a recent work, He et al. (He et al., 2016) introduced a recommender method for digital art that employs ratings, social features and pre-trained visual DNN embeddings, with very good results. However, they did not use explicit visual features (EVF) such as brightness, color or texture. Using only latent embeddings (from DNNs) hurts model transparency and the explainability of recommendations, which can in turn hinder user acceptance of the suggestions (Verbert et al., 2013; Konstan and Riedl, 2012). In addition, their work focused on digital art, whereas in the present work we are interested in recommending physical artworks (paintings). In a more recent work (Messina et al., 2017), we compared the performance of visual DNN features against art metadata and explicit visual features (EVF: colorfulness, brightness, etc.) for recommendation of physical artworks, and showed that visual features (DNN and EVF) outperform manually-curated metadata. However, we did not analyze which specific visual features are most important for the artwork recommendation task. Furthermore, although previous research has studied what is being encoded by neurons in a DNN (Nguyen et al., 2016; Zeiler and Fergus, 2014), to the best of our knowledge no previous work has investigated a link between latent DNN visual features and attractiveness visual features such as those investigated by San Pedro et al. (San Pedro and Siersdorfer, 2009) (colorfulness, brightness, etc.). Understanding what visual neural models are encoding could help the transparency and explainability of recommender systems (Verbert et al., 2013).
In this work we compare the performance of 8 explicit visual features (EVF) for the task of recommending physical paintings from an online store, UGallery (http://www.UGallery.com). Our results indicate that the combination of these features offers the best performance, and also that features like entropy and contrast contribute more than features like colorfulness to the recommendation task. Moreover, an exploratory analysis provides evidence that certain latent features from the DNN visual embedding might be partially related to explicit visual features. This result could eventually be utilized for explaining recommendations made with black-box neural visual models.
2. Related Work
Here we survey some works using visual features obtained from pre-trained deep neural networks for recommendation tasks. McAuley et al. (McAuley et al., 2015) introduced an image-based recommendation system based on styles and substitutes for clothing using visual embeddings pre-trained on a large-scale dataset obtained from Amazon.com. Recently, He et al. (He and McAuley, 2016) went further in this line of research and introduced a visually-aware matrix factorization approach that incorporates visual signals (from a pre-trained DNN) into predictors of people’s opinions. The latest work by He et al. (He et al., 2016) deals with visually-aware artistic recommendation, building a model which combines ratings, social signals and visual features. Deldjoo et al. (Deldjoo et al., 2017) compared visual DNN and explicit (stylistic) visual features for movie recommendation, and they found the explicit visual features more informative than DNN features for their recommendation task.
Unlike these previous works, we compare in this article neural and specific explicit visual features (such as brightness or contrast) for physical painting recommendation, and we also explore the potential relation between these two types of features (DNN vs. EVF features).
3. Problem Description & Dataset
The online web store UGallery supports emergent artists by helping them sell their original paintings online. In this work, we study content-based top-n recommendation based on sets of visual features extracted directly from images. To perform personalized recommendations, we build a user profile from the paintings a user has already bought, and based on this model we attempt to predict the next paintings the user will buy. We focus on comparing different sets of visual features in order to understand which ones provide better recommendations.
Dataset. UGallery shared with us an anonymized dataset of users, paintings (with their respective images) and purchases (transactions) of paintings, where all users have made at least one transaction. On average, each user has bought 2-3 items in recent years (our collaborators at UGallery requested that we not disclose the exact dates when the data was collected).
4. Features & Recommendation Method
Since we are using a content-based approach to produce recommendations, we first describe the visual features extracted from images, and then we describe how we used them to make recommendations.
Visual Features. For each image representing a painting in the dataset, we obtain features from a pre-trained AlexNet DNN (Krizhevsky et al., 2012), from which we take the fc6 layer, a vector of 4,096 dimensions. This network was trained with Caffe (Jia et al., 2014) using the ImageNet ILSVRC 2012 dataset (Krizhevsky et al., 2012). We also tested a pre-trained VGG16 (Simonyan and Zisserman, 2014) model, but the results were not significantly different, so we do not report them in this article.
We also obtain a vector of explicit visual features of attractiveness, using the OpenCV software library (http://opencv.org/), based on the work of San Pedro et al. (San Pedro and Siersdorfer, 2009): brightness, saturation, sharpness, entropy, RGB-contrast, colorfulness and naturalness. A more detailed description of these features follows:
Brightness: It measures the level of luminance of an image. For images in the YUV color space, we obtain the average of the luminance component Y.
Saturation: It measures the vividness of a picture. For images in the HSV or HSL color space, we obtain the average of the saturation component S.
Sharpness: It measures how much detail the image contains.
Colorfulness: It measures how distant the colors are from gray.
Naturalness: It measures how natural the picture looks, grouping pixels into Sky, Grass and Skin classes and applying the formula in (San Pedro and Siersdorfer, 2009).
RGB-contrast: It measures the variance of luminance in the RGB color space.
Entropy: It computes Shannon's entropy over the histogram of grayscale pixel values, using the normalized histogram as the probability distribution.
Local Binary Patterns: Although this is not actually an “explicit” visual feature, it is a traditional baseline in several computer vision tasks (Ojala et al., 1996), so we test it for recommendations too.
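Several of the features above can be sketched with plain NumPy (the paper used OpenCV; these formulas follow the textual definitions, so constants and binning are assumptions):

```python
import numpy as np

def rgb_to_luminance(img):
    """Per-pixel luminance Y; img is an HxWx3 float RGB array in [0, 1]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

def brightness(img):
    """Average luminance over the image."""
    return float(rgb_to_luminance(img).mean())

def saturation(img):
    """Mean HSV saturation: (max - min) / max per pixel."""
    mx = img.max(axis=-1)
    mn = img.min(axis=-1)
    s = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-12), 0.0)
    return float(s.mean())

def entropy(img, bins=256):
    """Shannon entropy of the grayscale histogram, used as a distribution."""
    hist, _ = np.histogram(rgb_to_luminance(img), bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def rgb_contrast(img):
    """Spread of luminance values (standard deviation as a variance proxy)."""
    return float(rgb_to_luminance(img).std())
```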
Recommendation Method. To produce a content-based list of artwork recommendations for a user u, we follow this procedure: for every item in the current inventory, (1) we calculate its similarity to each item in the user's profile, (2) we aggregate these similarities into a single score, taking either their sum or their maximum, and (3) we sort the items by score and recommend the top items in the list. Formally, given a user u who has consumed a set of artworks P_u, the score of an arbitrary artwork i from the inventory is
score(u, i) = f({ sim(x_i, x_j) : j ∈ P_u }),
where x_i is the feature vector of painting i, obtained from the AlexNet embedding or from the EVFs, and f is either the sum or the maximum. The similarity function used was cosine similarity:
sim(x_i, x_j) = (x_i · x_j) / (||x_i|| ||x_j||).
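The scoring procedure above can be sketched as follows (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def score_item(item_vec, profile_vecs, aggregate="max"):
    """Aggregate similarities between one inventory item and every artwork
    in the user's profile, using either the max or the sum (as in the text)."""
    sims = [cosine(item_vec, p) for p in profile_vecs]
    return max(sims) if aggregate == "max" else sum(sims)

def recommend(inventory, profile_vecs, k=5, aggregate="max"):
    """inventory: dict of item_id -> feature vector. Returns top-k item ids."""
    scores = {i: score_item(v, profile_vecs, aggregate)
              for i, v in inventory.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```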
5. Evaluation Method
Our protocol, presented in Figure 1, is based on Macedo et al. (Macedo et al., 2015) and evaluates the recommender system in a time-based manner. We attempt to predict the items purchased in every transaction, where the training set contains all the artworks previously bought by the user just before the transaction to be predicted. Users who purchased exactly one artwork were considered cold-start users and removed from this evaluation. After this filtering, our dataset ends up with 365 people who bought more than a single item and who performed a total of 1,629 transactions (i.e. we conducted 1,629 evaluation tests with each algorithm).
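The time-based protocol can be sketched as follows (a minimal hit-rate version; the paper reports ranking metrics such as nDCG, recall and precision, and the names here are illustrative):

```python
def time_based_evaluation(transactions, recommend_fn, k=10):
    """transactions: list of (user, timestamp, purchased_item_ids),
    sorted by timestamp. For each transaction, the user's profile is
    everything that user bought strictly before it; transactions with an
    empty profile (cold start) are skipped, as in the paper."""
    profiles = {}          # user -> list of item ids bought so far
    hits, tests = 0, 0
    for user, _ts, items in transactions:
        profile = profiles.get(user, [])
        if profile:        # only evaluate users with a purchase history
            tests += 1
            recs = recommend_fn(profile, k)
            if any(i in recs for i in items):
                hits += 1
        profiles.setdefault(user, []).extend(items)
    return hits / tests if tests else 0.0   # hit rate over evaluation points
```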
6. Results
| Method | nDCG@5 | nDCG@10 | Recall@5 | Recall@10 | Precision@5 | Precision@10 |
| EVF (all features) | .0344 | .0459 | .0547 | .0885 | .0127 | .0111 |
| EVF (all, except LBP) | .0370 | .0453 | .0585 | .0826 | .0152 | .0109 |
Table 1 presents the results, which we summarize in three points:
AlexNet DNN features perform better than those based on EVF, either combined or isolated. Figure 2 presents a sample of the t-SNE map (Maaten and Hinton, 2008) built from the AlexNet DNN visual features, which shows well-defined clusters of images. This result reflects the current state of the art of deep neural networks in computer vision, but, as already stated, the lack of transparency of DNN visual features can hinder user acceptance of these recommendations due to the difficulty of explaining them (Konstan and Riedl, 2012; Verbert et al., 2013).
Combining EVF features improves their performance compared to using them in isolation, though in some cases adding a feature is detrimental. For instance, the combination EVF (all, except LBP) yields better nDCG@5, recall@5 and precision@5 than EVF (all features).
Comparing isolated EVF features, LBP performs best because it encodes texture patterns and local contrast very well; however, it is harder to explain to users than image brightness or contrast. Considering only the 7 original features proposed by San Pedro et al. (San Pedro and Siersdorfer, 2009), we observe that entropy, contrast and naturalness perform consistently well, as does brightness in terms of precision. In contrast, colorfulness does not seem to have a significant impact on providing good recommendations.
6.1. Relation between DNN and EVF features
We also explored, through a correlation analysis, the potential relation between each of the AlexNet features and the isolated EVFs; results are shown in Table 2. We found that brightness has a significant positive correlation with one AlexNet feature as well as a significant negative correlation with another. The opposite holds for naturalness, whose largest positive and negative correlations are both small in magnitude, meaning that no single AlexNet DNN neuron seems to be explicitly encoding naturalness.
Figure 3 shows a plot of the correlations of brightness with all AlexNet dimensions, sorted from smallest to largest. The figure also presents sample images that support this analysis: two images with high and low brightness show correspondingly high and low activations in the most positively correlated AlexNet feature, and the opposite pattern in the most negatively correlated one.
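This per-dimension correlation analysis can be sketched as follows (a plain Pearson correlation, which we assume matches the paper's analysis; function names are illustrative):

```python
import numpy as np

def per_dimension_correlations(embeddings, evf_values):
    """Pearson correlation between one explicit feature (e.g. brightness)
    and each dimension of the DNN embedding.
    embeddings: (n_images, n_dims) array; evf_values: (n_images,) array."""
    e = embeddings - embeddings.mean(axis=0)
    v = evf_values - evf_values.mean()
    denom = e.std(axis=0) * v.std() * len(v)
    # Dimensions with zero variance (e.g. dead ReLU units) get correlation 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.where(denom > 0, (e * v[:, None]).sum(axis=0) / denom, 0.0)
    return r
```

Sorting `r` (as in Figure 3) then exposes the most positively and most negatively correlated embedding dimensions for a given explicit feature.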
Limitations. It is important to note that this is only an exploratory analysis, and its generalizability should be considered carefully. Nevertheless, the high correlation of brightness and the small correlation of naturalness provide a hint towards which types of features are, and are not, explicitly encoded in single neurons of the fc6 layer of the AlexNet DNN.
7. Conclusion and Future Work
In this article we have investigated the impact of latent (DNN) and explicit (EVF) visual features on artwork recommendation. Our results support that DNN features outperform explicit visual features. Although one previous work on movie recommendation found the opposite (Deldjoo et al., 2017), i.e. EVF being more informative than DNN visual features, we argue that the domain differences (paintings vs. movies) and the fact that we use images rather than video might explain the discrepancy, but further research is needed on this point.
We also show that some EVF contribute more information to the artwork recommendation task, such as entropy, contrast, naturalness and brightness. In contrast, colorfulness is the least informative visual feature for this task.
Moreover, through a correlation analysis we showed that brightness, saturation and entropy are significantly correlated with some DNN embedding features, while naturalness is poorly correlated. These preliminary results should be studied further with the aim of improving the explainability of these black-box models (Verbert et al., 2013).
In future work, we will study the use of other visual embeddings which have shown good results in computer vision tasks, such as GoogleNet (Szegedy et al., 2015). In addition, we will conduct a user study to investigate ways to explain image recommendations using the relations found between visual DNN features and EVF.
Acknowledgements. We thank Alex Farkas and Stephen Tanenbaum from UGallery for sharing the dataset and answering our questions. We also thank Felipe Cortes, a PUC Chile student who implemented a web tool to visualize the DNN embedding. Authors Vicente Dominguez, Pablo Messina and Denis Parra are supported by the Chilean research agency Conicyt, under a Fondecyt Iniciacion grant.
References
- Aroyo et al. (2007) LM Aroyo, Y Wang, R Brussee, Peter Gorgels, LW Rutledge, and N Stash. 2007. Personalized museum experience: The Rijksmuseum use case. In Proceedings of Museums and the Web.
- Cremonesi et al. (2010) Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of Recommender Algorithms on Top-n Recommendation Tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys ’10). ACM, New York, NY, USA, 39–46.
- Deldjoo et al. (2017) Yashar Deldjoo, Massimo Quadrana, Mehdi Elahi, and Paolo Cremonesi. 2017. Using Mise-En-Scene Visual Features based on MPEG-7 and Deep Learning for Movie Recommendation. arXiv preprint arXiv:1704.06109 (2017).
- He et al. (2016) Ruining He, Chen Fang, Zhaowen Wang, and Julian McAuley. 2016. Vista: A Visually, Socially, and Temporally-aware Model for Artistic Recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys ’16). ACM, New York, NY, USA, 309–316. https://doi.org/10.1145/2959100.2959152
- He and McAuley (2016) Ruining He and Julian McAuley. 2016. VBPR: visual Bayesian Personalized Ranking from implicit feedback. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI Press, 144–150.
- Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014).
- Jonker and Volgenant (1987) Roy Jonker and Anton Volgenant. 1987. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing 38, 4 (1987), 325–340.
- Konstan and Riedl (2012) Joseph A Konstan and John Riedl. 2012. Recommender systems: from algorithms to user experience. User Modeling and User-Adapted Interaction 22, 1-2 (2012), 101–123.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
- Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
- Macedo et al. (2015) Augusto Q Macedo, Leandro B Marinho, and Rodrygo LT Santos. 2015. Context-aware event recommendation in event-based social networks. In Proceedings of the 9th ACM Conference on Recommender Systems. ACM, 123–130.
- Manning et al. (2008) Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze, et al. 2008. Introduction to information retrieval. Vol. 1. Cambridge university press Cambridge.
- McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43–52.
- Messina et al. (2017) Pablo Messina, Vicente Dominguez, Denis Parra, Christoph Trattner, and Alvaro Soto. 2017. Exploring Content-based Artwork Recommendation with Metadata and Visual Features. arXiv preprint arXiv:1706.05786 (2017).
- Nguyen et al. (2016) Anh Nguyen, Jason Yosinski, and Jeff Clune. 2016. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. arXiv preprint arXiv:1602.03616 (2016).
- Ojala et al. (1996) Timo Ojala, Matti Pietikäinen, and David Harwood. 1996. A comparative study of texture measures with classification based on featured distributions. Pattern recognition 29, 1 (1996), 51–59.
- San Pedro and Siersdorfer (2009) Jose San Pedro and Stefan Siersdorfer. 2009. Ranking and Classifying Attractiveness of Photos in Folksonomies. In Proceedings of the 18th International Conference on World Wide Web (WWW ’09). ACM, New York, NY, USA, 771–780.
- Sharif Razavian et al. (2014) Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. 2014. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 806–813.
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
- Verbert et al. (2013) Katrien Verbert, Denis Parra, Peter Brusilovsky, and Erik Duval. 2013. Visualizing recommendations to support exploration, transparency and controllability. In Proceedings of the 2013 international conference on Intelligent user interfaces. ACM, 351–362.
- Weinswig (2016) Deborah Weinswig. 2016. Art Market Cooling, But Online Sales Booming. https://www.forbes.com/sites/deborahweinswig/2016/05/13/art-market-cooling-but-online-sales-booming/. (2016). [Online; accessed 21-March-2017].
- Zeiler and Fergus (2014) Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European conference on computer vision. Springer, 818–833.