Exquisitor: Interactive Learning at Large
Increasing scale is a dominant trend in today’s multimedia collections, which especially impacts interactive applications. To facilitate interactive exploration of large multimedia collections, new approaches are needed that are capable of learning on the fly new analytic categories based on the visual and textual content. To facilitate general use on standard desktops, laptops, and mobile devices, they must furthermore work with limited computing resources. We present Exquisitor, a highly scalable interactive learning approach, capable of intelligent exploration of the large-scale YFCC100M image collection with extremely efficient responses from the interactive classifier. Based on relevance feedback from the user on previously suggested items, Exquisitor uses semantic features, extracted from both visual and text attributes, to suggest relevant media items to the user. Exquisitor builds upon the state of the art in large-scale data representation, compression and indexing, introducing a cluster-based retrieval mechanism that facilitates the efficient suggestions. With Exquisitor, each interaction round over the full YFCC100M collection is completed in less than 0.3 seconds using a single CPU core. That is 4x less time using 16x smaller computational resources than the most efficient state-of-the-art method, with a positive impact on result quality. These results open up many interesting research avenues, both for exploration of industry-scale media collections and for media exploration on mobile devices.
A dominant trend in multimedia applications for industry and society today is the ever-growing scale of media collections. As the general public has been given tools for unprecedented media production, storage and sharing, media generation and consumption have exceeded all expectations. Furthermore, upcoming multimedia applications in countless domains, from smart urban spaces and business intelligence to health and wellness, lifelogging, and entertainment, increasingly require joint modelling of visual content and text. This vastly increases the number of items that must be dealt with, making scalability an even greater concern (Rudinac et al., 2018). Underlining the importance of collection scale, the multimedia research community created the YFCC100M collection, with associated calls to arms to tackle scalability of multimedia applications (Thomee et al., 2016).
At the same time, the way in which people consume multimedia has been changing drastically. Most multimedia applications today are used on mobile devices with limited computing resources. Nevertheless, the expectation of users is to be able to very efficiently work with their own media collections, as well as with larger external collections. It is therefore imperative that the multimedia community embraces this new trend and provides multimedia tools capable of handling large collections with limited hardware resources.
In this paper, we answer the call for multimedia scalability by addressing the problem of interactive search and exploration in such large collections, using limited hardware resources available to broad audiences. This is a particularly challenging problem, as with very large collections it is difficult for users to form queries that yield satisfactory results.
User relevance feedback, a form of interactive learning, provides an efficient mechanism for addressing various analytic tasks that require alternating between search and exploration. Early work on user relevance feedback, however, suffered from both a lack of meaningful representations of the media items and a lack of high-dimensional indexing techniques for scaling up the feedback loop. For example, content-based image and video retrieval with user relevance feedback commonly relied on non-semantic low-level visual features, such as colour, texture, shape and edge histograms (Rui et al., 1997), and used inefficient indexing techniques to facilitate processing, such as R-trees and kd-trees (Flickner et al., 1995). Similarly, even linear models for classification, a popular choice in user relevance feedback applications, were not interactive on collections with, e.g., 100K items (Chang and Lin, 2011).
There has been relatively little work on user relevance feedback in the last decade, which recently raised serious concerns in the multimedia community (Schoeffmann et al., 2018). However, recent advances in high-dimensional indexing have yielded approaches supporting standard nearest neighbor search in collections with billions of features (Gudmundsson et al., 2017; Iscen et al., 2018; Wang et al., 2018; Jiang et al., 2015). Furthermore, advances in data representation, as well as the pressing need for methods to cope with large-scale media collections, clearly imply that the time has come to re-visit interactive learning.
In this paper, we present Exquisitor, the first on-the-fly learning approach capable of interactive exploration of the YFCC100M collection. Overall, our results using the YFCC100M collection show that with the Exquisitor system, the time required to suggest new items in each iteration of the interactive feedback loop is about 4x shorter using 16x fewer computational resources than the state of the art (Zahálka et al., 2018a; Jégou et al., 2011), a performance improvement of nearly two orders of magnitude, while also improving result quality. Our results thus show that interactive learning is clearly feasible with today’s large-scale collections, even on limited hardware.
In this paper, we make the following major contributions:
Exquisitor significantly surpasses the state of the art in interactive search and exploration at large by building on the recent advances in several multimedia areas: data representation from deep learning, data compression from interactive learning, and data retrieval from high-dimensional indexing.
Exquisitor introduces a novel approach to retrieval from cluster-based indexing structures, retrieving the k farthest items from the decision boundary of an interactive classifier.
We show, in an experimental evaluation, that suggestions for the interacting user can be produced with sub-second latency using only a single CPU core and very limited memory—hardware resources comparable to today’s high-end mobile devices—without any sacrifices in accuracy.
The remainder of this paper is organized as follows. In Section 2, we analyse state-of-the-art methods in interactive learning from a scalability perspective, setting the stage for the Exquisitor approach. In Section 3, we then present the Exquisitor approach in detail, and analyse its performance in Section 4, before concluding.
2. Related Work
As outlined in the introduction, unlocking the true potential of multimedia collections and providing added value for professional and casual users alike requires joint utilization of interactive learning and high-dimensional indexing. In this section we first describe the state of the art in interactive learning. Then, based on the identified advantages and limitations of interactive learning algorithms, we provide a set of requirements that high-dimensional indexing should satisfy to facilitate interactivity on extremely large collections. Finally, we use those requirements to reflect on the state of the art in high-dimensional indexing.
2.1. Interactive Learning
Often regarded as an exotic machine learning flavour by theorists, as it does not fit into the strict “supervised-unsupervised-reinforcement” categorization, interactive learning became an essential tool of multimedia researchers from the early days of content-based image and video retrieval (Rui et al., 1997; Huang et al., 2008). While the dominant focus of the research community turned to supervised approaches, which was further encouraged by successes of deep learning, interactive learning nevertheless survived the test of time by providing exploratory access to ever-growing multimedia collections, as it facilitates, e.g., the incorporation of human (expert) knowledge (Beluch et al., 2018; Mironică et al., 2016), the learning of new analytic categories on the fly (Zahálka et al., 2018a; Kovashka et al., 2015) and the training of accurate classifiers with a minimal number of labeled samples (Yang et al., 2015; Wang et al., 2017; Beluch et al., 2018). Interactive learning involves retrieving items from a collection and showing them to the user, who in turn judges their relevance based on particular criteria, and then using the obtained relevance judgments to modify or re-train the classifier on the fly. The process is repeated as long as the user deems fit for her insight gain. Interactive learning comes in two basic forms, active learning and user relevance feedback (Huang et al., 2008).
2.1.1. Active Learning
In active learning, the interacting user annotates samples that will contribute the most to the quality of the final model (Cohn et al., 1996). In practice, this often means annotating the decision boundary between classes or, in other words, the items for which the classifier is most uncertain (Huijser and v. Gemert, 2017). The technique, originally proposed in the 90s, recently experienced a revival in the computer vision and multimedia communities as a means of training data-hungry CNNs when obtaining additional labels is costly or infeasible due to, e.g., the limited time an expert can spend producing annotations (Beluch et al., 2018). The algorithm is commonly trained using a small number of annotated examples, and then a fraction of the unlabeled items is retrieved and presented to the user for relevance judgment based on a conveniently selected proxy for uncertainty (Yang et al., 2015; Wang et al., 2017; Beluch et al., 2018). Despite its effectiveness in training accurate models using a small number of samples, active learning is not suitable for our use case of interactive search and exploration. Namely, it does not optimize for the relevance of the items shown to the user in each interaction round (in fact, the opposite), which is one of the main requirements in the design of multimedia analytics systems (Zahálka et al., 2015).
2.1.2. User Relevance Feedback
In contrast with active learning, user relevance feedback algorithms present the user, in each interaction round, with the items for which the model (e.g. classifier or regressor) is most confident (Rui et al., 1997). This strategy may require more interaction rounds for the same final quality of the model, but it is more likely to produce relevant items in each interaction round, which is of utmost importance when gaining insight into multimedia collections. Not only is it easier for the user to judge the items for which the model is most confident, but the process of gaining insight is complex, vaguely structured, and incremental, which requires looking at intermediate results rather than final results (Zahálka and Worring, 2014; North, 2006). User relevance feedback was frequently the weapon of choice in the best performing entries of benchmarks focusing on interactive video search and exploration (Snoek et al., 2008; Lokoč et al., 2018). However, those solutions were designed for collections far smaller than YFCC100M, which is the challenge we take in this paper.
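The difference between the two selection strategies can be illustrated with a minimal sketch. It assumes each item already has a signed classifier score (its distance to the decision boundary); the function names and the toy scores are hypothetical and only serve the illustration.

```python
def active_learning_pick(scores, k):
    """Indices of the k items closest to the decision boundary (most uncertain)."""
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]))[:k]


def relevance_feedback_pick(scores, k):
    """Indices of the k items farthest on the positive side (most confident)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy signed scores for five items (hypothetical values).
scores = [2.1, -0.1, 0.05, -3.0, 1.4]
uncertain = active_learning_pick(scores, 2)     # items 2 and 1
confident = relevance_feedback_pick(scores, 2)  # items 0 and 4
```

Active learning would show the user the two borderline items, whereas relevance feedback would show the two items the model currently considers most relevant.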
Although attempts have been made to facilitate relevance feedback using CNNs, they are still considered a suboptimal choice for several reasons. Normally they require a large amount of labeled training data, while users are willing to annotate only a small number of samples in each interaction round. In addition, explainability of results is of utmost importance in analytical tasks, which is why linear models are preferred. Indeed, Linear SVM is still one of the most frequent choices in relevance feedback applications (Kovashka et al., 2015; Mironică et al., 2016; Zahálka et al., 2018b) due to its simplicity and the ability to produce accurate results with few annotated samples and scale to very large collections.
To the best of our knowledge, Blackthorn (Zahálka et al., 2018b) is the most efficient interactive multimodal learning approach in the literature. Compared to product quantization (Jégou et al., 2011), a popular alternative optimized for k-NN search, Blackthorn was found to yield significantly more accurate results over YFCC100M with similar latency, while consuming only modest computational resources. This performance is achieved through adaptive data compression and feature selection, as well as a classification model capable of scoring items directly in the compressed domain. With this in mind, we conjecture that Blackthorn is the state-of-the-art approach for our use case.
2.1.3. Requirements for Indexing
The most computationally intensive part of the interactive learning process is the selection of candidates to show to the user. This process must in principle examine the feature vectors of all items of the media collection, while eventually only a tiny fraction of the large collection is shown to the user. Thus, there remains a potential for performance improvement that interactive learning on its own does not tap into: utilizing the inherent structure of the feature vector collections.
Efficient utilization of the inherent structure of data is the domain of high-dimensional indexing and some early relevance feedback approaches did indeed consider indexing feature vectors for efficient scoring. However, since indexing approaches at that time generally only supported similarity-based queries, index-based relevance feedback approaches typically focused on refining search queries rather than classification boundaries. Furthermore, due to lack of scalable high-dimensional indexing methods in the past, these approaches were always limited to small collections only.
As we look towards today’s scalable high-dimensional indexing approaches as a potential source of performance improvements, we have identified the following requirements for a successful high-dimensional indexing approach that enhances the performance of interactive learning:
Short and Stable Response Time: The highly interactive nature of the process demands not only a short response time, but also predictable response time, to avoid distracting users. [Footnote 1: The requirement for predictable latency is well known from other domains (Björling et al., 2017; Barroso et al., 2017; Amsaleg et al., 2018).] The approximate nature of the queries and features, on the other hand, limits the impact of result quality guarantees in the high-dimensional space. A successful approach in the interactive learning setting combines good result quality with response time guarantees (Tavenard et al., 2011).
Preservation of Feature Space Similarity Structure: The purpose of interactive classifiers is to capture the user intent as it evolves during the interactive session. The classifiers capture this intent using a hyperplane that attempts to separate relevant items identified in previous interaction rounds from the rest of the collection, and compute relevance of the remaining items based on the similarity structure of the feature space. The space partitioning of the high-dimensional indexing algorithm must preserve this similarity structure.
Farthest Neighbours: As discussed above, relevance feedback approaches typically request the items farthest from the classification boundary. Furthermore, as the results are intended for display on screen, the index must return exactly k farthest neighbours (k-FN). Finally, since the interactive classifier is an approximation of the analyst’s intent, approximate answers are also acceptable.
We are not aware of any work in the high-dimensional literature specifically targeting approximate k-FN where the query is a classification boundary. We therefore next review the related work and discuss how well different classes of high-dimensional indexing methods can potentially satisfy these three requirements.
2.2. High-Dimensional Indexing
Due to the curse of dimensionality, scalable high-dimensional indexing methods must rely on approximate similarity searches, typically trading off small reductions in quality (or even just quality guarantees) for dramatic response time improvements. In this section we therefore only consider approximate methods.
2.2.1. Approximate Nearest-Neighbour Queries
Most high-dimensional indexing methods rely on some form of quantization. The first group of methods uses scalar quantization. LSH, for example, uses random projections acting as locality preserving hashing functions (Gionis et al., 1999; Datar et al., 2004; Andoni and Indyk, 2006; Shakhnarovich et al., 2006). Its performance mainly depends on the quality and the number of hashing functions in use. Hence, many approaches improve hashing (Weiss et al., 2008; Jain et al., 2008; Tao et al., 2009; Wang et al., 2010; Paulevé et al., 2010; Zhang et al., 2011; Jin et al., 2013), whereas others reduce the number of hash functions (Lv et al., 2007; Joly and Buisson, 2008). LSH and similar methods, however, fail to satisfy the three requirements: they focus on quality guarantees rather than performance guarantees (R1); hashing creates “slices” in high-dimensional space, making ranking based on distance to a decision boundary impossible (R2); and they typically focus on ε-range queries, giving no guarantees on the number of results returned (R3). While too many answers can be handled by filtering, too few answers may also be returned.
The NV-tree is another high-dimensional indexing method which also uses random projections at its core (Lejsek et al., 2009; Lejsek et al., 2011). It recursively projects points onto segmented random lines and stores the resulting buckets on disk. The NV-tree is a disk-based method, designed for collections larger than RAM, and has been shown to outperform LSH for large-scale indexing (Lejsek et al., 2009). The NV-tree satisfies R1 and R3 well, but its leaves have irregular shapes and do not satisfy R2.
A second group of methods is based on vector quantization, typically using clustering approaches, such as k-means, to determine a set of representative feature vectors to use for the quantization. These methods create Voronoï cells in the high-dimensional space, which satisfy R2 very well. Some methods, such as BoW-based methods, only store image identifiers in the clusters, thus failing to support R3, while others satisfy that requirement by storing the features and ranking the results in the nearest (or farthest) clusters. Finally, many clustering methods seek to match well the distribution of data in the high-dimensional space. Typically, these methods end with a large portion of the collection, often more than 20%, in a single cluster, which in turn takes very long to read and score, thus failing to satisfy R1. The extended Cluster Pruning (eCP) algorithm, however, is an example of a vector quantizer which attempts to balance cluster sizes for improved performance, thus aiming to satisfy all three requirements.
Product quantization (PQ) (Jégou et al., 2011) and its many variants (Xioufis et al., 2014; Ge et al., 2014; Kalantidis and Avrithis, 2014; Heo et al., 2014; Babenko and Lempitsky, 2012, 2015) cluster the high-dimensional vectors into low-dimensional subspaces that are indexed independently. Compared to hashing based methods, the ones relying on clustering better capture the location of points in the high-dimensional space, which in turn improves the quality of the approximate results that are returned. One of the main aims of PQ is compression of the data, however, and PQ-based methods essentially transform the Euclidean space, complicating the identification of furthest neighbours (R2). In (Zahálka et al., 2018a), PQ-compression was compared directly with the novel compression method proposed for Blackthorn; the results showed that with similar compression levels, PQ-compression yielded significantly inferior result quality. As PQ-compression is a pre-requisite for using PQ, it does not appear to be a promising candidate for user relevance feedback.
2.2.2. Hyperplane-Based Nearest-Neighbour Queries
Some researchers have considered the following problem: given a collection of high-dimensional points, which are closest to a hyperplane cutting through the high-dimensional space? This problem is central to various active learning tasks, where the goal is to request labels for those points that are most informative, as described above, which in turn helps find the most appropriate SVM decision boundary. The proposed solutions rely on projections and hash functions (Crucianu et al., 2008; Jain et al., 2010; Basri et al., 2011; Vijayanarasimhan et al., 2014), which means that, based on the analysis above, these hyperplane-based approaches from the literature are not applicable to user relevance feedback.
2.2.3. Farthest-Neighbour Queries
Farthest neighbour queries are in part motivated by the need to improve the diversity of what is returned to users, e.g., in applications making use of collaborative filtering for product recommendation (Abbar et al., 2013a, b; Said et al., 2013). The farthest neighbours problem consists of finding the vectors from a data set that maximize the distance to a query point. Approximate solutions (Agarwal et al., 1991; Indyk, 2003; Pagh et al., 2015; Curtin and Gardner, 2016; Xu et al., 2017; Pagh et al., 2017), based on hashing or exploiting the distribution of the data, are often preferred to exact ones (Williams, 2004; de Berg et al., 2008), which are extremely costly to compute. Some methods are named c-approximate, as they return vectors whose distance to the query point is within a factor c of the distance to its true farthest neighbour (Indyk, 2003; Pagh et al., 2017). As before, these methods fail to satisfy the three requirements.
Based on the requirements above, and our analysis of the state of the art in high-dimensional indexing, we believe that cluster-based approaches, such as eCP, are the best candidates for relevance feedback. These approaches, however, have never before been used for k-farthest neighbour queries from a decision boundary.
3. The Exquisitor Approach
In this section, we describe Exquisitor, the first approach in the literature capable of interactive learning over the YFCC100M image collection using hardware resources similar to those found in high-end mobile devices. [Footnote 2: While many current mobile devices have only 2–4GB of RAM, the trend is clearly towards larger RAM: the latest Samsung Galaxy models are equipped with 8GB of RAM, while the upcoming models will have 12GB (Kelly, 2019).] Figure 2 shows an outline of the Exquisitor approach, and also an outline of this section. We start by considering the multimodal data representation, then describe the indexing and retrieval algorithms, before describing the choice of suggestions and interactions with the user.
To facilitate the exposition in this section, we use actual parameters and settings from the YFCC100M collection in various places, as this allows us to discuss several practical issues that arise when dealing with such a large and unstructured image collection. Needless to say, however, the Exquisitor approach can handle any image collection, including much larger collections than YFCC100M.
3.1. Image Representation
The YFCC100M collection contains 99,206,564 Flickr images, their associated annotations (i.e. title, tags and description), and a range of metadata produced by the capturing device, the online platform, and the user (e.g., geo-location and time stamps). Following recent literature, each image is represented by two semantic feature vectors. The visual content is encoded using 1,000 ILSVRC concepts (Russakovsky et al., 2015) extracted using the GoogLeNet convolutional neural network (Szegedy et al., 2015). The textual content is encoded by a) treating the title, tags, and description as a single text document, and b) extracting 100 LDA topics for each image using the gensim toolkit (Řehůřek and Sojka, 2010).
Directly working with these representations, however, is infeasible. They require around 880GB of memory, which not only hardly fits in RAM, but also is way beyond the storage capacity of mobile devices. The feature vectors are therefore compressed using the methodology presented in (Zahálka et al., 2018b). By storing only the 6 most important features of each feature vector, along with the feature identifiers, and by using a compression method based on the ratio between feature values, each feature vector can be represented using only three 64-bit integers, resulting in 48 bytes per image, or about 4.8GB in total.
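The storage figures above can be checked with a short back-of-the-envelope calculation. The sketch assumes that each uncompressed feature value occupies 8 bytes (the text does not state the value width), which yields roughly 873GB, in line with the "around 880GB" quoted above; the constant names are illustrative.

```python
NUM_IMAGES = 99_206_564   # images in YFCC100M, as stated above
VISUAL_DIMS = 1_000       # ILSVRC concepts per image
TEXT_DIMS = 100           # LDA topics per image
BYTES_PER_VALUE = 8       # assumption: 64-bit values in the uncompressed vectors

raw_bytes = NUM_IMAGES * (VISUAL_DIMS + TEXT_DIMS) * BYTES_PER_VALUE

# Two modalities per image, each compressed to three 64-bit integers.
compressed_bytes_per_image = 2 * 3 * 8
compressed_bytes = NUM_IMAGES * compressed_bytes_per_image

print(f"uncompressed: {raw_bytes / 1e9:.0f} GB")       # 873 GB
print(f"compressed:   {compressed_bytes / 1e9:.1f} GB")  # 4.8 GB
```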
The most important features are selected using a TF-IDF based approach, where the strongest features appear with high confidence in a few images (Zahálka et al., 2018b). Consider a feature with a given average value and standard deviation across all the items in the collection. For an individual item, the TF-IDF score of that feature is given by Equation 1, which also involves the number of items in the collection and the Iverson bracket.
The TF portion is thus the value of the feature itself, while the IDF portion determines the feature’s rarity, represented by the fraction of the collection where the feature is strongly present. The features of each media item are sorted by this score in descending order and the top 6 features are selected to represent the item.
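The selection step can be sketched as follows. This is an illustration under stated assumptions, not the exact formula of Zahálka et al. (2018b): it assumes that a feature counts as "strongly present" in an item when its value exceeds the collection mean plus one standard deviation, and that the IDF term is a smoothed logarithm of the inverse fraction of such items. The function name `top_features` is hypothetical.

```python
import math
from statistics import mean, pstdev


def top_features(items, keep=6):
    """For each item, return the `keep` (feature_id, value) pairs with the highest TF-IDF."""
    n = len(items)
    dims = len(items[0])
    idf = []
    for f in range(dims):
        column = [item[f] for item in items]
        # Assumption: "strongly present" means above mean + one standard deviation.
        threshold = mean(column) + pstdev(column)
        strong = sum(1 for value in column if value > threshold)  # Iverson-bracket sum
        # Assumption: smoothed log ratio as the rarity (IDF) term.
        idf.append(math.log(n / (1 + strong)))
    selected = []
    for item in items:
        # TF is the feature value itself; sort features by TF * IDF, descending.
        order = sorted(range(dims), key=lambda f: item[f] * idf[f], reverse=True)
        selected.append([(f, item[f]) for f in order[:keep]])
    return selected


# Toy collection: 4 items with 3 features each (hypothetical values).
toy = [[0.9, 0.1, 0.0], [0.1, 0.1, 0.8], [0.1, 0.1, 0.1], [0.1, 0.1, 0.1]]
best = top_features(toy, keep=1)
```

In the toy collection, the first item is represented by feature 0 and the second by feature 2, since those features are both strong in the item and rare in the collection.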
The two compressed feature vector collections have some interesting properties worth mentioning, that would not occur in a smaller collection that was easier to curate. First, some proportion of the images in the collection have been removed from Flickr, and therefore are represented in the collection using a standard “not found” image. [Footnote 3: The image collection was actually downloaded very shortly after release, but already then this had become a significant issue. Interestingly, the classification concept most strongly associated with the “not found” images is “menu”.] As a result, the visual feature vectors for these missing images are identical and if one is considered a candidate in the visual domain, they all are, potentially crowding out more suitable candidates. Second, a similar situation arises in the textual domain, where many images have no text tags, and hence their textual feature vector is all zeros. Third, due to the lower dimensionality of the textual feature vectors, the likelihood of two images having the same textual feature vector is much higher than the likelihood of two images having the same visual feature vector. As we show below, all these properties impact the cluster size distribution significantly, which in turn impacts the time required to propose suggestions. In short, the more even the distribution, the less processing time is required. However, our results show that even with these properties, very short latency is achieved.
3.2. Data Indexing
The data indexing algorithm used in Exquisitor is based on the extended Cluster Pruning (eCP) algorithm (Gudmundsson et al., 2010; Philbin et al., 2008). As motivated in Section 2, the goal is to individually cluster each of the two feature collections with a vector quantizer, using a hierarchical index structure to facilitate efficient selection of clusters to process for suggestions. The clusters for each collection are formed by randomly picking a set of feature vectors, called representatives, from the collection, and then assigning all feature vectors to these representatives based on proximity. The Euclidean distance function has been implemented directly in compressed space and used as the discriminative distance function for eCP.
To facilitate the assignment to clusters, as well as the subsequent retrieval from clusters, an index is created using Algorithm 1. When calling CreateIndex(), the algorithm recursively selects 1% of the features at each level as representatives for the level above, until fewer than 100 representatives remain to form the root of the index. The bottom level of the index for each modality in the YFCC100M collection thus consists of roughly 992K clusters, organized in a 3-level deep index hierarchy, which gives on average 100 feature vectors per cluster and per internal node.
Two notes are in order. First, when building the indices, the average cluster size was chosen to be small, as previous studies show that searching more small clusters yields better results than searching fewer large clusters (Gudmundsson et al., 2012; Sigurðardóttir et al., 2005). Second, eCP is essentially the first step of the k-means algorithm. The reason for avoiding the refinement iterations of k-means, in addition to efficiency, is that the cluster size distribution tends to become more skewed as more iterations are completed, and multiple works from the literature have shown that skewed cluster size distributions are anathema to stable response time (Gudmundsson et al., 2010; Tavenard et al., 2011; Sigurðardóttir et al., 2005; Baranchuk et al., 2018; Amsaleg et al., 2018). This is particularly important in the YFCC100M setting, as the feature collections already have some inherent skew, as mentioned above, which the indexing approach should not further aggravate.
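The recursive selection of representatives can be sketched as below. This is a simplified illustration and not the paper's Algorithm 1: it works on uncompressed vectors, rounds the 1% share to the nearest integer (an assumption), and assigns items by scanning all bottom-level representatives instead of descending the hierarchy. The names `level_sizes` and `build_index` are hypothetical.

```python
import random


def level_sizes(n_items, fraction=0.01, root_limit=100):
    """Number of representatives per level, bottom level first."""
    sizes = []
    count = n_items
    while True:
        count = max(1, round(count * fraction))  # assumption: round to nearest
        sizes.append(count)
        if count < root_limit:  # few enough representatives to form the root
            return sizes


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_index(vectors, fraction=0.01, root_limit=100, seed=0):
    """Pick random representatives level by level, then assign items to clusters."""
    rng = random.Random(seed)
    levels, current = [], list(range(len(vectors)))
    for size in level_sizes(len(vectors), fraction, root_limit):
        current = rng.sample(current, size)  # representatives of the level above
        levels.append(current)
    clusters = {rep: [] for rep in levels[0]}
    for item_id, vec in enumerate(vectors):
        # Simplification: linear scan over the bottom-level representatives.
        nearest = min(levels[0], key=lambda rep: sq_dist(vec, vectors[rep]))
        clusters[nearest].append(item_id)
    return levels, clusters


print(level_sizes(99_206_564))  # [992066, 9921, 99]

# Small usage example with random 2-D points and looser parameters.
rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(2000)]
levels, clusters = build_index(points, fraction=0.05, root_limit=10)
```

Because there are no refinement iterations, the representatives stay where they were randomly picked, which corresponds to stopping after the first assignment step of k-means.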
3.3. Suggestion Retrieval
The retrieval of suggestions has the following three phases. First, the most relevant clusters are identified, then the most relevant candidates for each modality are identified, and finally the most relevant suggestions are selected from the set of candidates using modality fusion. Each phase is described below; the next subsection then discusses some extensions to the basic retrieval method, including how to handle the extreme skew in the cluster size distribution. As noted in Section 2, to the best of our knowledge this is the first instance of using cluster-based indexing approaches to facilitate k-farthest neighbor queries from a hyperplane.
3.3.1. Identify Most Relevant Clusters:
The number of top clusters considered can be adjusted by a search expansion parameter, which affects the size of the subset that will be scored. This parameter can be used to balance between search quality and latency at run-time. In each iteration of the interactive learning process, the index of representatives is used to identify, for each modality, the clusters most likely to contain useful candidates for suggestions. As described in Section 3.5, the classifier used in Exquisitor is Linear SVM; the dot-product computations to score representatives (and then feature vectors) are done directly in the compressed space and the clusters farthest from the separating plane (in the positive direction) are selected as the most relevant clusters.
3.3.2. Select Most Relevant Candidates per Modality:
Once the most relevant clusters have been identified, the compressed feature vectors within these clusters are scored to suggest the most relevant media items for each modality. The method of scoring individual feature vectors is the same as when selecting the most relevant clusters; an unordered list of the most relevant items is dynamically maintained throughout the scoring process.
Some notes are in order here. First, in this scoring phase, media items seen in previous rounds are not considered to be candidates for suggestions. Second, an item already seen in the first modality is not considered as a suggestion in the second modality, as it has already been identified as a candidate. Third, if all clusters are small, the system may not be able to identify the requested number of candidates, in which case it simply returns all the candidates found in the clusters.
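The first two retrieval phases for a single modality can be sketched as follows. The sketch makes several simplifying assumptions: vectors are uncompressed, representatives are scored in a flat list rather than through the index hierarchy, and the parameter names `b` (number of clusters to expand) and `k` (number of candidates) are hypothetical labels chosen for illustration.

```python
import heapq


def score(weights, bias, vector):
    """Signed linear-SVM score: positive means on the relevant side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


def suggest(weights, bias, clusters, b, k, seen=frozenset()):
    """clusters: list of (representative_vector, [(item_id, vector), ...])."""
    non_empty = [c for c in clusters if c[1]]  # empty clusters are skipped
    # Phase 1: the b clusters whose representatives lie farthest in the positive direction.
    top = heapq.nlargest(b, non_empty, key=lambda c: score(weights, bias, c[0]))
    # Phase 2: score the members of those clusters, skipping previously seen items.
    candidates = [
        (score(weights, bias, vec), item_id)
        for _, members in top
        for item_id, vec in members
        if item_id not in seen
    ]
    # Returns fewer than k items when the selected clusters are too small.
    return [item_id for _, item_id in heapq.nlargest(k, candidates)]


toy_clusters = [
    ([1.0, 0.0], [("a", [0.9, 0.1]), ("b", [1.2, 0.0])]),
    ([0.0, 1.0], [("c", [0.1, 0.9])]),
    ([-1.0, 0.0], [("d", [-1.0, 0.2])]),
    ([5.0, 0.0], []),
]
first_round = suggest([1.0, 0.5], 0.0, toy_clusters, b=2, k=2)
second_round = suggest([1.0, 0.5], 0.0, toy_clusters, b=2, k=2, seen={"b"})
```

Raising `b` scores more items per round, which illustrates the quality versus latency trade-off described above.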
3.3.3. Modality Fusion for Most Relevant Suggestions:
Once the most relevant candidates from each modality have been identified, the modalities must be fused by aggregating the candidate lists to produce the final list of suggestions. First, for each candidate in one modality, the score in the other modality is computed if necessary, by directly accessing the compressed feature vector, resulting in candidates with scores in both modalities. [Footnote 4: To facilitate late modality fusion, the location of each feature vector in each cluster index is also stored as an array; each such vector requires about 800KB of RAM.] Second, the rank of each item in each modality is computed by sorting the candidates by the score in the modality. Finally, the average rank is used to produce the final list of most relevant suggestions, thus favoring items that score relatively well in both modalities.
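The rank-based late fusion can be illustrated with a minimal sketch. It assumes that every candidate already has a score in both modalities; the function name `fuse` and the tie-breaking by item identifier are illustrative choices, not taken from the paper.

```python
def fuse(visual_scores, text_scores, k):
    """Both arguments map item_id -> score over the same candidate set."""

    def ranks(scores):
        # Rank 1 is the highest-scoring candidate in the modality.
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {item: position for position, item in enumerate(ordered, start=1)}

    visual_rank, text_rank = ranks(visual_scores), ranks(text_scores)
    average = {item: (visual_rank[item] + text_rank[item]) / 2 for item in visual_scores}
    # Lower average rank is better; ties are broken by item id (an arbitrary choice).
    return sorted(average, key=lambda item: (average[item], item))[:k]


visual = {"a": 0.9, "b": 0.8, "c": 0.1}
text = {"a": 0.2, "b": 0.9, "c": 0.7}
fused = fuse(visual, text, k=2)  # ["b", "a"]
```

Item "b" wins because it ranks well in both modalities, even though "a" has the highest visual score.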
3.4. Retrieval Extensions
We now describe three improvements to the basic retrieval algorithms, which aim at improving both latency and result quality, by addressing practical issues that arise in this real-life setting.
3.4.1. Multi-Core Processing:
Exquisitor can take advantage of the availability of multiple CPU cores. With cores available, the system creates workers and assigns clusters to each worker. Each worker produces suggestions in each modality and fuses the two modalities into candidates, as described above. The top candidates overall are then selected by repeating the modality fusion process using all suggestions from the workers.
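A rough sketch of this worker scheme is given below. It assumes contiguous chunks of the ranked cluster list and caller-supplied `suggest` and `fuse` functions, all of which are hypothetical names. Python threads are used only to show the structure; for CPU-bound scoring they do not provide the parallel speed-up of a native multi-core implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(cluster_ids, num_workers):
    """Split the ranked cluster list into num_workers contiguous chunks."""
    size, extra = divmod(len(cluster_ids), num_workers)
    chunks, start = [], 0
    for i in range(num_workers):
        end = start + size + (1 if i < extra else 0)
        chunks.append(cluster_ids[start:end])
        start = end
    return chunks


def run_workers(cluster_ids, num_workers, suggest, fuse, k):
    """Let each worker produce suggestions from its chunk, then fuse them.

    suggest(chunk) returns the fused candidates of one worker;
    fuse(candidates, k) repeats the fusion over all workers' candidates.
    """
    chunks = split_evenly(cluster_ids, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        per_worker = list(pool.map(suggest, chunks))
    merged = [candidate for suggestions in per_worker for candidate in suggestions]
    return fuse(merged, k)
```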
3.4.2. Handling Skew:
As described above, with the YFCC100M collection, both modalities have 1-2 clusters that are very large, with more than 1M items. These clusters require significant effort to process, while contributing negligibly to the quality of results. Furthermore, in the text domain, many images may have identical feature vectors, resulting in high variability of the cluster sizes. To quantify the amount of such data skew, Figure 3 shows the number of clusters in each size range for each modality. Recall that both cluster indexes are created such that the average cluster should contain 100 feature vectors. Consider first the visual modality. As already mentioned, one cluster of “not found” images contains over 3M feature vectors. The second largest cluster, however, contains less than 10K feature vectors, and more than 85% of all clusters range from 11 to 1000 feature vectors. Note that about 35K clusters have 0 feature vectors. The representatives of these clusters are most likely all found in the large 3M+ cluster; such empty clusters are always omitted from consideration.
Turning to the text modality, Figure 3 shows that the cluster size distribution is significantly more varied, with more large clusters and more empty clusters, but fewer clusters of mid-range sizes. As the empty clusters are ignored, more feature vectors are processed, on average, for the text modality. In the final processing of suggestions, however, both modalities are weighted equally.
With a single worker, the best strategy for handling large clusters is to avoid processing them. In Section 4.3, we explore the impact of omitting clusters above a size threshold on quality and performance. With multiple workers, we can also explore applying multiple workers to the largest clusters, while assigning small clusters to the remaining workers; this is future work.
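The single-worker strategy amounts to a simple filter on cluster size. In the sketch below the function name, the `max_size` parameter and the example sizes are illustrative assumptions.

```python
def eligible_clusters(cluster_sizes, max_size):
    """Return ids of non-empty clusters whose size does not exceed max_size.

    cluster_sizes maps a cluster id to its number of feature vectors.
    Empty clusters are dropped as well, since they are always omitted.
    """
    return [cid for cid, size in cluster_sizes.items() if 0 < size <= max_size]


# Illustrative sizes only: one huge cluster, one empty, two ordinary ones.
sizes = {"v0": 3_500_000, "v1": 9_000, "v2": 0, "v3": 120}
kept = eligible_clusters(sizes, max_size=1_000_000)
```

Here `kept` contains only `v1` and `v3`: the oversized cluster and the empty cluster are both skipped before any scoring takes place.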
3.4.3. Improving Quality:
We observe that media items that score moderately highly in both modalities (in the following, we refer to these as bi-modal media items) are more likely to be relevant than media items with a high score on one modality, but a low score on the other (we refer to these as mono-modal media items). In a setting where all clusters are processed by one worker, there is a risk that the candidate lists in each modality may be dominated by mono-modal items. When the mono-modal items happen to be part of a large cluster, the likelihood that bi-modal items are suggested becomes quite low. Note that turning to a median rank aggregation scheme based on the Condorcet criterion is not a solution in this case, as finding the bi-modal items would require reading a very substantial portion of the collection (Lejsek et al., 2006).
In a multi-worker setting, however, each worker produces its own set of suggestions, which are subsequently merged to produce the final result. The first clusters thus result in one set of candidates, the next clusters result in another set, and so forth. This means that the first 1-2 workers are likely to produce candidates that are mono-modal, while the remaining workers, which are processing moderately relevant clusters, are more likely to suggest bi-modal items. Indeed, our results show that in the basic configuration where each worker produces only one set of candidates, the result quality improves as workers are added, and the results with 16 workers are significantly better than with 1 worker.
We address this problem by introducing a new parameter to segment the clusters to be processed. Each worker then processes such segments. When a single worker is used (), that single worker then produces rounds of suggestions, each from clusters. With this approach, result quality is independent of the number of workers.
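The segmentation idea can be sketched for the single-worker case as follows. The sketch assumes equally sized contiguous segments of the ranked cluster list; `num_segments`, `suggest` and `fuse` are hypothetical names for the segment parameter, the per-segment candidate producer and the final fusion step.

```python
def suggest_in_segments(ranked_clusters, num_segments, suggest, fuse, k):
    """Process the ranked clusters segment by segment, then merge candidates.

    suggest(segment) returns the candidates found in one segment;
    fuse(candidates, k) selects the final k suggestions from all segments.
    """
    # Ceiling division, with a floor of 1 so an empty input does not break range().
    size = max(1, -(-len(ranked_clusters) // num_segments))
    candidates = []
    for start in range(0, len(ranked_clusters), size):
        candidates.extend(suggest(ranked_clusters[start:start + size]))
    return fuse(candidates, k)
```

Because every segment contributes its own candidates, items from moderately relevant clusters reach the final fusion step even when only one worker is available.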
3.5. Relevance Judgment and Learning
Consistent with the state of the art in large-scale user relevance feedback, the classifier used in Exquisitor is linear SVM. The choice is further motivated by the algorithm’s speed, reasonable performance and compatibility with the sparse compressed representation. In each interaction round, the user is provided with a set of suggested items, marks the relevant and not relevant ones and submits those labels to the system. The system then takes the user’s labels, uses them as positive and negative training examples (the set of negative examples can also be augmented with a random selection from the large collection), trains an interactive classifier, and provides a new set of suggested items, avoiding items seen in previous rounds.
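To make the training step concrete, the following sketch fits a linear SVM by plain subgradient descent on the regularised hinge loss. This is a simple stand-in chosen for self-containment, not necessarily the solver used in Exquisitor; the hyperparameters `epochs`, `lr` and `reg` are arbitrary illustrative values.

```python
def train_linear_svm(positives, negatives, epochs=200, lr=0.1, reg=0.01):
    """Fit weights and bias by subgradient descent on the hinge loss.

    positives and negatives are lists of equal-length feature vectors,
    labelled relevant and not relevant by the user.
    """
    data = [(x, 1.0) for x in positives] + [(x, -1.0) for x in negatives]
    weights, bias = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # L2 regularisation shrinks the weights at every step.
            weights = [w * (1.0 - lr * reg) for w in weights]
            if margin < 1.0:
                # A margin violation moves the plane towards classifying x correctly.
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def decision(weights, bias, x):
    """Signed score of x; larger positive values suggest higher relevance."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias
```

In each interaction round, the newly labelled items would be appended to `positives` and `negatives` before retraining, and the resulting weights would then be used to score cluster representatives and items.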
4. Experimental Evaluation
In this section, we experimentally analyse the performance and quality of Exquisitor with the YFCC100M collection. We first outline the experimental protocol followed, before describing the results of three key experiments, seeking to answer the following questions:
How does the performance of the Exquisitor approach compare to the state of the art in interactive learning, and what is the influence of the number of clusters () on the tradeoff between latency and result quality?
What is the impact of addressing skew, by omitting large clusters from processing, on latency and result quality?
What is the impact of applying additional CPU cores on the latency of Exquisitor?
4.1. Experimental Setup
As the literature, to the best of our knowledge, contains only one experimental interactive learning protocol that has been applied to the YFCC100M collection (Zahálka et al., 2018a), we have chosen to follow that protocol. This evaluation protocol is inspired by the well-known MediaEval Placing Task (Larson et al., 2011; Choi et al., 2015). From the 2016 edition of the MediaEval Placing Task, we have created artificial actors (or users) by selecting the 50 world cities represented with the largest number of images in the YFCC100M collection. The relevance set of each artificial actor then consists of the images and their associated metadata captured within 1000km from the centre of one city, which is the largest radius used in the Placing Task. A large radius was intentionally selected due to our focus on semantic relevance of the items instead of their exact capturing location. For each actor, the evaluation starts by pre-training the interactive classification model using 100 randomly selected relevant images as the positives and another 200 negative examples randomly selected from the collection. In each interaction round the actor is then presented with the 25 items considered most relevant by the model. As interface design is not the focus of this paper, we choose this number with, e.g., a basic 5x5 grid visualization in mind. Then, the items that are part of the actor’s profile are added to the set of positives and 100 randomly selected items are used as negatives to train the interactive learning model in the subsequent round.
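The relevance criterion for an artificial actor can be illustrated with a short sketch. It assumes that the 1000 km radius is measured as great-circle distance using the haversine formula; the text above does not state how the distance is computed, so this is an assumption, as are the function names and the example coordinates.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_relevant(image_latlon, city_latlon, radius_km=1000.0):
    """True if the image was captured within radius_km of the city centre."""
    return haversine_km(*image_latlon, *city_latlon) <= radius_km
```

Under this assumption, an image taken in Paris (48.86, 2.35) would belong to the relevance set of an actor centred on Amsterdam (52.37, 4.90), whereas an image taken in New York (40.71, -74.01) would not.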
To illustrate the tradeoffs between the interactive performance and result quality, we focus our analysis on precision and latency (response time) per interaction round. It is worth noting that due to both the scale of YFCC100M and its unstructured nature, precision is lower than in experiments involving small and well-curated collections. The important comparison is therefore to Blackthorn (Zahálka et al., 2018a), the only state-of-the-art algorithm capable of handling YFCC100M with interactive performance.
Both Exquisitor and Blackthorn are compiled with g++. All experiments are performed using dual 8-core 2.4 GHz CPUs, with 64GB RAM and 4TB local SSD storage. Note, however, that the collection used in the experiments requires less than 7GB of SSD storage and RAM, and in most experiments Exquisitor uses only a single CPU core.
4.2. Experiment 1: Impact of Search Expansion
In this experiment, we explore the impact of the high-dimensional index. The primary parameter in the scoring process is , the number of clusters read and scored. Figure 4 analyses the impact of on the precision (fraction of relevant items seen) in each round of the interactive exploration. The x-axis shows how many clusters are read for scoring at each round, ranging from to (note the logarithmic scale of the axis), while the y-axis shows the average precision across the first 10 rounds of analysis. The figure shows precision for two Exquisitor variants, with and . In both cases, only one worker is used, . For comparison, the figure also shows the average precision for Blackthorn.
As Figure 4 shows, result quality is surprisingly good when scoring only a single cluster in each interaction round, returning about two-thirds of the precision of the state-of-the-art algorithm. As more clusters are considered, quality then improves further. As expected, dividing the clusters into chunks results in better quality, an effect that becomes more and more pronounced as grows. In particular, with , Exquisitor returns significantly better results than Blackthorn, even though Blackthorn considers every media item in the collection. The reason is that by assigning the relevant clusters to segments, Exquisitor is able to emphasize more the bi-modal media items. Note that as further clusters are added with Exquisitor ( and beyond), the results become more and more similar to the Blackthorn results.
Figure 5, on the other hand, shows the latency per interaction round. The figure shows the two Exquisitor variants, with and ; in both cases, one worker is used, . For comparison, as before, it also shows the average latency for Blackthorn (with 16 CPU cores). Unsurprisingly, Figure 5 shows linear growth in latency with respect to (recall the logarithmic x-axis). With , each interaction round takes about 0.29 seconds with , and about 0.17 seconds with . Both clearly allow for interactive performance; the remainder of our experiments focus on . If even shorter latency is desired, however, fewer clusters can be read: , for example, also gives a good tradeoff between latency and result quality. Recall that this latency is produced using only a single CPU core, meaning that the latency is about 4x better than Blackthorn, with 16x fewer computing cores, for an improvement of about 64x, or nearly two orders of magnitude.
4.3. Experiment 2: Impact of Data Skew
In this experiment, we explore the impact of handling data skew by omitting large clusters from consideration. Figure 6 shows the impact of the parameter on both latency and precision. The x-axis shows , as it is decreased from 10M (no impact, since the largest cluster is 3.5M), to 1M (excluding one cluster in the visual modality and two in the text modality), and then further down to 100 features, where it excludes very large portions of the collection. The y-axis shows the relative impact on both precision and latency, compared to the results when all clusters are considered.
Figure 6 shows that omitting the three largest clusters (1M) improves latency considerably, without any impact on precision. As more and more clusters are omitted from consideration, however, the impact on either parameter is minimal, until , where both latency and precision are improved further. While this is an interesting effect that warrants further exploration, and is most likely related again to the tradeoff between bi-modal and mono-modal items, we leave it to future work to analyse it in detail.
4.4. Experiment 3: Multi-Core Processing
The primary parameter in this experiment is , the number of workers applied to the scoring process. Based on the previous results, we read clusters in segments in each iteration, and omit from consideration clusters larger than 1M or , respectively. Note that as workers are added, each worker reads a proportionately smaller share of the segments, so precision is not affected by adding workers. We therefore focus on latency.
Figure 7 shows the evolution of latency, relative to a single worker, as the number of workers (or CPU cores) is varied from to . As the figure shows, response time is improved somewhat by adding more workers, with latency improved by about 40% with workers. The reason latency is not improved further is that a) much of the processing in each interaction round takes place on a single CPU core, including updating the linear SVM model and merging the suggestion lists produced by workers, and b) the scoring process is already so efficient that it does not benefit further from the added CPU cores.
5. Conclusion
In this paper, we presented Exquisitor, a new approach for exploratory analysis of very large image collections with modest computational requirements. Exquisitor combines state-of-the-art large-scale interactive learning with a new cluster-based retrieval mechanism, enhancing the relevance capabilities of interactive learning by exploiting the inherent structure of the data. Experiments on YFCC100M, the largest publicly available multimedia collection, show that Exquisitor achieves higher precision and lower latency, with fewer computational resources, resulting in performance improvements of nearly two orders of magnitude. Additionally, Exquisitor introduces customizability that is, to the best of our knowledge, previously unseen in large-scale interactive learning, by: (i) allowing a tradeoff between low latency (few clusters) and high quality (many clusters); (ii) combating data skew by omitting huge (and thus likely nondescript) clusters from consideration; and (iii) improving latency by adding CPU cores, without any impact on quality. In conclusion, Exquisitor provides the best performance on very large collections while being efficient enough to bring large-scale multimedia analytics to standard desktops and laptops, and even high-end mobile devices.
- Abbar et al. (2013a) Sofiane Abbar, Sihem Amer-Yahia, Piotr Indyk, and Sepideh Mahabadi. 2013a. Real-time recommendation of diverse related articles. In Proc. WWW. International World Wide Web Conferences Steering Committee/ACM, Rio de Janeiro, Brazil, 1–12.
- Abbar et al. (2013b) Sofiane Abbar, Sihem Amer-Yahia, Piotr Indyk, Sepideh Mahabadi, and Kasturi R. Varadarajan. 2013b. Diverse near neighbor problem. In Proc. Symposium on Computational Geometry (SoCG). ACM, Rio de Janeiro, Brazil, 207–214.
- Agarwal et al. (1991) Pankaj K. Agarwal, Jiří Matoušek, and Subhash Suri. 1991. Farthest Neighbors, Maximum Spanning Trees and Related Problems in Higher Dimensions. Comput. Geom. 1 (1991), 189–201.
- Amsaleg et al. (2018) Laurent Amsaleg, Björn Þór Jónsson, and Herwig Lejsek. 2018. Scalability of the NV-tree: Three Experiments. In Proc. SISAP. Springer, Lima, Peru, 59–72.
- Andoni and Indyk (2006) Alexandr Andoni and Piotr Indyk. 2006. Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. In Proceedings of the IEEE Symposium on the Foundations of Computer Science. IEEE Computer Society, Berkeley, CA, USA, 459–468.
- Babenko and Lempitsky (2012) Artem Babenko and Victor S. Lempitsky. 2012. The inverted multi-index. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition. IEEE, Providence, RI, USA, 3069–3076.
- Babenko and Lempitsky (2015) Artem Babenko and Victor S. Lempitsky. 2015. The Inverted Multi-Index. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 6 (2015), 1247–1260.
- Baranchuk et al. (2018) Dmitry Baranchuk, Artem Babenko, and Yury Malkov. 2018. Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors. In Proc. ECCV. Springer, Munich, Germany, 209–224.
- Barroso et al. (2017) Luiz Barroso, Mike Marty, David Patterson, and Parthasarathy Ranganathan. 2017. Attack of the Killer Microseconds. CACM 60, 4 (March 2017), 48–54.
- Basri et al. (2011) Ronen Basri, Tal Hassner, and Lihi Zelnik-Manor. 2011. Approximate Nearest Subspace Search. IEEE Trans. Pattern Anal. Mach. Intell. 33, 2 (2011), 266–278.
- Beluch et al. (2018) William H Beluch, Tim Genewein, Andreas Nürnberger, and Jan M Köhler. 2018. The power of ensembles for active learning in image classification. In Proc. IEEE CVPR. IEEE Computer Society, Salt Lake City, UT, USA, 9368–9377.
- Björling et al. (2017) Matias Björling, Javier Gonzalez, and Philippe Bonnet. 2017. LightNVM: The Linux Open-Channel SSD Subsystem. In Proc. USENIX Conference on File and Storage Technologies (FAST). USENIX Association, Santa Clara, CA, USA, 359–374.
- Chang and Lin (2011) Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2 (2011), 27:1–27:27. Issue 3. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- Choi et al. (2015) Jaeyoung Choi, Claudia Hauff, Olivier Van Laere, and Bart Thomee. 2015. The placing task at MediaEval 2015. In Proceedings of the MediaEval 2015 Workshop. CEUR, Wurzen, Germany, 2.
- Cohn et al. (1996) David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. 1996. Active Learning with Statistical Models. JAIR 4, 1 (March 1996), 129–145.
- Crucianu et al. (2008) Michel Crucianu, Daniel Estevez, Vincent Oria, and Jean-Philippe Tarel. 2008. Speeding up active relevance feedback with approximate kNN retrieval for hyperplane queries. Int. J. Imaging Systems and Technology 18, 2-3 (2008), 150–159.
- Curtin and Gardner (2016) Ryan R. Curtin and Andrew B. Gardner. 2016. Fast Approximate Furthest Neighbors with Data-Dependent Candidate Selection. In Proc. SISAP. Springer, Tokyo, Japan, 221–235.
- Datar et al. (2004) Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. ACM Symposium on Computational Geometry. ACM, Brooklyn, NY, USA, 253–262.
- de Berg et al. (2008) Mark de Berg, Otfried Cheong, Marc J. van Kreveld, and Mark H. Overmars. 2008. Computational geometry: algorithms and applications, 3rd Edition. Springer, Berlin.
- Flickner et al. (1995) M. Flickner, H. Sawhney, W. Niblack, J. Ashley, , B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. 1995. Query by image and video content: the QBIC system. Computer 28, 9 (Sep. 1995), 23–32. https://doi.org/10.1109/2.410146
- Ge et al. (2014) Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2014. Optimized Product Quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 4 (2014), 744–755.
- Gionis et al. (1999) Aristides Gionis, Piotr Indyk, and Rajeev Motwani. 1999. Similarity Search in High Dimensions via Hashing. In Proceedings of the International Conference on Very Large Data Bases. Morgan Kaufmann, Edinburgh, Scotland, 518–529.
- Gudmundsson et al. (2012) Gylfi Þór Gudmundsson, Laurent Amsaleg, and Björn Þór Jónsson. 2012. Impact of Storage Technology on the Efficiency of Cluster-based High-dimensional Index Creation. In Proc. International Conference on Database Systems for Advanced Applications (DASFAA). Springer, Busan, South Korea, 53–64.
- Gudmundsson et al. (2017) Gylfi Þór Gudmundsson, Laurent Amsaleg, Björn Þór Jónsson, and Michael J. Franklin. 2017. Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark. In Proc. ACM Multimedia Systems Conference (MMSys). ACM, Taipei, Taiwan, 1–12.
- Gudmundsson et al. (2010) Gylfi Þór Gudmundsson, Björn Þór Jónsson, and Laurent Amsaleg. 2010. A Large-scale Performance Study of Cluster-based High-dimensional Indexing. In Proc. International Workshop on Very-large-scale Multimedia Corpus, Mining and Retrieval (VLS-MCMR). ACM, Firenze, Italy, 31–36.
- Heo et al. (2014) Jae-Pil Heo, Zhe Lin, and Sung-Eui Yoon. 2014. Distance Encoded Product Quantization. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition. IEEE Computer Society, Columbus, OH, USA, 2139–2146.
- Huang et al. (2008) T.S. Huang, C.K. Dagli, S. Rajaram, E.Y. Chang, M.I. Mandel, Graham E. Poliner, and D.P.W. Ellis. 2008. Active Learning for Interactive Multimedia Retrieval. Proc. IEEE 96, 4 (2008), 648–667. https://doi.org/10.1109/JPROC.2008.916364
- Huijser and v. Gemert (2017) M. Huijser and J. C. v. Gemert. 2017. Active Decision Boundary Annotation with Deep Generative Models. In Proc. IEEE International Conference on Computer Vision (ICCV). IEEE Computer Society, Venice, Italy, 5296–5305.
- Indyk (2003) Piotr Indyk. 2003. Better algorithms for high-dimensional proximity problems via asymmetric embeddings. In Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA). ACM/SIAM, Baltimore, MD, USA, 539–545.
- Iscen et al. (2018) A. Iscen, T. Furon, V. Gripon, M. Rabbat, and H. Jégou. 2018. Memory Vectors for Similarity Search in High-Dimensional Spaces. IEEE Transactions on Big Data 4, 1 (March 2018), 65–77. https://doi.org/10.1109/TBDATA.2017.2677964
- Jain et al. (2008) Prateek Jain, Brian Kulis, and Kristen Grauman. 2008. Fast image search for learned metrics. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition. IEEE Computer Society, Anchorage, AK, USA.
- Jain et al. (2010) Prateek Jain, Sudheendra Vijayanarasimhan, and Kristen Grauman. 2010. Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. In Proc. Conference on Neural Information Processing Systems (NIPS). Curran Associates, Inc., Vancouver, BC, Canada, 928–936.
- Jégou et al. (2011) Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 1 (2011), 117–128.
- Jiang et al. (2015) Lu Jiang, Shoou-I Yu, Deyu Meng, Yi Yang, Teruko Mitamura, and Alexander G. Hauptmann. 2015. Fast and Accurate Content-based Semantic Search in 100M Internet Videos. In Proc. ACM Multimedia. ACM, Brisbane, Australia, 49–58.
- Jin et al. (2013) Zhongming Jin, Yao Hu, Yue Lin, Debing Zhang, Shiding Lin, Deng Cai, and Xuelong Li. 2013. Complementary Projection Hashing. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, Barcelona, Spain, 257–264.
- Joly and Buisson (2008) Alexis Joly and Olivier Buisson. 2008. A posteriori multi-probe locality sensitive hashing. In Proceedings of the ACM International Conference on Multimedia. ACM, Vancouver, BC, Canada, 209–218.
- Kalantidis and Avrithis (2014) Yannis S. Kalantidis and Yannis Avrithis. 2014. Locally Optimized Product Quantization for Approximate Nearest Neighbor Search. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition. IEEE Computer Society, Columbus, OH, USA, 2329–2336.
- Kelly (2019) Gordon Kelly. Jan 15, 2019. Samsung Holding Back Galaxy S10’s RAM And Storage? (Jan 15, 2019). https://www.forbes.com/sites/gordonkelly/2019/01/15/samsung-galaxy-s10-upgrade-release-date-cost-specs-ram-storage-galaxy-s9-note9/
- Kovashka et al. (2015) Adriana Kovashka, Devi Parikh, and Kristen Grauman. 2015. WhittleSearch: Interactive Image Search with Relative Attribute Feedback. International Journal of Computer Vision 115, 2 (01 Nov 2015), 185–210. https://doi.org/10.1007/s11263-015-0814-0
- Larson et al. (2011) Martha Larson, Mohammad Soleymani, Pavel Serdyukov, Stevan Rudinac, Christian Wartena, Vanessa Murdock, Gerald Friedland, Roeland Ordelman, and Gareth J. F. Jones. 2011. Automatic Tagging and Geotagging in Video Collections and Communities. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval (ICMR ’11). ACM, New York, NY, USA, Article 51, 8 pages. https://doi.org/10.1145/1991996.1992047
- Lejsek et al. (2006) Herwig Lejsek, Friðrik Heiðar Ásmundsson, Björn Þór Jónsson, and Laurent Amsaleg. 2006. Scalability of local image descriptors: a comparative study. In Proc. ACM Multimedia. ACM, Santa Barbara, CA, USA, 589–598.
- Lejsek et al. (2009) Herwig Lejsek, Friðrik Heiðar Ásmundsson, Björn Þór Jónsson, and Laurent Amsaleg. 2009. NV-Tree: An Efficient Disk-Based Index for Approximate Search in Very Large High-Dimensional Collections. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 5 (2009), 869–883.
- Lejsek et al. (2011) Herwig Lejsek, Björn Þór Jónsson, and Laurent Amsaleg. 2011. NV-Tree: nearest neighbors at the billion scale. In Proceedings of the ACM International Conference on Multimedia Retrieval. ACM, Trento, Italy, Article 54, 8 pages.
- Lokoč et al. (2018) J. Lokoč, W. Bailer, K. Schoeffmann, B. Muenzer, and G. Awad. 2018. On Influential Trends in Interactive Video Retrieval: Video Browser Showdown 2015–2017. IEEE Transactions on Multimedia 20, 12 (Dec 2018), 3361–3376. https://doi.org/10.1109/TMM.2018.2830110
- Lv et al. (2007) Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. 2007. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, Vienna, Austria, 950–961.
- Mironică et al. (2016) Ionuţ Mironică, Bogdan Ionescu, Jasper Uijlings, and Nicu Sebe. 2016. Fisher Kernel Temporal Variation-based Relevance Feedback for video retrieval. Computer Vision and Image Understanding 143 (2016), 38 – 51. https://doi.org/10.1016/j.cviu.2015.10.005 Inference and Learning of Graphical Models Theory and Applications in Computer Vision and Image Analysis.
- North (2006) C. North. 2006. Toward measuring visualization insight. IEEE Computer Graphics and Applications 26, 3 (May 2006), 6–9. https://doi.org/10.1109/MCG.2006.70
- Pagh et al. (2015) Rasmus Pagh, Francesco Silvestri, Johan Sivertsen, and Matthew Skala. 2015. Approximate Furthest Neighbor in High Dimensions. In Proc. SISAP, Vol. 9371. Springer, Glasgow, Scotland, 3–14.
- Pagh et al. (2017) Rasmus Pagh, Francesco Silvestri, Johan Sivertsen, and Matthew Skala. 2017. Approximate furthest neighbor with application to annulus query. Inf. Syst. 64 (2017), 152–162.
- Paulevé et al. (2010) Loïc Paulevé, Hervé Jégou, and Laurent Amsaleg. 2010. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters 31, 11 (2010), 1348–1358.
- Philbin et al. (2008) James Philbin, Ondra Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. 2008. Lost in quantization: Improving particular object retrieval in large scale image databases. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition. IEEE Computer Society, Anchorage, AK, USA.
- Řehůřek and Sojka (2010) Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.
- Rudinac et al. (2018) Stevan Rudinac, Tat-Seng Chua, Nicolas Diaz-Ferreyra, Gerald Friedland, Tatjana Gornostaja, Benoit Huet, Rianne Kaptein, Krister Lindén, Marie-Francine Moens, Jaakko Peltonen, Miriam Redi, Markus Schedl, David A. Shamma, Alan Smeaton, and Lexing Xie. 2018. Rethinking Summarization and Storytelling for Modern Social Multimedia. In MultiMedia Modeling, Klaus Schoeffmann, Thanarat H. Chalidabhongse, Chong Wah Ngo, Supavadee Aramvith, Noel E. O’Connor, Yo-Sung Ho, Moncef Gabbouj, and Ahmed Elgammal (Eds.). Springer International Publishing, Cham, 632–644.
- Rui et al. (1997) Y. Rui, T. S. Huang, and S. Mehrotra. 1997. Content-based image retrieval with relevance feedback in MARS. In Proc. International Conference on Image Processing (ICIP). IEEE Computer Society, Santa Barbara, CA, USA, 815–818.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115, 3 (01 Dec 2015), 211–252. https://doi.org/10.1007/s11263-015-0816-y
- Said et al. (2013) Alan Said, Ben Fields, Brijnesh J. Jain, and Sahin Albayrak. 2013. User-centric evaluation of a K-furthest neighbor collaborative filtering recommender algorithm. In Proc. CSCW. ACM, San Antonio, TX, USA, 1399–1408.
- Schoeffmann et al. (2018) Klaus Schoeffmann, Werner Bailer, Cathal Gurrin, George Awad, and Jakub Lokoč. 2018. Interactive Video Search: Where is the User in the Age of Deep Learning? In Proc. ACM Multimedia. ACM, Seoul, Republic of Korea, 2101–2103.
- Shakhnarovich et al. (2006) Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk. 2006. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. Pattern Analysis and Applications 11, 2 (2006).
- Sigurðardóttir et al. (2005) Rut Sigurðardóttir, Hlynur Hauksson, Björn Þór Jónsson, and Laurent Amsaleg. 2005. The Quality vs. Time Tradeoff for Approximate Image Descriptor Search. In Proc. IEEE EMMA workshop. IEEE, Tokyo, Japan.
- Snoek et al. (2008) C.G.M. Snoek, M. Worring, O. de Rooij, K.E.A. van de Sande, Rong Yan, and A.G. Hauptmann. 2008. VideOlympics: Real-Time Evaluation of Multimedia Retrieval Systems. IEEE MM 15, 1 (2008), 86–91. https://doi.org/10.1109/MMUL.2008.21
- Szegedy et al. (2015) C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In Proc. IEEE CVPR. IEEE Computer Society, Boston, MA, USA, 1–9.
- Tao et al. (2009) Yufei Tao, Ke Yi, Cheng Sheng, and Panos Kalnis. 2009. Quality and Efficiency in High Dimensional Nearest Neighbor Search. In Proceedings of the ACM International Conference on Management of Data. ACM, Boston, MA, USA, 563–576.
- Tavenard et al. (2011) Romain Tavenard, Hervé Jégou, and Laurent Amsaleg. 2011. Balancing clusters to reduce response time variability in large scale image search. In International Workshop on Content-Based Multimedia Indexing. IEEE, Madrid, Spain.
- Thomee et al. (2016) Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The New Data in Multimedia Research. Commun. ACM 59, 2 (2016), 64–73.
- Vijayanarasimhan et al. (2014) Sudheendra Vijayanarasimhan, Prateek Jain, and Kristen Grauman. 2014. Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. IEEE Trans. Pattern Anal. Mach. Intell. 36, 2 (2014), 276–288.
- Wang et al. (2010) Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. 2010. Semi-supervised hashing for scalable image retrieval. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition. IEEE Computer Society, San Francisco, CA, USA, 3424–3431.
- Wang et al. (2018) J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen. 2018. A Survey on Learning to Hash. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 4 (April 2018), 769–790. https://doi.org/10.1109/TPAMI.2017.2699960
- Wang et al. (2017) Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. 2017. Cost-Effective Active Learning for Deep Image Classification. IEEE Trans. Cir. and Sys. for Video Technol. 27, 12 (2017), 2591–2600.
- Weiss et al. (2008) Yair Weiss, Antonio Torralba, and Robert Fergus. 2008. Spectral Hashing. In Neural Information Processing Systems. Curran Associates, Vancouver, BC, Canada, 1753–1760.
- Williams (2004) Ryan Williams. 2004. A New Algorithm for Optimal Constraint Satisfaction and Its Implications. In Proc. ICALP. Springer, Turku, Finland, 1227–1237.
- Xioufis et al. (2014) Eleftherios Spyromitros Xioufis, Symeon Papadopoulos, Yiannis Kompatsiaris, Grigorios Tsoumakas, and Ioannis P. Vlahavas. 2014. A Comprehensive Study Over VLAD and Product Quantization in Large-Scale Image Retrieval. IEEE Transactions on Multimedia 16, 6 (2014), 1713–1728.
- Xu et al. (2017) Xiao-Jun Xu, Jin-Song Bao, Bin Yao, Jingyu Zhou, Feilong Tang, Minyi Guo, and Jianqiu Xu. 2017. Reverse Furthest Neighbors Query in Road Networks. J. Comput. Sci. Technol. 32, 1 (2017), 155–167.
- Yang et al. (2015) Yi Yang, Zhigang Ma, Feiping Nie, Xiaojun Chang, and Alexander G. Hauptmann. 2015. Multi-Class Active Learning by Uncertainty Sampling with Diversity Maximization. International Journal of Computer Vision 113, 2 (01 Jun 2015), 113–127. https://doi.org/10.1007/s11263-014-0781-x
- Zahálka et al. (2015) Jan Zahálka, Stevan Rudinac, and Marcel Worring. 2015. Analytic Quality: Evaluation of Performance and Insight in Multimedia Collection Analysis. In Proc. ACM Multimedia. ACM, Brisbane, Australia, 231–240.
- Zahálka et al. (2018a) J. Zahálka, S. Rudinac, B. Þ. Jónsson, D. C. Koelma, and M. Worring. 2018a. Blackthorn: Large-Scale Interactive Multimodal Learning. IEEE Transactions on Multimedia 20, 3 (March 2018), 687–698. https://doi.org/10.1109/TMM.2017.2755986
- Zahálka et al. (2018b) J. Zahálka, S. Rudinac, B. Þ. Jónsson, D. C. Koelma, and M. Worring. 2018b. Blackthorn: Large-Scale Interactive Multimodal Learning. IEEE Transactions on Multimedia 20, 3 (March 2018), 687–698. https://doi.org/10.1109/TMM.2017.2755986
- Zahálka and Worring (2014) J. Zahálka and M. Worring. 2014. Towards interactive, intelligent, and integrated multimedia analytics. In Proc. IEEE Conference on Visual Analytics Science and Technology (VAST). IEEE Computer Society, Paris, France, 3–12.
- Zhang et al. (2011) Dongxiang Zhang, D. Agrawal, Gang Chen, and A. Tung. 2011. HashFile: An efficient index structure for multimedia data. In Proceedings of the IEEE International Conference on Data Engineering. IEEE Computer Society, Hannover, Germany, 1103–1114.