Give me a hint! Navigating Image Databases using Human-in-the-loop Feedback



In this paper, we introduce an attribute-based interactive image search which can leverage human-in-the-loop feedback to iteratively refine image search results. We study active image search where human feedback is solicited exclusively in visual form, without using relative attribute annotations used by prior work which are not typically found in many datasets. In order to optimize the image selection strategy, a deep reinforcement model [26] is trained to learn what images are informative rather than rely on hand-crafted measures typically leveraged in prior work. Additionally, we extend the recently introduced Conditional Similarity Network [39] to incorporate global similarity in training visual embeddings, which results in more natural transitions as the user explores the learned similarity embeddings. Our experiments demonstrate the effectiveness of our approach, producing compelling results on both active image search and image attribute representation tasks.


1 Introduction

In image search applications, the user often has a mental picture of the desired content. The ultimate goal of image search is to convey this mental picture to the system and overcome the difference between the lower-level image representation and the higher-level conceptual content. Describing the desired image may be time consuming, however, and an image search system may not find the image even with an accurate description. To remedy this issue, interactive search techniques (e.g., [3, 6, 10, 11, 21, 24, 32, 37, 48, 49, 50]) obtain user feedback to iteratively refine the results returned by the system, often by asking questions in visual form. In particular, there has been a recent focus on this type of iterative refinement using relative attribute feedback [16, 17, 20, 27]. As seen in Figure 1(a), this enables a system to collect targeted feedback, but it requires relative attribute annotations not typically found in image datasets.


[Figure 1: Examples of User Feedback, panels (a) and (b)]

Figure 1: Comparison of active feedback types for image search. Recent work in iterative image search leverages relative attribute annotations enabling the collection of targeted feedback as seen in (a). However, these relative attributes are not typically collected natively in many datasets. Instead, we use the meta-data or categorical labels already present in most datasets to build a representation for our active image search approach. While this relies on more ambiguous feedback since users can define their own similarity criterion, as shown in (b), our experiments show it is sufficient most of the time.

In this paper, we propose an interactive image search system which does not use relative attribute annotations. Instead, we learn an image embedding trained on meta-data labels which are collected natively in e-commerce datasets. These labels identify attributes with clear definitions (e.g., does the shirt have long sleeves?), and are also useful for other tasks such as organizing and filtering a dataset. Relative attributes, by comparison, can be subjective in nature, and annotators may prefer to label as many as 40% of image pairs as having equal amounts of an attribute [44], making their usefulness beyond active image search unclear. This suggests a trade-off between the low annotation cost of our approach (due to using existing annotations) and expected performance gains from targeted feedback using relative attributes. This paper takes a step towards characterizing the nature of this trade-off.

We begin an interactive image search session by presenting a user with an initial set of candidate images after receiving an initial query. A user simply selects the image which is the most visually similar to their target image (see Figure 1(b) for an example of the expected feedback). We incorporate the new information provided by the user into our model and then select the next set of images. Thus, the goal is to select the most informative images to present to the user on each iteration. While this style of feedback provides less information than many relative attribute approaches, since each user determines their own similarity criteria resulting in different responses, our experiments show it is “good enough” in many cases.

A popular selection criterion is Expected Error Reduction (EER) [2, 16, 18, 23, 25, 33]. This strategy chooses images that provide the largest reduction in the generalization error of the current model, but its high computational cost is disqualifying for many tasks. As such, EER is typically computed on a short list of candidates (e.g., exemplars from hierarchical clusters [16, 23]). We experiment with two low-cost sampling strategies to obtain a candidate list in this work: a nearest neighbor baseline, which largely ignores user feedback, and a criterion that greedily selects images reflecting the feedback from prior iterations. Figure 2 contains an overview of our active image search process.

A limitation of hand-crafted criteria like EER is their inability to take advantage of the interplay between attributes. For example, knowing the target shirt has a collar also provides some information about the type of shirt being searched for. A good sampling criterion should be able to take advantage of such information as well as adapt to user behavior. Thus, we employ reinforcement learning using a Deep Q-Network [26] to learn to select informative images, which we use to refine our list of candidates.

Figure 2: Search refinement process. At test time we are given an initial query as input to our system. On each iteration, we search our database using our “candidate selector” strategies to obtain an initial set of candidates. We use a “candidate re-ranker” on this set of images using informative, but computationally expensive selection criteria. During the “user feedback” step the user indicates if the new refined candidates are more representative of their desired image or not. If they accept a new image, it becomes the query for the next iteration.

To learn a good feature representation for our task, we introduce several enhancements to the state-of-the-art Conditional Similarity Network (CSN) [39]. This model trains a single network to learn an embedding for multiple attributes. It accomplishes this by learning a masking function for each attribute which selects the features in a general image representation that are important for separating images in that concept space. This provides multiple views of the images in our database, which has proven useful on similar tasks (e.g., [15, 16, 17, 27]) and tends to perform better than training separate embedding models for each concept. Whereas the authors of the CSN model considered the label for each embedding in isolation (i.e., an embedding trained for colors would only care if both images were blue), we also factor in the overall similarity between two images when training our representation. The resulting model encourages samples to separate into homogeneous subgroups in each embedding space. Therefore, as we traverse an attribute embedding, e.g., heel height, a transition from a boot to a stiletto in a single step would be unlikely even if they both have the same sized heel. Combined with constraints which enable us to better exploit our training data, we show significant performance improvements in measuring the similarity between two images with regard to a specific concept. We provide an overview of the components of our approach in Figure 3.

Figure 3: Model Overview. Our model consists of three major components. First, we train a feature extractor which computes an embedding representation for each image in our database that can be projected into an attribute-specific space using a learned mask. These are fed into our Candidate Selectors, which obtain a list of likely candidates. Finally, from these candidates we select the most informative image according to each attribute using a DQN consisting of three fully connected layers followed by a ReLU.

Our primary contributions are summarized below:

  • We build a system which refines image search results without using the relative attribute annotations or attribute inputs required in prior work.

  • We introduce enhancements to the Conditional Similarity Network which encourage smooth transitions as we traverse the learned embedding space (Section 3.1).

  • We propose a Deep Q-Network-based selection criterion instead of hand-crafted methods (Section 3.2).

Our experiments in Section 4 show our image representation reduces attribute matching errors by 2.5-3% on the UT Zap-50K [44] and OSR [28] datasets while also finding a specific image in a database faster than hand-crafted sampling strategies.

2 Related Work

Attribute-based interactive image search. A key difference between this paper and prior work (e.g., [17]) is that we train our models using annotations which already exist in many datasets. Much of the recent work on this task has focused on how to best utilize models trained using expensive relative attribute annotations (e.g., [16, 17, 27]) or requires a user to specify which attributes their target image has (e.g., [48, 49]). These assumptions provide attribute-specific feedback so the model knows exactly how to alter the current image to make it more like the target image, but are more costly than our approach in terms of annotations, user requirements, or both.

Reinforcement learning and active learning. Recently there has been an ever-growing trend of abandoning hand-crafted approaches in favor of learned models. Training models for selecting informative examples in active learning, however, has primarily focused on how best to combine hand-crafted sampling strategies (e.g., [1, 29]). In [8], the authors used reinforcement learning to select which hand-crafted strategy to use on each iteration. This idea was extended in [22] to select which annotator to use as well as to find informative samples. In contrast, our approach creates an entirely new criterion rather than combining hand-crafted strategies, sharing a similar spirit with some early work in relevance feedback (e.g., [42, 43]).

Relative Attributes. Prior work in predicting relative attributes includes using pairwise supervision to learn linear rankers [30], multi-task learning [4], and fusing binary and relative attribute labels in a model that makes predictions for both types of labels [40]. Some works found that training local rankers could lead to improved performance [44, 45]. Deep networks have also been used to rank attributes [36] as well as localize them [35]. However, most of these approaches rely solely on expensive pairwise supervision in order to train their models. An exception is Yu et al. [46], which augmented their annotated pairs with synthetically generated images. In our work, we use labels which are typically found natively in many datasets rather than rely on relative attribute annotations.

3 Image Search with Active Feedback

Our objective is to quickly locate a target image in a database given a query. While the initial query can take multiple forms (e.g., keywords, images [5, 12], or sketches [47]), we will assume it is provided as an image which shares some desirable attribute with the target image. In order to locate the target image by obtaining feedback from the user, we need a representation where we can measure similarity between images in the database as well as a selection strategy which uses this representation to find informative images to present to the user. In this paper we provide enhancements for both learning the image representation, which we discuss in Section 3.1, and sampling strategies, which we present in Section 3.2.

3.1 Globally-Consistent Attribute Embeddings

To compare two images, we train a set of embeddings, each representing a different attribute we wish to capture. This provides multiple senses of each image which we can use to select informative images to present to the user. Due to its state-of-the-art performance and efficiency, we chose to build upon the CSN model [39], which we briefly review before describing our modifications.

Conditional Similarity Network

The CSN model was designed to learn a disentangled embedding for different attributes in a single model. A general image representation is learned through the image encoding layers of the model. Then a trained mask is applied to the representation to isolate the features important to that specific attribute. This enables each embedding to share some common parameters across concepts, while the mask is tasked with transforming the features into a discriminative representation. After obtaining the general embedding features $G_i, G_j$ of two images, they are compared using a masked distance function,

$$D_m(G_i, G_j; m_a) = \lVert G_i \odot m_a - G_j \odot m_a \rVert_2, \tag{1}$$

where $m_a$ is the mask for some attribute $a$ and $\odot$ denotes an element-wise multiplication. Then, given a triplet of embedding features $\{G_x, G_y, G_z\}$ where the pair $(G_x, G_y)$ shares the same attribute label, which is not shared by $G_z$, the CSN model is trained using the margin-based loss function given by

$$L_T(G_x, G_y, G_z; m_a) = \max\{0,\; D_m(G_x, G_y; m_a) - D_m(G_x, G_z; m_a) + h\}, \tag{2}$$

where $h$ controls the minimum margin between positive and negative pairs. The general unmasked embedding representation is L2 regularized to encourage regularity in the latent embedding space. The masks are regularized to encourage a sparse feature selection. Thus, the complete loss function is

$$L_{CSN}(G_x, G_y, G_z; m_a) = L_T(G_x, G_y, G_z; m_a) + \lambda_1 \lVert G \rVert_2^2 + \lambda_2 \lVert m_a \rVert_1, \tag{3}$$

where $G$ denotes the general embedding features of the triplet and $\lambda_1, \lambda_2$ are scalar parameters.
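As a minimal sketch of the loss described above, assuming the standard CSN formulation (a masked Euclidean distance, a hinge triplet loss, an L2 penalty on the general embeddings, and an L1 penalty on the mask), the computation for a single triplet could look as follows. All function and parameter names are illustrative, and summing the L2 penalty over the three embeddings of the triplet is an assumption rather than a detail confirmed by the text.

```python
import math

def masked_distance(g_i, g_j, mask):
    """Euclidean distance between two embeddings after element-wise masking."""
    return math.sqrt(sum((m * (a - b)) ** 2 for a, b, m in zip(g_i, g_j, mask)))

def triplet_loss(g_x, g_y, g_z, mask, margin):
    """Hinge loss: the positive pair (x, y) should be closer than the negative pair (x, z)."""
    pos = masked_distance(g_x, g_y, mask)
    neg = masked_distance(g_x, g_z, mask)
    return max(0.0, pos - neg + margin)

def csn_loss(g_x, g_y, g_z, mask, margin, lambda_embed, lambda_mask):
    """Triplet loss plus an L2 penalty on the embeddings and an L1 penalty on the mask."""
    # Assumption: the L2 penalty is summed over all three embeddings of the triplet.
    l2_embed = sum(v * v for g in (g_x, g_y, g_z) for v in g)
    l1_mask = sum(abs(m) for m in mask)
    return (triplet_loss(g_x, g_y, g_z, mask, margin)
            + lambda_embed * l2_embed
            + lambda_mask * l1_mask)
```

In a training framework these functions would operate on batched tensors with automatic differentiation; the plain-Python version is only meant to make the terms of the loss explicit.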

We modify the original CSN model by L2 normalizing the final attribute representation (i.e., the masked embedding features in Eq. (1)), as this tends to make training more stable [34]. In addition, since the masks can be viewed as an attention over the general embedding features, we force them to sum to 1, as is typically done for attention models (e.g., [41]).
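The two constraints can be illustrated with a short sketch. The text does not say how the masks are forced to sum to 1, so the clip-and-rescale step below is an assumption (a softmax over mask logits would be another option), and the function names are hypothetical.

```python
import math

def normalize_mask(raw_mask):
    """Clip to non-negative values and rescale so the entries sum to 1 (assumed scheme)."""
    clipped = [max(0.0, m) for m in raw_mask]
    total = sum(clipped)
    if total == 0.0:
        # Fall back to a uniform mask when every entry was clipped to zero.
        return [1.0 / len(raw_mask)] * len(raw_mask)
    return [m / total for m in clipped]

def attribute_embedding(general, mask, eps=1e-12):
    """Apply the mask element-wise, then L2-normalize the masked embedding."""
    masked = [g * m for g, m in zip(general, mask)]
    norm = math.sqrt(sum(v * v for v in masked))
    return [v / max(norm, eps) for v in masked]
```

After normalization every attribute embedding lies on the unit sphere, so distances in different attribute spaces share the same scale.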

Incorporating Global Compatibility

Since our goal is to traverse our embeddings in order to locate some target image, it is desirable that they provide natural transitions from image to image. For example, if we were to transition from the anchor image to the rightmost image in Figure 4, it would be considered a significant divergence. The center image, while still different, seems like a more logical transition even though all three images belong to the boot category. Therefore, to make our embedding spaces more intuitive, we also take into account the overall similarity between two images beyond the attribute being encoded. Given the sets of attributes $A_x, A_y, A_z$ for the anchor, positive, and negative images in our training triplet, we compute the difference in shared attributes between the positive and negative pairs:

$$w = \max\left\{0,\; \frac{1}{\epsilon}\left(|A_x \cap A_y| - |A_x \cap A_z|\right)\right\}, \tag{4}$$

where $\epsilon$ represents the number of embeddings being trained. We prevent negative values of $w$ to maintain a minimum margin between negative and positive pairs of the triplet. We define our new margin for the triplet loss of Section 3.1.1 as

$$h' = h + \eta w, \tag{5}$$

where $h$ is the original margin and $\eta$ is a scalar parameter.
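A small sketch of one way to realize this margin adjustment is given below. It assumes the weight is the clipped difference between the number of attribute labels the anchor shares with the positive image and with the negative image, divided by the number of embeddings, and that the new margin adds a scaled version of this weight to the base margin. The dictionary representation and the names `shared_count`, `global_margin`, and `eta` are illustrative.

```python
def shared_count(attrs_a, attrs_b):
    """Number of attributes labeled in both images with the same value."""
    return sum(1 for name, value in attrs_a.items() if attrs_b.get(name) == value)

def global_margin(base_margin, attrs_anchor, attrs_pos, attrs_neg, num_embeddings, eta):
    """Enlarge the triplet margin when the positive pair is globally more similar."""
    diff = shared_count(attrs_anchor, attrs_pos) - shared_count(attrs_anchor, attrs_neg)
    # Clip at zero so the margin never drops below the base margin.
    weight = max(0.0, diff / num_embeddings)
    return base_margin + eta * weight
```

Images with missing labels (the attribute labeling is sparse) simply omit those keys, so unlabeled attributes never count as shared.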

[Figure 4 image labels: Anchor Image; Shared Attributes with Anchor: boot, boot]
Figure 4: During training, we take into account the overall similarity between images based on the number of shared attributes. This encourages the model to maintain the left-to-right ordering of the images above in a category embedding space even though they all belong to the boot category.

3.2 Image Sampling Strategies

Using the representation for the images in our database from Section 3.1, our task is to select the most informative images to present to the user for feedback in order to quickly locate the target image. We begin by obtaining a short list of candidate images in Section 3.2.1, before refining this list using more powerful, but computationally expensive, methods in Section 3.2.2.

Candidate Selection Methods

Most selection strategies focus on trying to reduce uncertainty in the current model, or exploit the information obtained in order to make fine-grained distinctions. In practice, however, many search engines provide means to filter results based on meta-data labels. For example, when searching for clothing, a search engine may allow you to filter results based on category (e.g., pants), subcategory (e.g., jeans), and color, amongst others. Coupled with the initial query, this provides a strong signal to initialize an active learning algorithm. Thus, the criteria that follow focus on the exploitation of existing knowledge.

Nearest Neighbors. As a baseline, we perform an iterative nearest neighbors query to obtain candidate images. Given query image $q$, this method returns the $k$-nearest neighbors of $q$ that have not been previously selected. Whichever image is selected as most relevant to the target image is used as the query in the next iteration.
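The baseline above can be sketched as a brute-force search, assuming each image is stored as a vector in a dictionary keyed by a hypothetical image id. A real system would likely use an approximate nearest neighbor index instead of sorting the whole database.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def nearest_unseen(query_id, embeddings, seen, k):
    """Return the ids of the k nearest images to the query that were not shown before."""
    query = embeddings[query_id]
    candidates = [i for i in embeddings if i != query_id and i not in seen]
    # Sort by distance to the query; the id is a deterministic tie-breaker.
    candidates.sort(key=lambda i: (euclidean(query, embeddings[i]), i))
    return candidates[:k]
```

Between iterations, the caller would add every presented image to `seen` and replace `query_id` with the image the user selected.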

Feedback Constraint Satisfaction. Inspired by [17], we find the samples which satisfy the maximum number of feedback constraints provided by the user. On each iteration where a new candidate query $q^*_{i+1}$ is accepted by the user, we know that $q^*_{i+1}$ is closer to the target image than the current query $q_i$. Analogously, if the candidate is not accepted, then we know $q^*_{i+1}$ is farther away from the target image than $q_i$. These become constraints $F$, where each element is a tuple $(x, y)$ such that $x$ is closer to the target image than $y$. We define $b$ as a binary variable which indicates whether we want to count the number of satisfied or unsatisfied constraints (i.e., $b = 1$ for this criterion, so we count satisfied constraints). Then we can calculate the portion of constraints in $F$ that a candidate image $o$ satisfies, i.e.,

$$S(o, b \mid F) = \frac{1}{|F|} \sum_{(x, y) \in F} \mathbb{1}_b(o, x, y), \tag{6}$$

where $\mathbb{1}_b$ is an indicator function which, for $b = 1$, returns one if $D(o, x) < D(o, y)$ under some distance function $D$, and, for $b = 0$, returns one if this inequality does not hold. Thus, our criterion for the next proposed query from the set of candidates $O$ is:

$$q^*_{i+1} = \arg\max_{o \in O} S(o, b = 1 \mid F). \tag{7}$$

Ties are broken using nearest neighbors sampling between the candidates and the query image.
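The criterion can be sketched as follows, assuming each constraint is stored as a `(closer, farther)` pair of image ids and that a candidate satisfies a constraint when it is nearer to the `closer` image than to the `farther` one. The data layout and function names are illustrative.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def satisfied_fraction(candidate, constraints, embeddings):
    """Fraction of (closer, farther) constraints the candidate is consistent with."""
    if not constraints:
        return 0.0
    c = embeddings[candidate]
    hits = sum(1 for closer, farther in constraints
               if euclidean(c, embeddings[closer]) < euclidean(c, embeddings[farther]))
    return hits / len(constraints)

def next_query(candidates, constraints, embeddings, query):
    """Pick the candidate satisfying the most constraints; ties go to the one nearest the query."""
    q = embeddings[query]
    return max(candidates,
               key=lambda o: (satisfied_fraction(o, constraints, embeddings),
                              -euclidean(embeddings[o], q)))
```

With no feedback collected yet, every candidate scores zero and the tie-break reduces the criterion to a nearest neighbor lookup around the query.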

Candidate Re-ranking Methods

Many methods that measure how informative a sample is are computationally expensive, making them infeasible to run over a large database. Therefore, we begin by obtaining a short list of candidates using the methods from Section 3.2.1, then re-rank them based on how informative they are. Below we discuss two such re-ranking methods.

Expected Error Reduction. Initially proposed in [33], this refinement strategy focuses on reducing the generalization error of the current model of the desired target image. As such, it can be seen as inherently balancing exploration and exploitation criteria. We measure the entropy of the current model by calculating the portion of constraints an image satisfies as done in Eq. (6), i.e.,

$$H(F) = -\sum_{o} \sum_{b \in \{0, 1\}} S(o, b \mid F) \log S(o, b \mid F), \tag{8}$$

where $S(o, b \mid F)$ is the portion of the feedback constraints $F$ that image $o$ satisfies ($b = 1$) or violates ($b = 0$). We use the highest ranked item which has not been presented to the user, which we denote as $t^*$, as a proxy for the target image when predicting the user's response $r$. The simulated response either accepts or rejects some image $o$ from our short list of candidate images $O$ to create a new constraint. For example, $r = 0$ would indicate that a constraint should be added to $F$ that says $o$ is farther away from the target image than the current query $q$. We decide if a new constraint would be satisfied by measuring the likelihood that the candidate image shares the same attributes with the target image. The candidate images are then selected according to the following:

$$q^* = \arg\min_{o \in O} \sum_{r \in \{0, 1\}} \sigma(r \mid D(o, t^*))\, H(F \cup \{c(o, r)\}), \tag{9}$$

where $c(o, r)$ is the constraint created by response $r$ to image $o$, and $\sigma$ converts the distances in the attribute's embedding space to probabilities using Platt scaling [31], whose parameters we estimate using the training set. Effectively, we select the $o$ which we are the most uncertain about that is also similar to our best guess at the target image.
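The re-ranking step can be sketched as an expected-entropy minimization. This is one plausible reading of the description, not the authors' implementation: entropy is taken as the sum of binary entropies of each image's satisfied-constraint fraction, the acceptance probability comes from a Platt-style sigmoid of the distance to the proxy target, and the sigmoid parameters `a` and `b` are placeholder values rather than fitted ones.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def satisfied_fraction(image, constraints, embeddings):
    """Fraction of (closer, farther) constraints the image is consistent with."""
    if not constraints:
        return 0.5  # assumption: maximal uncertainty before any feedback
    e = embeddings[image]
    hits = sum(1 for closer, farther in constraints
               if euclidean(e, embeddings[closer]) < euclidean(e, embeddings[farther]))
    return hits / len(constraints)

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def model_entropy(constraints, embeddings):
    """Sum of per-image binary entropies over the whole database."""
    return sum(binary_entropy(satisfied_fraction(i, constraints, embeddings))
               for i in embeddings)

def platt(distance, a, b):
    """Platt-style sigmoid mapping a distance to an acceptance probability."""
    return 1.0 / (1.0 + math.exp(a * distance + b))

def expected_entropy(candidate, query, proxy_target, constraints, embeddings, a, b):
    p_accept = platt(euclidean(embeddings[candidate], embeddings[proxy_target]), a, b)
    # Accepting means the candidate is closer to the target than the query, and vice versa.
    if_accept = model_entropy(constraints + [(candidate, query)], embeddings)
    if_reject = model_entropy(constraints + [(query, candidate)], embeddings)
    return p_accept * if_accept + (1.0 - p_accept) * if_reject

def select_eer(candidates, query, proxy_target, constraints, embeddings, a=1.0, b=0.0):
    """Return the candidate whose simulated feedback leaves the least expected entropy."""
    return min(candidates,
               key=lambda o: expected_entropy(o, query, proxy_target,
                                              constraints, embeddings, a, b))
```

Every call to `model_entropy` scans the full database, which illustrates why the criterion is only applied to a short list of candidates.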

Learned Re-ranking Criteria. So far only hand-crafted strategies have been discussed. Learned criteria can easily adapt to the exact task and dataset, making them an attractive option. To this end, we train a Deep Q-Network (DQN) [26] with experience replay to learn how to select informative images. In this paradigm, we learn a function $Q$ that estimates the reward $\rho$ we would get by taking some action $a$ given the current state of the system $s$. We define $\rho$ as the change in the percentile rank of the target image under the current model after obtaining feedback from the user. We represent each image in the list of candidates obtained from the methods in Section 3.2.1 as the difference between its visual embedding and that of the query image. This is fed into our DQN as the current state $s$, which then selects which image to present to the user (i.e., the set of actions asks which image to choose). At test time, the selection criterion simply needs to maximize the expected reward if we were to select image $o$ to present to the user:

$$q^* = \arg\max_{o \in O} Q(s, o). \tag{10}$$
Our model is trained using a Huber loss on top of the temporal difference error between expected and observed rewards.
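The training objective can be illustrated with a scalar sketch of the temporal difference error and the Huber loss applied to it. The bootstrapped target below follows the usual DQN recipe; whether a separate target network is used is not stated in the text, so that detail is left out, and all names are illustrative.

```python
def huber(error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    magnitude = abs(error)
    if magnitude <= delta:
        return 0.5 * error * error
    return delta * (magnitude - 0.5 * delta)

def td_error(q_value, reward, next_q_values, discount, terminal):
    """Predicted Q(s, a) minus the bootstrapped target reward + discount * max Q(s', .)."""
    target = reward if terminal else reward + discount * max(next_q_values)
    return q_value - target

def greedy_action(q_values):
    """Index of the candidate image with the highest estimated reward."""
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

At test time only `greedy_action` is needed: the network scores each candidate and the highest-scoring one is shown to the user.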

4 Experiments

We begin by validating our image representation’s ability to identify if two images share the same attributes in Section 4.1. Then we analyze the ability of our approach to perform our active image search task in Section 4.2.

Dataset. Experiments were performed on the UT Zappos50K (UT Zap-50K) dataset [44]. This dataset consists of just over 50K images taken from the Zappos website, showing shoes in a canonical view against homogeneous backgrounds. Each image has eight meta-data attributes associated with it: category, closure, gender, heel, insole, material, subcategory, and toestyle. Only the category and subcategory labels are required, resulting in a sparse labeling of the remaining attributes. We split images in the dataset by their productID, keeping 5000 products for testing, 1000 for validation, and using the remainder for training.

4.1 Attribute Embedding Experiments

Implementation Details. We generally follow the training procedure described in [39]. For each attribute in the dataset, we randomly sampled 200K triplets for training, 40K for testing, and 20K for the validation set from their respective images. We did not use the same triplets as [39], however, since they split their images randomly, which could result in the same product appearing in both the training and testing splits. Although we tried semi-hard negative sampling of triplets [34], it did not provide performance benefits in our experiments. The models were trained for 200 epochs with a batch size of 256. The best model is selected using the validation set. We set our parameters as the following: (, , and in Eq. (3.1.1), and in Eq. (5). We initialize our model using an 18 layer Deep Residual Network [14] that was trained on ImageNet [7]. All images are resized to before being fed into the network.

Evaluation Metric. Following [39], we report the triplet satisfaction rate of our model (i.e., the percentage of valid triplets among the 320K samples in the test set).
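For concreteness, the metric can be computed as in the sketch below, assuming a triplet counts as valid when the anchor is strictly closer to the positive than to the negative under the attribute's distance function. The `distance` argument is a placeholder for the learned masked distance.

```python
def triplet_satisfaction_rate(triplets, distance):
    """Percentage of (anchor, positive, negative) triplets with d(a, p) < d(a, n)."""
    if not triplets:
        raise ValueError("at least one triplet is required")
    valid = sum(1 for anchor, pos, neg in triplets
                if distance(anchor, pos) < distance(anchor, neg))
    return 100.0 * valid / len(triplets)
```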

Results. Table 1 reports our triplet satisfaction rate on the test set and compares our approach to the state-of-the-art. The first two lines of Table 1(b) show that doubling the number of training triplets for the baseline model results in a very small improvement to performance. However, the third line of Table 1(b) demonstrates that by including the normalization on the embedding outputs and forcing each mask to sum to 1 (referred to in the table as “constraints”), we can better leverage the additional training data, improving our performance by almost 2%. Our full model, which includes these constraints as well as our attention to the global similarity between images (described in Section 3.1.2) results in a 3% improvement over the baseline.

Method                                 Triplets/concept  Accuracy
CSN                                    50K               79.29
CSN                                    100K              79.35
CSN + constraints                      100K              81.27
CSN + constraints + global similarity  100K              82.28
Table 1: Triplet satisfaction rate and number of training triplets used for the UT-Zap50K dataset [44]. Note: we used all eight meta-data labels rather than just the four reported in Veit et al. [39].

Method                                 Category  Closure  Gender  Heel   Insole  Material  Subcategory  Toestyle
CSN                                    93.69     77.17    77.27   88.49  58.64   71.53     90.21        77.31
CSN + constraints                      93.07     79.80    80.15   89.40  60.35   75.00     92.90        79.46
CSN + constraints + global similarity  94.48     81.63    81.37   89.62  61.68   75.75     93.98        79.83
Table 2: Triplet satisfaction rate using models trained with 100K triplets/concept on the UT-Zap50K dataset [44] separated by attribute.

Figure 5: Visualization of the learned embedding. t-SNE [38] of the closure embedding space using our improved CSN model. Starting from the top left and moving down and to the right, in the first two pairs of boxes we see that the embedding has learned to separate sandals based on heel height despite both being slip-ons. The next three show that the embedding has learned subcategories of shoes despite all having lace-up closure mechanisms, with the last pair showing how different closing mechanisms for boots have also been separated.

We break down the performance of our model by the attribute being learned in Table 2. The material and gender attributes reported the largest performance improvement at 4% over the baseline CSN model. The subcategory attribute brought up its performance by just over 3.5%, putting its performance more in line with the category attribute. While our model did improve the category attribute by almost 1%, including just the constraints did lower performance slightly. However, this loss was more than made up by the improvements in the other attributes, and the model which included global similarity did best across all attributes.

To provide insight into the structure of the learned embedding spaces, we provide a t-SNE visualization [38] of the closure attribute in Figure 5. The highlighted boxes show how our embedding has learned to separate shoes with heels from those without, or athletic shoes from dress shoes, despite having the same closing mechanism. This demonstrates how our representation can make more intuitive transitions while navigating the learned embedding space.

4.2 Active Image Search Experiments

Figure 6: Active Image Search Examples. Each row is an example of the images selected by our system as we refine our search results going from our initial query to our target image which is contained by a red box


Figure 7: Comparison of our DQN Refinement strategy over an embedding trained using binary attribute labels and the Attribute Pivots method [16] which utilizes relative attribute annotations.

Implementation Details. On each iteration, we select one image per attribute type (8 total) to present to the user. For our refinement experiments, we select 4 candidates per attribute using our Feedback Constraint Satisfaction criterion (32 total) and then re-rank them to select the top 8 images. In our DQN experiments, we use a replay memory size of 20,000, a discount rate of 0.999, and a batch size of 2048. Our DQN models are trained using simulated user feedback which we describe below. During training, we begin by performing a random action 90% of the time and decay this randomized action rate exponentially until we reach 5%.
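The exploration schedule can be sketched as an exponentially decaying random-action rate with a floor. The start and end rates come from the text; the decay constant `decay_steps` is a hypothetical value, since the text does not report how quickly the rate decays.

```python
import math

def exploration_rate(step, start=0.90, end=0.05, decay_steps=1000.0):
    """Random-action probability: decays exponentially from `start` and is floored at `end`."""
    return max(end, start * math.exp(-step / decay_steps))

def choose_action(q_values, step, rng):
    """Epsilon-greedy choice over candidate images; `rng` is e.g. a random.Random instance."""
    if rng.random() < exploration_rate(step):
        return rng.randrange(len(q_values))  # explore: pick a random candidate
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit
```

Passing the random generator explicitly keeps training runs reproducible when a fixed seed is used.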

Query-Target Pairs. For each attribute in the dataset, we sample 2,000 pairs for training, 500 for validation, and 1,000 for testing, resulting in 16,000, 4,000, and 8,000 pairs, respectively. Each sample was randomly selected from the set of image pairs which share at least one attribute, without further restrictions. This means pairs can be semantically distant from each other (e.g., a boot and a sandal without a heel can be sampled for pairs sharing that attribute), adding to the challenges faced by our model.

User Feedback. User feedback is simulated by averaging the Euclidean distance between each candidate image and the target image across all embedding spaces and selecting the closest candidate. We evaluated the appropriateness of our user feedback mechanism by presenting triplets of images to 13 human annotators. Each triplet contained a target image and a pair of images to evaluate. Annotators were asked to select the image from the pair which most resembles the target image. We obtained 100 triplets selected at random from our active learning experiments (i.e., they exactly reflected the decision process made by our algorithm). After removing triplets where at least 5 annotators disagreed with the majority, our simulated input agreed with the human annotators 79% of the time over the remaining 86 samples. Human performance was similar, as individual annotators agreed with the majority 74-88% of the time (84% average). We also experimented with two other embeddings to use for our simulated feedback: features from our fine-tuned ResNet-18 model initially trained on ImageNet and the unmasked embedding representation output by our model. Relative performance remained consistent across feedback types in our active learning experiments.
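The simulated user can be sketched as below, assuming each attribute embedding space is available as a dictionary from a hypothetical image id to its vector in that space.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def simulated_choice(candidates, target, spaces):
    """Return the candidate with the smallest distance to the target, averaged over all spaces.

    `spaces` is a list with one dict per attribute embedding, mapping image id -> vector.
    """
    def average_distance(image):
        return sum(euclidean(space[image], space[target]) for space in spaces) / len(spaces)
    return min(candidates, key=average_distance)
```

Swapping in a different feature space, such as the unmasked embedding mentioned above, only requires passing a single-element `spaces` list.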

Comparison to Relative Attribute Approaches. In addition to our own baselines, we also adapt our embedding approach to produce relative attribute scores for four common concepts using the annotations provided in [44]. We encode a pair of images using the Conditional Similarity Network to obtain an embedding representation for an attribute. We concatenate the embeddings of the two images and feed them into a fully connected layer followed by a softmax with an output dimension of 3, which is trained jointly with the embeddings. The output of this model indicates whether the first image has more, less, or the same amount of an attribute as the second image. Using this model, we reproduce the binary tree EER-based approach of [16]. We initialize the active image search model using the query image provided at test time. The remaining implementation details follow [16]. We shall refer to this reproduction as Attribute Pivots henceforth.

Evaluation Metric. We evaluate performance based on how many iterations were required to go from the initial query to the target image.

Results. We report our active image search performance in Table 3. As seen in the top two lines of the table, using the feedback constraint satisfaction criterion reduces the number of iterations required to find a target image by about 2 over the nearest neighbors baseline. Refining the top 4 candidates using expected error reduction reduces the number of steps by a further 1.5, and our DQN refinement reduces this further still, bringing the total reduction close to 5 iterations relative to the baseline. It is important to note that this would be considered the toughest setting for this task. In practice, a user could remove many images from consideration by filtering on the meta-data labels.

Selection Strategy                #Steps
Nearest Neighbor                  26.40
Feedback Constraint Satisfaction  24.62
Expected Error Reduction          23.07
DQN Refinement                    21.79
Table 3: Active image search performance on UT Zap-50K.

We provide examples of the images selected by our system as we refine image search results in Figure 6. In the first row, we see how the boots change in style on each iteration, deciding on the heel first before refining the style of the boot. The second row demonstrates how our system is capable of even changing the category of the shoe, traversing from a boot to a sneaker. The third row shows how the system switched from changing the style of shoe to the type of closing mechanism before locating the target image, with the last two rows demonstrating how our system can handle even relatively fine-grained differences between the initial query and the target image.

A good search refinement algorithm need not produce the exact target image in the refinement stage, but simply obtain enough information that the image ranks sufficiently high in the search results. To this end, we provide the rank of the target image per iteration in Figure 7(a) with a comparison to our implementation of the Attribute Pivots approach [16], which takes advantage of relative attribute annotations. Here we see that the two methods perform comparably despite our method using only binary attribute labels. It is important to note that Attribute Pivots was more consistent, as shown by the lower per-iteration rank standard deviation seen in Figure 7(b), even if its average performance was slightly lower than ours. This may be due to the limited number of relative attributes available in the UT Zap-50K dataset, which suggests it would be beneficial to further explore the trade-off between annotation cost and performance in future work.

4.3 OSR Experiments

To demonstrate our approach’s ability to generalize, we provide experiments on the Outdoor Scene Recognition (OSR) dataset [28]. This dataset consists of 2688 images with six attributes annotated for the eight scene categories. We randomly sampled 400 images for the test set (50/category), 160 images for the validation set (20/category), and used the rest for training. To train our representation, we randomly sampled 100K triplets for training, 40K for testing, and 20K for validation. In our active image search experiments we sampled 16,000 pairs for training, 4,000 pairs for testing, and 800 pairs for validation. All other settings are the same as those used for the UT Zap-50K dataset.
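The per-category split described above can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the function name, the fixed seed, and in particular the equal category sizes (336 images each, chosen only so that the total is 2688) are assumptions made for the example.

```python
import random


def split_per_category(labels, n_test=50, n_val=20, seed=0):
    """Randomly assign n_test and n_val images per category; the rest go to training."""
    rng = random.Random(seed)
    by_category = {}
    for index, category in enumerate(labels):
        by_category.setdefault(category, []).append(index)

    train, val, test = [], [], []
    for category in sorted(by_category):
        indices = by_category[category]
        rng.shuffle(indices)
        test.extend(indices[:n_test])
        val.extend(indices[n_test:n_test + n_val])
        train.extend(indices[n_test + n_val:])
    return train, val, test


# Placeholder labels: 8 categories of equal size (an assumption; the real
# OSR categories are not necessarily balanced).
labels = [category for category in range(8) for _ in range(336)]
train, val, test = split_per_category(labels)
```

With 2688 images in total, this leaves 2688 - 400 - 160 = 2128 images for training.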

Results. As seen in our attribute experiments in Table 4, our additional constraints and global similarity enhancements provide a 1.5% and 2.5% improvement over the baseline, respectively. Our results on the active image search task in Table 5 also follow the results on UT Zap-50K, where our DQN refinement strategy outperforms the EER alternative as well as the feedback constraint satisfaction and nearest neighbor baselines. Although the OSR dataset comes from a very different domain than UT Zap-50K, our model still provides a performance improvement over prior work.

Method Accuracy (%)
CSN 96.84
CSN + constraints 98.58
CSN + constraints + global similarity 99.42
Table 4: Triplet satisfaction rate on the OSR dataset.

Selection Strategy #Steps
Nearest Neighbor 6.21
Feedback Constraint Satisfaction 5.57
Expected Error Reduction 4.92
DQN Refinement 4.54
Table 5: Active image search performance on the OSR dataset.

5 Conclusion

In this paper, we addressed the problem of active image search, but without expensive annotations or user requirements used in prior work. Instead, we introduced enhancements to the Conditional Similarity Network which improved its ability to make relative attribute comparisons. We used this representation in our experiments on active image search where we demonstrated the effectiveness of our DQN selection criterion and showed it was competitive with prior work which used expensive relative attribute annotations. In future work, we would like to build upon our current system by taking advantage of a hierarchical clustering method to organize our data, which has proven useful in prior work [16, 23]. Our model could also benefit from taking into account the diversity of selected images on each iteration by incorporating elements used in batch mode active learning approaches (e.g. [9, 13, 19]).


  7. The closure attribute embedding is also provided in Figure 5(a) of Veit et al. [39], which mixes sandals, heels, and slippers in the same local space when not encouraging globally consistent embeddings.
  8. We don’t include a comparison to Attribute Pivots [16] in Table 3 as it regularly satisfied its stopping criterion (i.e. it was no longer able to improve its model) before finding the target image. Altering the stopping criterion so that it would only stop when it located the target image resulted in poor performance on this task.


  1. Y. Baram, R. El-Yaniv, and K. Luz. Online choice of active learning algorithms. JMLR, 5:255–291, 2004.
  2. S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie. Visual recognition with humans in the loop. In ECCV, 2010.
  3. Y. Cao, H. Wang, C. Wang, L. Zhang, L. Zhang, and Z. Li. Mindfinder: Interactive sketch-based image search on millions of images. In ACM Multimedia, 2010.
  4. L. Chen, Q. Zhang, and B. Li. Predicting multiple attributes via relative multi-task learning. In CVPR, 2014.
  5. O. Chum, J. Philbin, and J. Sivic. Total recall: Automatic query expansion with a generative feature model for object retrieval. In ICCV, 2007.
  6. I. J. Cox, M. L. Miller, T. P. Minka, T. V. Papathomas, and P. N. Yianilos. The Bayesian image retrieval system, PicHunter: Theory, implementation, and psychophysical experiments. IEEE Transactions on Image Processing, 9(1), 2000.
  7. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  8. S. Ebert, M. Fritz, and B. Schiele. RALF: A reinforced active learning formulation for object class recognition. In CVPR, 2012.
  9. E. Elhamifar, G. Sapiro, A. Yang, and S. Shankar Sastry. A convex optimization framework for active learning. In ICCV, 2013.
  10. M. Ferecatu and D. Geman. Interactive search for image categories by mental matching. In ICCV, 2007.
  11. J. Fogarty, D. Tan, A. Kapoor, and S. Winder. Cueflik: Interactive concept learning in image search. In CHI, 2008.
  12. Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, 2011.
  13. Y. Guo and D. Schuurmans. Discriminative batch mode active learning. In NIPS, 2008.
  14. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  15. R. He, C. Packer, and J. McAuley. Learning compatibility across categories for heterogeneous item recommendation. In ICDM, 2016.
  16. A. Kovashka and K. Grauman. Attribute pivots for guiding relevance feedback in image search. In ICCV, 2013.
  17. A. Kovashka, D. Parikh, and K. Grauman. Whittlesearch: Interactive image search with relative attribute feedback. IJCV, 115(2):185–210, 2015.
  18. A. Kovashka, S. Vijayanarasimhan, and K. Grauman. Actively selecting annotations among objects and attributes. In ICCV, 2011.
  19. A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3), 2012.
  20. S. Lad and D. Parikh. Interactively guiding semi-supervised clustering via attribute-based explanations. In ECCV, 2014.
  21. B. Li, E. Chang, and C.-S. Li. Learning image query concepts via intelligent sampling. In ICME, 2001.
  22. C. Long and G. Hua. Multi-class multi-annotator active learning with robust Gaussian process for visual recognition. In ICCV, 2015.
  23. O. Mac Aodha, N. D. Campbell, J. Kautz, and G. J. Brostow. Hierarchical subquery evaluation for active learning on a graph. In CVPR, 2014.
  24. S. D. MacArthur, C. E. Brodley, and C.-R. Shyu. Relevance feedback decision trees in content-based image retrieval. In IEEE Workshop on Content-Based Access of Image and Video Libraries, 2000.
  25. T. Mensink, J. Verbeek, and G. Csurka. Learning structured prediction models for interactive image labeling. In CVPR, 2011.
  26. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
  27. B. Modi and A. Kovashka. Confidence and diversity for active selection of feedback in image retrieval. In BMVC, 2017.
  28. A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42:145–175, 2001.
  29. T. Osugi, D. Kun, and S. Scott. Balancing exploration and exploitation: A new algorithm for active machine learning. In ICDM, 2005.
  30. D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
  31. J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, 1999.
  32. N. Rasiwasia, P. J. Moreno, and N. Vasconcelos. Bridging the gap: Query by semantic example. IEEE Transactions on Multimedia, 9(5):923–938, 2007.
  33. N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of error reduction. In ICML, 2001.
  34. F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
  35. K. K. Singh and Y. J. Lee. End-to-end localization and ranking for relative attributes. In ECCV, 2016.
  36. Y. Souri, E. Noury, and E. Adeli. Deep relative attributes. In ACCV, 2016.
  37. S. Tong and E. Chang. Support vector machine active learning for image retrieval. In ACM Multimedia, 2001.
  38. L. van der Maaten and G. E. Hinton. Visualizing high-dimensional data using t-SNE. JMLR, 9:2579–2605, 2008.
  39. A. Veit, S. Belongie, and T. Karaletsos. Conditional similarity networks. In CVPR, 2017.
  40. Y. Wang, S. Wang, J. Tang, H. Liu, and B. Li. PPP: Joint pointwise and pairwise image label prediction. In CVPR, 2016.
  41. K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
  42. P.-Y. Yin, B. Bhanu, K.-C. Chang, and A. Dong. Reinforcement learning for combining relevance feedback techniques. In ICCV, 2003.
  43. P.-Y. Yin, B. Bhanu, K.-C. Chang, and A. Dong. Integrating relevance feedback techniques for image retrieval using reinforcement learning. TPAMI, 27(10):1536–1551, 2005.
  44. A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014.
  45. A. Yu and K. Grauman. Just noticeable differences in visual attributes. In ICCV, 2015.
  46. A. Yu and K. Grauman. Semantic jitter: Dense supervision for visual comparisons via synthetic images. In ICCV, 2017.
  47. Q. Yu, F. Liu, Y.-Z. Song, T. Xiang, T. Hospedales, and C. C. Loy. Sketch me that shoe. In CVPR, 2016.
  48. H. Zhang, Z.-J. Zha, S. Yan, J. Bian, and T.-S. Chua. Attribute feedback. In ACM Multimedia, 2012.
  49. B. Zhao, J. Feng, X. Wu, and S. Yan. Memory-augmented attribute manipulation networks for interactive fashion search. In CVPR, 2017.
  50. X. Zhou and T. Huang. Relevance feedback in image retrieval: A comprehensive review. ACM Multimedia Systems, 2003.