Structured Visual Search via Composition-aware Learning
This paper studies visual search using structured queries. The structure takes the form of a 2D composition that encodes the position and the category of the objects. Transforming the position or the category of the objects induces a continuous-valued relationship between visual compositions. This relationship carries useful information that previous techniques do not leverage. Our goal is to exploit these continuous relationships through the notion of symmetry in equivariance. The model output is trained to change symmetrically with respect to the input transformations, leading to a sensitive feature space. The result is an efficient search technique, as our approach learns from less data using a smaller feature space. Experiments on two large-scale benchmarks, MS-COCO and HICO-DET, demonstrate that our approach yields a considerable performance gain over competing techniques.
Visual image search is a core problem in computer vision, with many applications, such as organizing photo albums, online shopping, or even robotics [37, 2]. Two popular means of searching for images are text-to-image [25, 5] and image-to-image [40, 53] search. While simple, text-based search can be limited in representing the intent of the user, especially for the spatial interactions of objects. Image-based search can represent spatial interactions, but an exemplar query may not be at hand. Because of these limitations, we focus on a structured visual search problem: compositional visual search.
Composition is one of the key elements in photography. It is the spatial arrangement of the objects within the image plane. Composition therefore offers a natural way to interact with large image databases. For example, a large stock image company already offers tools that let users find images in its database by composing a query. The users compose an abstract 2D image query in which they arrange the location and the category of the objects of interest; see Figure 1.
Compositional visual search was initially tackled as a learning problem, and more recently with deep Convolutional Neural Networks (CNNs). Mai et al. treat the problem as a visual feature synthesis task: they learn to map a given 2D query canvas to a high-dimensional feature representation using binary metric learning, and this representation is then used for querying the database. We identify the following limitations with this approach: i) The method requires a large-dimensional feature to account for the positional and categorical information of the input objects, limiting memory efficiency, especially when searching across large databases. ii) The method requires a large-scale dataset for training, limiting sample efficiency. iii) The method only considers binary relations between images, limiting compositional awareness. To overcome these limitations, we introduce composition-aware learning.
Compositional queries exhibit continuous-valued similarities to one another. Objects within the queries transform in two major ways: 1) their positions change (translational transformation), and 2) their categories change (semantic transformation); see Figure 1. Our composition-aware learning approach takes advantage of such transformations using the principle of equivariance; see Figure 2. Our formulation requires transformations within the input (query) space to have a symmetrical effect within the output (feature) space. To that end, we develop novel representations of the input and the output transformations, as well as a novel loss function to learn these transformations within a continuous range.
Our contributions are three-fold:
We introduce the concept of composition-aware learning for structured image search.
We illustrate that our approach is efficient both in feature-space and data-space.
2 Related Work
Compositional Visual Search. Visual search has mostly focused on text-to-image [32, 5, 4, 44, 47, 25] or image-to-image [12, 46, 17, 24, 42, 26, 11, 1, 40, 53, 41] search. Text-to-image search is limited in representing the user intent, and a visual query may not be available for image-to-image search. Recent variants also combine the compositional query with either text or an image. In this paper, we focus on compositional visual search [51, 36, 30]. A user composes an abstract 2D query representing the objects, their categories, and relative locations, which is then used to search over a potentially large database. A successful example is VisSynt, where the authors treat the task as a visual feature synthesis problem using a triplet loss function. This formulation is limited in the following ways: 1) VisSynt is high-dimensional in feature space, limiting memory efficiency; 2) VisSynt requires a large training set, limiting data efficiency; 3) VisSynt does not consider the compositional transformation between queries because of the binary nature of the triplet loss, limiting the generalization capability of the method. Inspired by the equivariance principle, we propose composition-aware learning to overcome these limitations, and we test its efficiency and accuracy on two well-established benchmarks, MS-COCO and HICO-DET.
Learning Equivariant Transformations. Equivariance is a principle of symmetry: any change within the input space leads to a symmetrical change within the output space. Such a formulation is highly beneficial, especially for model and data efficiency. In computer vision, equivariance is used to represent transformations such as object rotation [7, 49, 48], object translation [52, 50, 19, 31], or discrete motions [15, 16]. Our composition-aware learning approach is inspired by these works, as we align the continuous transformation between the input (query) and output (feature) spaces; see Figure 2.
Continuous Metric Learning. Continuous metric learning takes into account the continuous transformations between image instances [23, 34, 22], since such relationships cannot be modeled with conventional metric learning techniques [6, 14]. Recently, Kim et al. proposed LogRatio, a loss function that matches the relative ratio of the input similarities with the output feature similarities. It yields a significant gain over competing methods for pose and image caption search. Since compositional visual search is a continuous-valued problem, we bring LogRatio to this problem as a strong baseline. For accurate estimation, LogRatio intrinsically assumes a dense set of relevant images given an anchor point. However, compositional visual search follows a Zipf distribution, where only a few images are relevant to a given query, which limits LogRatio's performance.
3 Composition-aware Learning
Our method consists of three building blocks:
Composition-aware transformation that computes the transformations in the input and output space,
Composition-aware loss function that updates the network parameters according to the divergence of input-output transformations,
Composition-equivariant CNN, used as the backbone to learn the transformation.
Method Overview. An overview of our method is provided in Figure 3. Our method takes as input a 2D compositional query defined on a canvas of a given height and width. This query contains a set of objects, along with their categories and positions (in the form of bounding boxes). Given a target dataset of images, the goal is to retrieve the top-k images that are most relevant to the query, i.e., relevant to both the objects and their positions. Each image is initially represented as a feature map taken from the last convolutional layer of an off-the-shelf, ImageNet pre-trained deep CNN, e.g., ResNet. Such a feature preserves the spatial information as well as the object category information within the image. Furthermore, we assume access to a tuple consisting of a compositional map, a CNN feature, and an image, where the compositional map is constructed using the object categories and bounding boxes of the query. In addition, we consider a transformed version of the query, with its own corresponding compositional map, CNN feature, and image. The transformation can correspond to a translation of object location(s) or a change in object categories. Our method trains a lightweight CNN by minimizing an objective function defined over such pairs.
The objective compares two quantities: the input transformation, measured between the two compositional maps, and the output transformation, measured between the two output feature maps. The composition-aware loss function measures the discrepancy between these two transformations. In the following, we first describe the compositional map and the input and output transformations. Then, we describe the composition-aware loss function. Finally, we describe our CNN architecture that learns the mapping. From now on, we omit the network parameters from the notation for the sake of clarity.
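The objective itself did not survive in this copy of the text. A minimal sketch consistent with the description above is given below. Every symbol in it is an assumed placeholder rather than the authors' notation: $c, c'$ stand for the two compositional maps, $x, x'$ for the corresponding pre-trained CNN features, $f_\theta$ for the trained network with parameters $\theta$, $T_{i}$ and $T_{o}$ for the input and output transformations, and $\mathcal{L}$ for the composition-aware loss.

```latex
\min_{\theta} \; \mathcal{L}\Big( T_{i}(c, c'),\; T_{o}\big(f_\theta(x),\, f_\theta(x')\big) \Big)
```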
3.1 Composition-aware Transformation
The goal of the composition-aware transformation is to quantify, as a continuous value, the amount of transformation between the input compositions and between the output feature maps. To do so, we first construct compositional maps from the input user queries, then measure the input transformation using these maps, and finally describe the output transformation.
Constructing the compositional map. First, given a user query that reflects the category and the position of the objects, we create a one-hot binary feature map whose size is given by the spatial dimensions of the composition map and the number of object categories. In this map, only the entries at the corresponding object locations and categories are set to 1, and all others are set to 0. This simple map encodes both the positional and categorical information of the input composition, which we then use to measure the transformation within the input space. We apply the same procedure to the transformed query, which yields a second compositional map. Given this pair of compositional maps, we can quantify the input transformation.
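The construction above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name, the box format (grid coordinates with exclusive upper bounds), and the toy grid size are hypothetical, since the source does not give the map dimensions.

```python
def build_composition_map(objects, num_categories, height, width):
    """Return a binary map indexed as [category][row][col].

    `objects` is a list of (category_index, x0, y0, x1, y1) boxes in grid
    coordinates, with x1 and y1 exclusive. The names and the box format are
    hypothetical; the source does not specify them.
    """
    cmap = [[[0] * width for _ in range(height)] for _ in range(num_categories)]
    for category, x0, y0, x1, y1 in objects:
        # Clip the box to the canvas, then switch on the cells it covers
        # in the plane of its own category only.
        for row in range(max(y0, 0), min(y1, height)):
            for col in range(max(x0, 0), min(x1, width)):
                cmap[category][row][col] = 1
    return cmap


# Toy query: two objects of different categories on a 4x4 canvas.
query = [(0, 0, 0, 2, 2), (1, 2, 1, 4, 3)]
cmap = build_composition_map(query, num_categories=2, height=4, width=4)
```

Each category gets its own plane, so two overlapping objects of different categories never share a non-zero cell.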
Input transformation. Our goal is then to measure the similarity between these two compositions as a ratio of two pixel counts, using an indicator function that is 1 only for non-zero pixels. The numerator counts the intersection of the same-category object locations, and the denominator counts their union. The output is in the range [0, 1]: it is 1 if the two compositions are identical in terms of object locations and categories, and 0 if no objects share the same location. The value changes smoothly with the translation of the input objects in the compositions. Given the input transformation, we now need to compute the output transformation, which will then be correlated with the changes within the input space.
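As an illustration of the intersection-over-union reading of this measure, the sketch below counts cells that are non-zero in both maps and cells that are non-zero in either map. The function name and the nested-list layout ([category][row][col]) are assumptions made for the example.

```python
def input_transformation(cmap_a, cmap_b):
    """Intersection-over-union of two binary maps laid out as [category][row][col]."""
    intersection = 0
    union = 0
    for plane_a, plane_b in zip(cmap_a, cmap_b):
        for row_a, row_b in zip(plane_a, plane_b):
            for a, b in zip(row_a, row_b):
                intersection += 1 if (a and b) else 0  # same category, same cell
                union += 1 if (a or b) else 0
    return intersection / union if union else 0.0


# One category on a 1x4 grid: a two-cell object shifted right by one cell.
original = [[[1, 1, 0, 0]]]
shifted = [[[0, 1, 1, 0]]]
print(input_transformation(original, original))  # identical compositions
print(input_transformation(original, shifted))   # partial overlap
```

Because the planes are compared category by category, an object only contributes to the intersection when both its location and its category match.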
Output transformation. The output transformation is computed as the dot product between the two output features, i.e., the transpose of one output feature multiplied by the other. We choose the dot product due to its simplicity and convenience in a visual search setting. The output transformation can take arbitrary real values. In the following, we describe how to bound these values and measure the discrepancy between the input and output transformations.
3.2 Composition-aware Loss
Given the input-output transformations, we can now compute their discrepancy to update the parameters of the network. A naive way to implement this would be to minimize the Euclidean distance between the input transformation and the output transformation, after passing the latter through an exponential non-linearity that bounds it. However, such a function generates unbounded gradients, leading to instabilities during training and reducing performance, as we show through our experiments. Cross entropy, in contrast, is a stable and widely used function for updating network weights. However, cross entropy only considers binary labels, whereas in our case the transformation values vary continuously. We therefore derive a new loss function, inspired by cross entropy, that can also handle in-between values.
Our goal is to maximize the correlation between the input and output transformations.
Equivalently, and more conveniently, we can minimize the negative of this expression.
The divergence between the input and output transformations at the beginning of training leads to instabilities. To overcome this, we add regularization through two additional terms.
These two terms penalize large values of the output transformation at the beginning of training, reducing its divergence from the input transformation. To further avoid overflow, the regularizer terms are rewritten in a numerically stable final form.
This is the final expression for the composition-aware loss, which we use throughout the training of our network.
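The equations of this derivation are not recoverable from this copy of the text. One loss that matches the verbal description (a negative correlation term, plus regularizing terms written in an overflow-safe form) is the numerically stable sigmoid cross-entropy with a soft target. The sketch below shows that loss as an assumption, not as the authors' exact expression; the function names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two flattened feature vectors (the output transformation)."""
    return sum(a * b for a, b in zip(u, v))


def soft_target_loss(t_in, t_out):
    """Stable sigmoid cross-entropy with a soft target t_in in [0, 1].

    t_out is an unbounded score, here a dot product of two features.
    -t_in * t_out rewards agreement between the two transformations,
    max(t_out, 0) penalizes large positive scores, and
    log1p(exp(-|t_out|)) is a bounded correction term that never
    exponentiates a positive number, so it cannot overflow.
    """
    return max(t_out, 0.0) - t_in * t_out + math.log1p(math.exp(-abs(t_out)))


features_a = [0.5, -1.0, 2.0]
features_b = [1.0, 0.5, 0.25]
loss = soft_target_loss(0.7, dot(features_a, features_b))
```

Under this assumed form, the loss for a fixed target is minimized when the sigmoid of the output score equals the input transformation, which is one way to obtain a bounded output that tracks in-between target values.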
3.3 Composition-Equivariant Backbone
Our model is a lightweight CNN that maps the bottleneck representation obtained from the pre-trained ResNet to a feature map with a smaller channel dimension and the same spatial size. The first two convolutions share one kernel size, whereas the last layer uses another. We choose the stride and apply zero-padding so as to preserve the spatial dimensions, which are crucial for our task. Between layers we use a non-linearity with a slope parameter, batch normalization, and dropout. We do not apply batch normalization, dropout, or the non-linearity at the output layer, as this leads to inferior results.
Since our goal is to preserve positional and categorical information, a network with standard layers may not be a proper fit. Convolution and pooling operations in standard networks have been shown to lack translation (shift) equivariance, contrary to wide belief. We therefore use a previously suggested anti-aliasing trick to preserve shift equivariance throughout our network. Specifically, before computing each convolution, we apply a Gaussian blur to the feature map. This simple operation helps to keep translation information within the network layers.
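The blur step can be illustrated on a single feature-map channel. The sketch below is a plain-Python stand-in, assuming a 3x3 binomial kernel as the Gaussian approximation and zero padding at the borders; the source states neither the kernel size nor its standard deviation, so both are assumptions.

```python
def blur2d(feature_map):
    """Apply a 3x3 binomial (Gaussian-like) blur with zero padding.

    The output has the same height and width as the input, so the
    spatial layout of the feature map is preserved.
    """
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # weights sum to 16
    height, width = len(feature_map), len(feature_map[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Cells outside the map count as zero (zero padding).
                    if 0 <= rr < height and 0 <= cc < width:
                        total += kernel[dr + 1][dc + 1] * feature_map[rr][cc]
            out[r][c] = total / 16.0
    return out


# A single activation spreads to its neighbours after blurring.
impulse = [[0, 0, 0], [0, 16, 0], [0, 0, 0]]
blurred = blur2d(impulse)
```

In a network, the same low-pass filter would be applied per channel before each convolution, for example as a fixed depthwise convolution.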
4 Experimental Setup
Constructing Queries. To evaluate our method objectively, without relying on user queries and studies, we use large-scale benchmarks with bounding box annotations. We evaluate our method on MS-COCO and HICO-DET. Training is conducted only on MS-COCO. Given an image, we select a limited number of objects based on their area, following prior practice.
MS-COCO. MS-COCO is a large-scale object detection benchmark. Its object categories include animals (e.g., dog, cat, zebra, horse) and household objects. The dataset contains training and validation images. We split the training set into two mutually exclusive random sets of training and gallery images. The number of objects varies from image to image.
HICO-DET. HICO-DET is a large-scale human-object interaction detection benchmark [3, 20]. HICO-DET builds upon the MS-COCO object categories and collects interactions for different verbs, such as ride, hold, eat, or jump, yielding unique <verb, noun> combinations. Interactions exhibit fine-grained spatial configurations, which makes the dataset a challenging test for compositional search. The dataset includes training and testing images. The training images are used as the gallery set, and the testing set is used as the query set. A distinctive property of the dataset is that a number of interactions have only a few examples in the training set, which means a query can match only very few images within the gallery set, leading to a challenging visual search setup. HICO-DET is only used for evaluation.
4.2 Evaluation Metrics
We evaluate the performance of the proposed model with three metrics. The first is the standard mean Average Precision (mAP) metric, as used in VisSynt. We also borrow the continuous Normalized Discounted Cumulative Gain (cNDCG) and mean Relevance (mREL) metrics used in the continuous metric learning literature [23, 34, 22]. All metric values are based on the mean Intersection-over-Union (mIOU) scores between a query and all gallery images, described below. For all three metrics, higher indicates better performance.
To measure the relevance between a query and a retrieved image, we resort to mean Intersection-over-Union, following prior practice. Concretely, the relevance is computed over all the available objects in the query and in the retrieved image: an indicator function checks whether a query object and a retrieved object are from the same class, and it is multiplied by the intersection-over-union between the two regions. This way, the metric measures both the spatial and semantic localization of the query objects.
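The relevance formula is missing from this copy, so the sketch below fixes one plausible reading as an explicit assumption: each query object is matched to its best-overlapping object of the same class in the retrieved image, and the scores are averaged over the query objects. The function names and the (x0, y0, x1, y1) box format are hypothetical.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def relevance(query_objects, image_objects):
    """Mean over query objects of the best same-class IoU in the image.

    Objects are (class_name, box) pairs. A query object with no
    same-class counterpart contributes 0.
    """
    if not query_objects:
        return 0.0
    total = 0.0
    for q_class, q_box in query_objects:
        same_class = (box_iou(q_box, box) for cls, box in image_objects if cls == q_class)
        total += max(same_class, default=0.0)
    return total / len(query_objects)


query = [("dog", (0, 0, 2, 2))]
image = [("dog", (1, 0, 3, 2)), ("cat", (0, 0, 2, 2))]
score = relevance(query, image)
```

In this example the cat box overlaps the query perfectly but is ignored because its class differs, so the score reflects only the partially overlapping dog.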
mAP. Based on the relevance score, we use mean Average Precision to measure the retrieval performance. We first apply a heuristic relevance threshold, as recommended in prior work, to convert continuous relevance values to discrete labels. Then, we measure the mAP values.
The mAP metric does not respect the continuous nature of compositional visual search, since it binarizes continuous relevance values with a heuristic threshold. To that end, we resort to two additional metrics, a continuous adaptation of NDCG and mean relevance, which are used to evaluate continuous-valued metric learning techniques in [23, 34, 22].
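The thresholding step and the resulting average precision can be sketched as follows. The threshold value used in the example is hypothetical, because the value recommended in prior work is not preserved in this copy. The sketch also assumes that average precision is normalized by the number of relevant items within the ranked list.

```python
def average_precision(ranked_relevance, threshold):
    """Average precision of one ranked list after binarizing at `threshold`."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel >= threshold:  # continuous relevance -> binary label
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_relevance, threshold):
    """Mean of the per-query average precision values."""
    scores = [average_precision(r, threshold) for r in all_ranked_relevance]
    return sum(scores) / len(scores)


# Two queries, each with the relevance of its ranked results (hypothetical values).
rankings = [[0.9, 0.1, 0.6], [0.1, 0.2]]
map_score = mean_average_precision(rankings, threshold=0.5)
```

The example makes the limitation of the metric visible: relevance values of 0.6 and 0.9 are treated identically once they pass the threshold.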
cNDCG. We make use of the continuous adaptation of the Normalized Discounted Cumulative Gain, which takes into account both the rank and the scores of the retrieved images as well as the ground-truth relevance scores. In our experiments, we report cNDCG.
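Since the cNDCG expression is missing here, the following sketch uses one common form of NDCG with continuous relevance values, with an exponential gain and a logarithmic rank discount. Both the gain function and the cut-off `k` are assumptions for illustration and may differ from the authors' definition.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with gain 2^rel - 1 and a log2 rank discount."""
    return sum(
        (2.0 ** rel - 1.0) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def cndcg_at_k(retrieved_relevance, gallery_relevance, k):
    """DCG of the top-k retrieved items divided by the best achievable DCG.

    `retrieved_relevance` lists ground-truth relevance in retrieval order;
    `gallery_relevance` lists the relevance of every gallery image.
    """
    ideal = sorted(gallery_relevance, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(retrieved_relevance[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


gallery = [0.5, 0.8, 0.1]
perfect = cndcg_at_k([0.8, 0.5], gallery, k=2)
swapped = cndcg_at_k([0.5, 0.8], gallery, k=2)
```

Unlike the thresholded mAP, swapping two relevant results of different relevance lowers the score here.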
mREL. mREL measures the mean of the relevance scores of the retrieved images per query, which is then averaged over all queries. In our experiments, we report mREL. We also note the oracle performance where we assume access to the ground truth mIOU values to illustrate the upper bound in the performance.
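A minimal sketch of mREL and of its oracle upper bound is given below. The cut-off `k` and the function names are hypothetical, and the sketch assumes every query retrieves at least `k` images.

```python
def mean_relevance_at_k(per_query_relevance, k):
    """Mean relevance of the top-k retrieved images, averaged over queries."""
    per_query = [sum(rels[:k]) / k for rels in per_query_relevance]
    return sum(per_query) / len(per_query)


def oracle_mean_relevance_at_k(per_query_gallery_relevance, k):
    """Upper bound: rank each gallery by its ground-truth relevance first."""
    ideal = [sorted(rels, reverse=True) for rels in per_query_gallery_relevance]
    return mean_relevance_at_k(ideal, k)


# Relevance of the ranked results for two queries (hypothetical values).
retrieved = [[0.8, 0.4, 0.0], [0.2, 0.2, 0.9]]
mrel = mean_relevance_at_k(retrieved, k=2)
oracle = oracle_mean_relevance_at_k(retrieved, k=2)
```

The gap between the two values shows how much of the achievable relevance a ranking leaves unused.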
4.3 Performance Comparison
ResNet. We use the activations from the last convolutional stage of ResNet to retrieve images. We build upon this feature since it captures the object semantics and positions within its spatial feature map. We also experimented with earlier layers; however, we found that this layer performs best. The network is pre-trained on ImageNet.
Textual. We assume access to the ground truth object labels for a query and retrieve images that contain the same set of objects. This acts as a textual query baseline and is blind to the spatial information.
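A sketch of this baseline is shown below, assuming that "the same set of objects" means the image contains every category named in the query. The exact matching rule is not stated, so a strict set-equality variant would be equally plausible. The gallery layout and identifiers are hypothetical.

```python
def textual_retrieve(query_categories, gallery):
    """Return ids of gallery images that contain every queried category.

    `gallery` maps an image id to the list of object categories it contains.
    Object positions are ignored entirely.
    """
    wanted = set(query_categories)
    return [image_id for image_id, cats in gallery.items() if wanted <= set(cats)]


gallery = {
    "img_a": ["person", "bench"],
    "img_b": ["person", "bench", "dog"],
    "img_c": ["person", "skateboard"],
}
matches = textual_retrieve(["person", "bench"], gallery)
```

Both matching images are returned regardless of where the person and the bench appear, which is what makes this baseline blind to spatial information.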
VisSynt. This baseline uses a triplet loss formulation coupled with a classification loss to perform compositional visual search. For a fair comparison, we train this baseline with the same backbone architecture and the same target ResNet feature.
LogRatio. This method is the state-of-the-art technique in continuous metric learning, originally evaluated on human pose and image caption retrieval. We bring this technique as a strong baseline since the visual composition space also exhibits continuous relationships. We use the authors' code.
Implementation details. We use PyTorch to implement our method. We use the same backbone and the same input feature (ResNet) for all baselines. All models are trained using SGD with momentum. The initial learning rate is decayed exponentially at every epoch, and we use weight decay for regularization. In practice, we compute input-output transformations between all examples within the batch to get the most out of each batch. For each query in the batch, we sample both highly relevant and less relevant examples, which increases the effective batch size.
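Computing transformations between all examples in a batch amounts to evaluating a measure on every unordered pair. The sketch below illustrates the idea with a hypothetical stand-in measure (the Jaccard overlap of category sets). In the setting described here, the measure would instead be the input or the output transformation.

```python
from itertools import combinations


def pairwise_transformations(batch, measure):
    """Return {(i, j): measure(batch[i], batch[j])} for all pairs with i < j."""
    return {
        (i, j): measure(batch[i], batch[j])
        for i, j in combinations(range(len(batch)), 2)
    }


def jaccard(a, b):
    """Stand-in measure: overlap of two category sets (hypothetical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


batch = [{"dog", "person"}, {"dog"}, {"cat"}]
pairs = pairwise_transformations(batch, jaccard)
```

A batch of n examples yields n(n-1)/2 pairs, so each batch provides far more training signal than its nominal size suggests.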
5 Experiments
In this section, we present our experiments. For the first two experiments, we use all three metrics. For the third experiment, the state-of-the-art comparison, we report performance at several values of k.
5.1 Ablation of Composition-aware Learning
We observe that the composition-aware loss outperforms the Euclidean alternative by a large margin, confirming the effectiveness of the proposed loss function.
Lingual vs. Visual transformation. In our second ablation study, we test the domain of the input transformation (Eq. 2). We propose a visual input transformation, whereas VisSynt utilizes a lingual input transformation based on semantic Word2vec embeddings. As can be seen from Table 2, the vision-based transformation outperforms its lingual counterpart, since it better encodes the relationships within the visual world.
5.2 Feature and Data Efficiency
In this experiment, we test the efficiency. Specifically, we first test the feature-space efficiency to see how the performance changes with varying sizes of the query embedding. Second, we test the data-space efficiency by sub-sampling the training data.
Feature-space efficiency. We change the feature embedding size by varying the number of channels while keeping the spatial dimensions fixed. We compare our approach to VisSynt and LogRatio. The results can be seen in Figure 4.
As can be seen, our approach performs best for all metrics and across all feature sizes. This indicates that composition-aware learning is effective even when the feature size is compact. Another observation is that the performance of our approach increases with the feature size, whereas the performance of the two other techniques is lower. This indicates that our approach can leverage bigger feature sizes, while the other objectives tend to over-fit.
We conclude that composition-aware learning is a feature-efficient approach for compositional visual search.
Data-space Efficiency. In this experiment, we vary the amount of training data. The results can be seen in Figure 5.
Our method performs best regardless of the training set size. The performance gap is even more significant when the training set is highly limited, confirming the data efficiency of the proposed approach.
We conclude that composition-aware learning can learn more from fewer examples by leveraging the continuous-valued transformations and the regularized loss function.
5.3 Comparison with the State-of-the-Art
As can be seen, our method outperforms the compared baselines on both datasets and on all three metrics. This confirms the effectiveness of composition-aware learning for object (MS-COCO) and object-interaction (HICO-DET) search. The results on HICO-DET are much lower than on MS-COCO since 1) HICO-DET has a higher number of query images, 2) many queries have only a few relevant images within the gallery set (as can be seen from the low oracle mREL in Figure 7), and 3) no training is conducted on HICO-DET, so the results reflect the transfer-learning abilities of the evaluated techniques.
Qualitative analysis. Lastly, we showcase a few qualitative examples in Figure 8. First, as a sanity check, we illustrate single object queries (stop signs). As can be seen, our method successfully retrieves images relevant to the query category and the position. Then, we illustrate some object-interaction examples, such as human-on-bench, or human-with-tennis racket, or human-on-skateboard. Our model can still generalize to such examples, meaning that compositional learning benefits the case of the object interaction. We illustrate a failure case in the last row, where our model retrieves a mix of snowboard-skateboard objects given the query of a skateboard. This indicates that our model performance can be improved by incorporating scene context, which we leave as future work.
6 Conclusion
In this work, we tackled a structured visual search problem called compositional visual search. Our approach is based on the observation that visual compositions are continuous-valued transformations of each other, carrying rich information. Such transformations mainly consist of the positional and categorical changes within the queries. To leverage this information, we proposed composition-aware learning, which consists of a representation of the input-output transformations as well as a new loss function to learn these transformations. Our experiments reveal that defining the transformations within the visual domain is more useful than defining them within the lingual domain. A regularized loss function is also necessary to learn such transformations. Leveraging transformations with this loss function increases feature and data efficiency, and outperforms existing techniques on MS-COCO and HICO-DET. We hope that our work will inspire further research on incorporating structure into structured visual search problems.
References
[1] (2015) Aggregating local deep features for image retrieval. In ICCV.
[2] (2017) Efficient retrieval of arbitrary objects from long-term robot observations. RAS.
[3] (2018) Learning to detect human-object interactions. In WACV.
[4] (2019) See-through-text grouping for referring image segmentation. In ICCV.
[5] (2019) Cross-modal image-text retrieval with semantic consistency. In ACM MM.
[6] (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR.
[7] (2016) Group equivariant convolutional networks. In ICLR.
[8] (2009) Imagenet: a large-scale hierarchical image database. In CVPR.
[9] (2020) Spin weighted spherical cnns. arXiv preprint arXiv:2006.10731.
[10] (2019) Efficient and interactive spatial-semantic image retrieval. MTA.
[11] (2016) Deep image retrieval: learning global representations for image search. In ECCV.
[12] (2019) Towards optimal cnn descriptors for large-scale image retrieval. In ACM MM.
[13] (2016) Deep residual learning for image recognition. In CVPR.
[14] (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
[15] (2015) Learning image representations tied to ego-motion. In ICCV.
[16] (2016) Slow and steady feature analysis: higher order temporal coherence in video. In CVPR.
[17] (2017) Class-weighted convolutional features for visual instance search. arXiv preprint arXiv:1707.02581.
[18] (2015) Visual search at pinterest. In ACM SIGKDD.
[19] (2020) On translation invariance in cnns: convolutional layers can exploit absolute spatial location. In CVPR.
[20] (2020) Self-selective context for interaction recognition. arXiv preprint arXiv:2010.08750.
[21] (2020) Diagnosing rarity in human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 904–905.
[22] (2019) Deep metric learning beyond binary supervision. In CVPR.
[23] (2016) Thin-slicing for pose: learning to understand pose without explicit pose estimation. In CVPR.
[24] (2018) Dynamicity and durability in scalable visual instance search. arXiv preprint.
[25] (2019) Visual semantic reasoning for image-text matching. In ICCV.
[26] (2018) Nonlinear embedding neural codes for visual instance retrieval. Neurocomputing.
[27] (2019) Spherical regression: learning viewpoints, surface normals and 3d rotations on n-spheres. In CVPR.
[28] (2014) Microsoft coco: common objects in context. In ECCV.
[29] (2020) Spatial-content image search in complex scenes. In WACV.
[30] (2017) Spatial-semantic image search by visual feature synthesis. In CVPR.
[31] (2017) Rotation equivariant vector field networks. In ICCV.
[32] (2019) Howto100m: learning a text-video embedding by watching hundred million narrated video clips. In ICCV.
[33] (2013) Efficient estimation of word representations in vector space. arXiv preprint.
[34] (2015) Pose embeddings: a deep architecture for learning to match human poses. arXiv preprint.
[35] (2005) Power laws, pareto distributions and zipf’s law. Contemporary physics.
[36] (2019) Representation and retrieval of images by means of spatial relations between objects. In AAAI.
[37] (2015) Retrieving experience: interactive instance-based learning methods for building robot companions. In ICRA.
[38] (2017) Automatic differentiation in pytorch.
[39] (2015) Learning to see creatively: design, color, and composition in photography. Amphoto Books.
[40] (2016) CNN image retrieval learns from bow: unsupervised fine-tuning with hard examples. In ECCV.
[41] (2018) Fine-tuning cnn image retrieval with no human annotation. PAMI.
[42] (2016) Visual instance retrieval with deep convolutional networks. TMTA.
[43] (2003) How do people manage their digital photographs?. In SIGCHI.
[44] (2019) Adversarial representation learning for text-to-image matching. In ICCV.
[45] (2017) Shutterstock compositional visual search. https://www.shutterstock.com/blog/composition-aware-search-tool
[46] (2015) Attributes and categories for generic instance search from one example. In CVPR.
[47] (2019) Camp: cross-modal adaptive message passing for text-image retrieval. In ICCV.
[48] (2019) General e (2)-equivariant steerable cnns. In NeurIPS.
[49] (2018) Learning steerable filters for rotation equivariant cnns. In CVPR.
[50] (2018) Cubenet: equivariance to 3d rotation and translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 567–584.
[51] (2010) Image search by concept map. In SIGIR.
[52] (2019) Making convolutional networks shift-invariant again. ICML.
[53] (2017) SIFT meets cnn: a decade survey of instance retrieval. PAMI.