Affinity Graph Supervision for Visual Recognition

Abstract

Affinity graphs are widely used in deep architectures, including graph convolutional neural networks and attention networks. Thus far, the literature has focused on abstracting features from such graphs, while the learning of the affinities themselves has been overlooked. Here we propose a principled method to directly supervise the learning of weights in affinity graphs, to exploit meaningful connections between entities in the data source. Applied to a visual attention network [7], our affinity supervision improves relationship recovery between objects, even without the use of manually annotated relationship labels. We further show that affinity learning between objects boosts scene categorization performance and that the supervision of affinity can also be applied to graphs built from mini-batches, for neural network training. In an image classification task we demonstrate consistent improvement over the baseline, with diverse network architectures and datasets.

1 Introduction

Recent advances in graph representation learning have led to principled approaches for abstracting features from such structures. In the context of deep learning, graph convolutional neural networks (GCNs) have shown great promise [1, 13]. The affinity graphs in GCNs, whose nodes represent entities in the data source and whose edges represent pairwise affinity, are usually constructed from a predefined metric space and are therefore fixed during the training process [1, 13, 20, 28]. In related work, self-attention mechanisms [26] and graph attention networks [27] have been proposed. Here, using pairwise weights between entities, a fully connected affinity graph is used for feature aggregation. In contrast to the graphs in GCNs, the parametrized edge weights change during the training of the graph attention module. More recent approaches also consider elaborate edge weight parametrization strategies [11, 16] to further improve the flexibility of graph structure learning. However, the learning of edge (attention) weights in the graph is entirely supervised by a main objective loss, to improve performance in a downstream task.

Whereas representation learning from affinity graphs has demonstrated great success in various applications [7, 33, 29, 9, 8], little work has been done thus far to directly supervise the learning of affinity weights. In the present article, we propose to explicitly supervise the learning of the affinity graph weights by introducing a notion of target affinity mass, which is a collection of affinity weights that need to be emphasized. We further propose to optimize a novel loss function to increase the target affinity mass during the training of a neural network, to benefit various visual recognition tasks. The proposed affinity supervision method is generalizable, supporting flexible design of supervision targets according to the needs of different tasks. This feature is not seen in the related works, since the learning of such graphs is either constrained by distance metrics [11] or dependent on the main objective loss [26, 27, 16].

With the proposed supervision of the learning of affinity weights, a visual attention network [7] is able to compete in a relationship proposal task with the present state-of-the-art [32] without any explicit use of relationship labels. Enabling relationship labels provides an additional 25% boost over [32] in relative terms. This improved relationship recovery is particularly beneficial when applied to a scene categorization task, since scenes are composed of collections of distinct objects. We also explore the general idea of affinity supervised mini-batch training of a neural network, which is common to a vast number of computer vision and other applications. For image classification tasks we demonstrate a consistent improvement over the baseline, across multiple architectures and datasets. Our proposed affinity supervision method adds little computational overhead, since it does not introduce additional parameters.

2 Related Work

Figure 1: A comparison of recovered relationships on test images, with no relationship annotations used during training. We show the reference object (blue box), regions with which it learns relationships (orange boxes) and the relationship weights in red text (zoom in on the PDF). Left: baseline visual attention networks [7] often recover relationships between a reference object and its immediate surrounding context. Right: our proposed affinity supervision better emphasizes potential relationships between distinct and spatially separated objects.

2.1 Graph Convolutional Neural Networks

In GCNs layer-wise convolutional operations are applied to abstract features in graph structures. Current approaches build the affinity graph from a predefined input [1, 20, 28] or embedding space [5, 26], following which features are learned using graph based filtering in either the spatial or spectral domain. Little work has been carried out so far to directly learn the structure of the affinity graph itself. In this article, we propose a generic method for supervising the learning of pairwise affinities in such a graph, without the need for additional ground truth annotations.

2.2 Visual Attention Networks

Attention mechanisms, first proposed in [26], have been successfully applied to a diverse range of computer vision tasks [7, 33, 29]. In the context of object detection [7], the attention module uses learned pairwise attention weights between region proposals, followed by per region feature aggregation, to boost object detection. The learned attention weights do not necessarily reflect relations between entities in a typical scene. In fact, for a given reference object (region), relation networks [7] tend to predict high attention weights with scaled or shifted bounding boxes surrounding the same object instance (Figure 1).

A present limitation of visual attention networks is their minimization of only the main objective loss during training [7, 33, 29], without any direct supervision of attention between entities. Whereas attention based feature aggregation has been shown to boost performance for general vision tasks [9, 8], the examples in Figure 1 provide evidence that relationships between distinct entities may not be sufficiently captured. In this paper we address this limitation by directly supervising the learning of attention. An affinity graph is first built from the pairwise attention weights and a novel target affinity mass loss is then applied to guide the learning of attention between distinct objects, allowing the recovery of more plausible relationships.

2.3 Mini-batch Training

The training of a neural network often requires working with mini-batches of data, because typical datasets are too large for present architectures to handle. The optimization of mini-batch training is thus a research topic in its own right. Much work has focused on improving the learning strategies, going beyond stochastic gradient descent (SGD), including [21, 3, 24, 12]. In addition, batch normalization [10] has been shown to improve the speed, performance, and stability of mini-batch training, via the normalization of each neuron’s output to form a unified Gaussian distribution across the mini-batch.

In the present article we show that our affinity supervision on a graph built from mini-batch features can benefit the training of a neural network. By increasing the affinity (similarity) between mini-batch entries that belong to the same category, performance in image classification on a diverse set of benchmarks is consistently improved. We shall discuss mini-batch affinity learning in more detail in Section 5.

3 Affinity Graph Supervision

We now introduce our approach to supervising the weights in an affinity graph. Later we shall cover two applications: affinity supervision on visual attention networks (built on top of Relation Networks [7]) in Section 4 and affinity supervision on a batch similarity graph in Section 5.

3.1 Affinity Graph

We assume that there are $N$ entities generated by a feature embedding framework, for example, a region proposal network (RPN) together with ROI pooling on a single image [23], or a regular CNN applied over a batch of images. Let $f^i$ be the embedding feature for the $i$-th entity. We define an affinity function $\mathcal{F}$ which computes an affinity weight $\omega^{mn}$ between a pair of entities $m$ and $n$, as

$$\omega^{mn} = \mathcal{F}(f^m, f^n). \tag{1}$$

A specific form of this affinity function applied in attention networks [7, 26] is reviewed in Section 4, and another simple form of this affinity function applied in batch training is defined in Section 5.

We now build an affinity graph $G$ whose vertices $m$ represent entities in the data source with features $F_{in} = \{f^m\}$ and whose edge weights $\{\omega^{mn}\}$ represent pairwise affinities between the vertices. We define the graph adjacency matrix $W$ for this affinity graph as the $N \times N$ matrix with entries $\{\omega^{mn}\}$. We propose to supervise the learning of $W$ so that those matrix entries $\omega^{mn}$ selected by a customized supervision target matrix $T$ will increase, thus gaining emphasis over the other entries.

3.2 Affinity Target

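We now explain the role of a supervision target matrix $T$ for affinity graph learning. In general, $T \in \mathbb{R}^{N \times N}$ with

$$T[i,j] = \begin{cases} 1 & \text{if } (i,j) \in S, \\ 0 & \text{otherwise,} \end{cases} \tag{2}$$

where $S$ stands for a set of possible connections between entities in the data source.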

Figure 2: An overview of our affinity graph supervision in visual attention networks, in application to two tasks. The blue dashed box surrounds the visual attention network backbone, implemented according to Relation Networks [7]. The purple dashed box highlights our core component for affinity learning and for relation proposal generation. The green dashed box surrounds the branch for scene categorization. An example affinity target is visualized in the bottom left corner, with solid circles representing ground truth objects colored by their class. The dashed lines between pairs of solid circles give rise to a value of 1 for the corresponding entry in the target matrix $T$. See the text in Section 4.1 for a detailed description. A detailed illustration of the attention module is in the supplementary material.

Target Affinity Mass

We would like $W$ to have higher weights at those entries where $T[i,j] = 1$, to place emphasis on the entries that are selected by the supervision target. We capture this via a notion of target affinity mass $M$ of the affinity graph, defined as

$$M = \sum_{i,j} \tilde{W}[i,j] \, T[i,j], \tag{3}$$

where $\tilde{W} = \mathrm{softmax}(W)$ is a matrix-wise softmax operation. A study on affinity mass design is available in the supplementary material.
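As an illustration of the target affinity mass, the following minimal sketch applies a softmax over all entries of a small affinity matrix and sums the normalized weights selected by a 0/1 target. The function names `matrix_softmax` and `target_affinity_mass` and the toy matrices are assumptions made for this example; they are not taken from our implementation.

```python
import math

def matrix_softmax(W):
    """Softmax taken over all entries of a matrix W (a list of rows)."""
    peak = max(max(row) for row in W)  # subtract the maximum for numerical stability
    exps = [[math.exp(w - peak) for w in row] for row in W]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def target_affinity_mass(W, T):
    """Sum of the softmax-normalized affinities at the entries where T is 1."""
    W_tilde = matrix_softmax(W)
    return sum(w * t
               for w_row, t_row in zip(W_tilde, T)
               for w, t in zip(w_row, t_row))

# Toy example: three entities, the target selects the pair (0, 1) in both directions.
W = [[0.0, 2.0, 0.0],
     [2.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
T = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
M = target_affinity_mass(W, T)
```

Because the softmax normalizes over the whole matrix, the mass always lies between 0 and 1, and it equals 1 when the target selects every entry.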

3.3 Affinity Mass Loss

We propose to optimize the learning of the parameters $\theta$ of a neural network to achieve

$$\max_{\theta} M. \tag{4}$$

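Our aim is to devise a strategy to maximize $M$ with an empirically determined choice of loss form. There are several loss forms that could be considered, including smooth $L_1$ loss, $L_2$ loss, and a focal loss variant. Defining $x = 1 - M \in [0, 1]$, we define losses

$$L_2(x) = x^2 \tag{5}$$

and

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1, \\ |x| - 0.5 & \text{otherwise.} \end{cases} \tag{6}$$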

The focal loss on $M$ is a negative log likelihood loss, weighted by the focal normalization term proposed in [17], which is defined as

$$L_G = L_{focal}(M) = -(1 - M)^{\gamma} \log(M). \tag{7}$$

The focal term [17] helps narrow the gap between well converged affinity masses and those that are far from convergence.

We have found that the focal loss variant gives the best results in practice, as described in the ablation study reported in Section 6.4. The choice of the focal term $\gamma$ depends on the particular task, so we provide experiments to justify our choices in Section 6.4.
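The three candidate loss forms can be compared numerically with a short sketch. The standard smooth L1 and L2 definitions are applied to the gap between the target mass and 1, and the focal variant follows the focal weighting of [17]. The function names, the default focal exponent `gamma=2.0` and the clamp `eps` are assumptions chosen for illustration.

```python
import math

def l2_loss(mass):
    x = 1.0 - mass  # gap between the current target mass and its maximum of 1
    return x * x

def smooth_l1_loss(mass):
    x = 1.0 - mass
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def focal_loss(mass, gamma=2.0, eps=1e-12):
    # Negative log likelihood of the mass, down-weighted as the mass approaches 1.
    return -((1.0 - mass) ** gamma) * math.log(max(mass, eps))

for mass in (0.1, 0.5, 0.9):
    print(mass, l2_loss(mass), smooth_l1_loss(mass), focal_loss(mass))
```

With the focal exponent set to 0 the focal variant reduces to a plain negative log likelihood, while larger exponents shrink the loss of masses that are already close to 1.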

3.4 Optimization and Convergence of the Target Mass

The minimization of the affinity mass loss $L_G$ places greater emphasis on entries $\omega^{mn}$ in $W$ which correspond to ground truth connections in $S$, through network training. However, when optimized in conjunction with a main objective loss $L_{main}$, which could be an object detection loss in visual attention networks or a cross entropy loss in mini-batch training, a balance between $L_{main}$ and $L_G$ is required. The total loss can be written as

$$L = L_{main} + \lambda L_G. \tag{8}$$

Empirically, we choose separate values of the balancing factor $\lambda$ for visual attention networks and for mini-batch training. Figure 5 demonstrates the convergence of the target mass, justifying the effectiveness of using the loss $L_G$ in the optimization of equation 4.

4 Affinity in Attention Networks

We review the computation of attention weights in [26], given a pair of nodes from the attention graph defined in Section 3.1. Let an entity node $m$ consist of its feature embedding, defined as $f^m$. The collection of input features of all the nodes then becomes $F_{in} = \{f^m\}$. Consider node $m$ as a reference object with the attention weight $\tilde{\omega}^{mn}$ indicating its affinity to a surrounding entity node $n$. This affinity is computed as a softmax activation over the scaled dot products $\omega^{mn}$ defined as:

$$\omega^{mn} = \frac{\langle W_K f^m, W_Q f^n \rangle}{\sqrt{d_k}}, \qquad \tilde{\omega}^{mn} = \frac{\exp(\omega^{mn})}{\sum_{k} \exp(\omega^{mk})}. \tag{9}$$

Both $W_K$ and $W_Q$ are matrices and so this linear transformation projects the embedding features $f^m$ and $f^n$ into metric spaces to measure how well they match. The feature dimension after projection is $d_k$. With the above formulation, the attention graph affinity matrix is defined as $W = \{\omega^{mn}\}$. For a given reference entity node $m$, the attention module also outputs a weighted aggregation of $m$’s neighbouring nodes’ features, which is

$$f^m_{out} = \sum_{n} \tilde{\omega}^{mn} f^n. \tag{10}$$

The set of feature outputs for all nodes is thus defined as $F_{out} = \{f^m_{out}\}$. Additional details are provided in [26, 7].
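The attention computation reviewed above can be sketched in a few lines. This is a simplified illustration rather than the module of [7, 26]: it assumes a single attention head, normalizes each reference node's scores over its surrounding nodes, and aggregates the raw input features. The names `attention_aggregate`, `W_K` and `W_Q` are chosen for this example.

```python
import math

def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def attention_aggregate(features, W_K, W_Q):
    """Return (scaled dot products, softmax attention weights, aggregated features)."""
    n = len(features)
    d_k = len(W_K)  # dimension after projection
    keys = [matvec(W_K, f) for f in features]
    queries = [matvec(W_Q, f) for f in features]
    scores = [[sum(k * q for k, q in zip(keys[m], queries[j])) / math.sqrt(d_k)
               for j in range(n)] for m in range(n)]
    weights = []
    for row in scores:  # softmax over the surrounding nodes of each reference node
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    dim = len(features[0])
    outputs = [[sum(weights[m][j] * features[j][i] for j in range(n))
                for i in range(dim)] for m in range(n)]
    return scores, weights, outputs

features = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
scores, weights, outputs = attention_aggregate(features, identity, identity)
```

In the affinity supervision setting, the matrix of scaled dot products `scores` plays the role of the affinity matrix to which the target affinity mass is applied.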

4.1 Affinity Target Design

For visual attention networks, we want our attention weights to focus on relationships between objects from different categories, so for each entry $T[i,j]$ of the supervision target matrix $T$, we assign $T[i,j] = 1$ only when:

  1. proposal $i$ overlaps with ground truth object $\alpha$’s bounding box with an intersection over union (IoU) above a fixed threshold.

  2. proposal $j$ overlaps with ground truth object $\beta$’s bounding box with an IoU above the same threshold.

  3. ground truth objects $\alpha$ and $\beta$ are two different objects coming from different classes.

Note that no relation annotation is required to construct such a supervision target.
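The three conditions above can be turned into a small routine that builds the target matrix from region proposals and ground truth boxes. This is a sketch under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the IoU threshold of 0.5 is a placeholder value rather than the setting used in our experiments, and the names `iou` and `build_attention_target` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def build_attention_target(proposals, gt_boxes, gt_classes, iou_threshold=0.5):
    """T[i][j] = 1 when proposals i and j cover two different ground truth
    objects that belong to different classes (iou_threshold is a placeholder)."""
    n = len(proposals)
    # Ground truth objects that each proposal overlaps sufficiently.
    matches = [[g for g, box in enumerate(gt_boxes) if iou(p, box) >= iou_threshold]
               for p in proposals]
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if any(a != b and gt_classes[a] != gt_classes[b]
                   for a in matches[i] for b in matches[j]):
                T[i][j] = 1
    return T

proposals = [(0, 0, 10, 10), (1, 0, 10, 10), (20, 20, 30, 30)]
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
gt_classes = ["person", "dog"]
T = build_attention_target(proposals, gt_boxes, gt_classes)
```

In this toy example the first two proposals cover the same object, so no target entry links them, whereas both are linked to the third proposal, which covers an object of a different class.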

We choose to emphasize relationships between exemplars from different categories in the target matrix, because this can provide additional contextual features in the attention aggregation (equation 10) for certain tasks. Emphasizing relationships between objects within the same category might be better suited to modeling co-occurrence. We provide a visualization of the affinity target and additional studies in the supplementary material. We now discuss applications that could benefit from affinity supervision of the attention weights: object detection, relationship proposal generation, and scene categorization.

4.2 Object Detection and Relationship Proposals

In Figure 2 (part A to part B) we demonstrate the use of attention networks for object detection and relationship proposal generation. Here part A is identical to Relation Networks [7]. The network is end-to-end trainable with detection loss, RPN loss and the target affinity mass loss. In addition to the ROI pooling features from the Faster R-CNN backbone of [23], contextual features from attention aggregation are applied to boost detection performance. The final feature descriptor for the detection head combines the ROI pooling features with the aggregated contextual features, following [7]. In parallel, the attention matrix output is used to generate relationship proposals by finding the top K weighted pairs in the matrix.
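Selecting the top K weighted pairs from an attention matrix can be sketched as follows. Skipping the diagonal, so that a region is never paired with itself, is an assumption of this example, and the function name `top_k_pairs` is hypothetical.

```python
import heapq

def top_k_pairs(W, k):
    """Return the k highest-weighted off-diagonal entries as (weight, m, n) tuples."""
    candidates = ((W[m][n], m, n)
                  for m in range(len(W))
                  for n in range(len(W))
                  if m != n)
    return heapq.nlargest(k, candidates)

W = [[0.90, 0.10, 0.50],
     [0.20, 0.80, 0.30],
     [0.70, 0.05, 0.60]]
proposals = top_k_pairs(W, 2)
```

Using `heapq.nlargest` avoids sorting all pairs when K is much smaller than the number of entries in the matrix.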

4.3 Scene Categorization

In Figure 2 (part A to part C) we demonstrate an application of visual attention networks to scene categorization. Since there are no bounding box annotations in most scene recognition datasets, we adopt a visual attention network (described in the previous section), pretrained on the MSCOCO dataset, in conjunction with a new scene recognition branch (part C in Figure 2), to perform scene recognition. From the CNN backbone, we apply an additional convolution layer, followed by a global average pooling to acquire the scene level feature descriptor $F_s$. The attention module takes as input the object proposals’ visual features $F_{in}$, and outputs the aggregation result as the scene contextual feature $F_c$. The input to the scene classification head thus combines $F_s$ and $F_c$, and the class scores are output. In order to maintain the learned relationship weights from the pre-trained visual attention network, which helps encode object relation context in the aggregation result $F_c$, we fix the parameters in part A (blue box), but make all other layers in part C trainable.

5 Affinity in mini-Batch

Moving beyond the specific problems of object detection, relationship proposal generation and scene categorization, we now turn to a more general application of affinity supervision, that of mini-batch training in neural networks. Owing to the large size of most databases and limitations in memory, virtually all deep learning models are trained using mini-batches. We shall demonstrate that emphasizing pairwise affinities between entities during training can boost performance for a variety of image classification tasks.

5.1 Affinity Graph

We consider image classification over a batch of $N$ images, processed by a convolutional neural network (CNN) to generate feature representations. Using the notation in Section 3, we denote the feature vectors of this batch of images as $F_{in} = \{f^i\}$, where $i$ is the image index in the batch. We then build a batch affinity graph $G$ whose nodes represent images, and whose edges $\omega^{mn}$ encode pairwise feature similarity between nodes $m$ and $n$.

Distance Metric.

A straightforward $L_2$ distance based measure can be applied to compute the edge weights as

$$\omega^{mn} = \mathcal{F}(f^m, f^n) = -\frac{\| f^m - f^n \|_2^2}{2}. \tag{11}$$

Figure 3: An overview of our affinity graph supervision in mini-batch training of a standard convolutional neural network. Blue box: CNN backbone for image classification. Purple box: Affinity supervision module for mini-batch training. The colored tiles represent entries of the affinity matrix $W$ and target $T$, where a darker color denotes a larger numerical value. Minimization of the affinity mass loss aims to increase the value of the purple squares representing entries in mass $M$ (see equation 3).

5.2 Affinity Target Design

In the mini-batch training setting, we would like feature representations from the same class to be closer to each other in a metric space, with those from different classes being spread apart. To this end, we build the affinity target matrix $T$ as follows. For each entry $T[i,j]$ in the matrix, we assign $T[i,j] = 1$ only when mini-batch nodes $i$ and $j$ belong to the same category. Thus, the affinity target here selects those entries in $W$ which represent pairwise similarity between images from the same class. During the optimization of the affinity mass loss (defined in Section 3.3), the network will increase the affinity value from the entries in $W$ selected by $T$, while suppressing the other ones. This should in principle lead to improved representation learning and thus benefit the underlying classification task.

5.3 Overview of Approach

A schematic overview of our mini-batch affinity learning approach is presented in Figure 3. Given a batch of $N$ images, we first generate the feature representations $F_{in}$ from a CNN followed by fully connected layers. We then send $F_{in}$ to an affinity graph module, which contains a pairwise distance metric computation followed by a matrix-wise softmax activation, to acquire the affinity graph matrix $\tilde{W}$. Next, we build the affinity target matrix $T$ from the image category labels following Section 5.2. An element-wise multiplication of $\tilde{W}$ with $T$ is used to acquire the target affinity mass $M$, which is used in computing the affinity mass loss. During training, the network is optimized by both the cross entropy loss and the target affinity loss $L_G$, using the balancing scheme discussed in Section 3.4.
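The pipeline described in this section can be sketched end to end on a toy batch. The sketch is illustrative only: the negative squared Euclidean distance used as affinity, the focal exponent `gamma=2.0`, the balancing factor `lam=0.1` and the function names are assumptions, not the exact settings of our experiments.

```python
import math

def batch_affinity_loss(features, labels, gamma=2.0):
    """Return (target affinity mass, focal affinity loss) for one mini-batch."""
    n = len(features)
    # Pairwise affinity: negative squared Euclidean distance (assumed form),
    # so that closer features receive a larger weight.
    W = [[-sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
          for j in range(n)] for i in range(n)]
    # Matrix-wise softmax over all n * n entries.
    peak = max(max(row) for row in W)
    exps = [[math.exp(w - peak) for w in row] for row in W]
    total = sum(sum(row) for row in exps)
    # Target: entries whose two images share a class label (diagonal included).
    selected = sum(exps[i][j] for i in range(n) for j in range(n)
                   if labels[i] == labels[j])
    mass = min(1.0, selected / total)
    loss = -((1.0 - mass) ** gamma) * math.log(mass)
    return mass, loss

def total_loss(main_loss, affinity_loss, lam=0.1):
    """Main objective plus the weighted affinity mass loss (lam is a placeholder)."""
    return main_loss + lam * affinity_loss

features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
mass_good, loss_good = batch_affinity_loss(features, labels=[0, 0, 1, 1])
mass_bad, loss_bad = batch_affinity_loss(features, labels=[0, 1, 0, 1])
```

When the labels agree with the two feature clusters, nearly all of the affinity mass falls on the target entries and the loss is close to zero. When the labels cut across the clusters, about half of the mass is selected and the loss is clearly larger, which is the signal that pulls same-class features together during training.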

6 Experiments

6.1 Datasets

VOC07: which is part of the PASCAL VOC detection dataset [4], with 5k images in the trainval set and 5k in the test set. We used this trainval/test split for model ablation purposes.
MSCOCO: which consists of 80 object categories [18]. We used the 30k validation images for training and the 5k “minival” images for testing, which is common practice [7].
Visual Genome: which is a large relationship understanding benchmark [14], consisting of 150 object categories and human annotated relationship labels between objects. We used 70k images for training and 30K for testing, as in the scene graph literature [31, 30].
MIT67: which is a scene categorization benchmark with 67 scene categories, with 80 training images and 20 test images in each category [22]. We used this official split.
CIFAR10/100: which is a popular benchmark dataset containing 32 by 32 tiny images from 10 or 100 categories [15]. We used the official train/test split and we randomly sampled 10% of train set to form a validation set.
Tiny Imagenet: which is a simplified version of the ILSVRC 2012 image classification challenge [2] containing 200 classes [25] with 500 training images and 50 validation images in each class. We used the official validation set as the test set since the official test set is not publicly available. For validation, we randomly sample 10% of the training set.

6.2 Network Training Details

Visual Attention Networks.

We first train visual attention networks [7] end-to-end, using detection loss, RPN loss and affinity mass loss (Figure 2 parts A and B). The loss scale $\lambda$ for the affinity loss is chosen as discussed in Section 3.4. Upon convergence, the network can be directly applied for object detection and relationship proposal tasks. For scene categorization, we first acquire a visual attention network that is pretrained on the COCO dataset, and then use the structural modification in Section 6.6 (Figure 2 parts A and C) to fine tune it on the MIT67 dataset. Unless stated otherwise, all visual attention networks are based on a ResNet101 [6] architecture, trained with a batch size of 2 (images), using an initial learning rate which is decreased after 5 epochs. There are 8 epochs in total for each training session. We apply stochastic gradient descent (SGD) with a momentum optimizer. We evaluate the model at the end of 8 epochs on the test set to report our results.

Mini-batch Affinity Supervision.

We applied various architectures including ResNet-20/56/110 for CIFAR and ResNet-18/50/101 for tiny ImageNet, as described in [6]. The CIFAR networks are trained for 200 epochs with a batch size of 128. We set the initial learning rate to 0.1 and reduce it by a factor of 10 at epochs 100 and 150, respectively. The tiny ImageNet networks are trained for 90 epochs with a batch size of 128, an initial learning rate of 0.1, and a factor of 10 reduction at epochs 30 and 60. For all experiments in mini-batch affinity supervision, the SGD optimizer with momentum and weight decay is applied. For data augmentation during training, we have applied random horizontal flipping. During training we save the best performing model on the validation set, and report test set performance on this model.
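The step learning rate schedules described above can be written as a single function of the epoch index. The sketch assumes zero-based epochs and that each reduction takes effect at the start of the listed epoch; the name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr=0.1, milestones=(100, 150), factor=0.1):
    """Learning rate after applying one reduction per milestone already reached."""
    reductions = sum(epoch >= m for m in milestones)
    return base_lr * factor ** reductions

# CIFAR schedule: 200 epochs, reductions at epochs 100 and 150.
cifar_rates = [step_lr(e) for e in (0, 99, 100, 150, 199)]
# Tiny ImageNet schedule: 90 epochs, reductions at epochs 30 and 60.
tiny_rates = [step_lr(e, milestones=(30, 60)) for e in (0, 30, 60, 89)]
```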

| VOC07 Ablation | F-RCNN [23] | RelNet [7] | smooth L1 | L2 | focal, $\gamma = 0$ | focal, $\gamma = 2$ | focal, $\gamma = 5$ |
|---|---|---|---|---|---|---|---|
| mAP@all (%) | 47.0 | 47.7 ± 0.1 | 48.0 ± 0.1 | 47.7 ± 0.2 | 47.9 ± 0.2 | 48.2 ± 0.1 | 48.6 ± 0.1 |
| mAP@0.5 (%) | 78.2 | 79.3 ± 0.2 | 79.6 ± 0.2 | 79.7 ± 0.2 | 79.4 ± 0.1 | 79.9 ± 0.2 | 80.0 ± 0.2 |
| recall@5k (%) | - | 43.5 | 60.3 ± 0.3 | 64.6 ± 0.5 | 62.1 ± 0.3 | 69.9 ± 0.3 | 66.8 ± 0.2 |

Table 1: An ablation study on loss functions comparing against the baseline Faster R-CNN [23] and Relation Networks [7], using the VOC07 database. The results are reported as percentages (%) averaged over 3 runs. The relationship recall metric is also reported with ground truth relation labels constructed as described in Section 4.1, using only object class labels.

| MIT67 | CNN | CNN | CNN + ROIs | CNN + Attn | CNN + Attn + $L_G$ |
|---|---|---|---|---|---|
| Pretraining | Imgnet | Imgnet+COCO | Imgnet+COCO | Imgnet+COCO | Imgnet+COCO |
| Features | $F_s$ | $F_s$ | $F_s$, $F_{in}$ | $F_s$, $F_c$ | $F_s$, $F_c$ |
| Accuracy (%) | 75.1 | 76.8 | 78.0 ± 0.3 | 77.1 ± 0.2 | **80.2 ± 0.3** |

Table 2: MIT67 Scene Categorization Results, averaged over 3 runs. A visual attention network with affinity supervision gives the best result (the boldfaced entry), with an improvement over a non-affinity supervised version (fourth column) and the baseline methods (columns 1 to 3). See the text in Section 6.6 for details. $F_s$, $F_c$ and $F_{in}$ are described in Section 4.3.

6.3 Tasks and Metrics

We evaluate affinity graph supervision on the following tasks, using the associated performance metrics.
Relationship Proposal Generation. We evaluate the learned relationships on the Visual Genome dataset, using a recall metric which measures the percentage of ground truth relations that are covered in the predicted top K relationship list, which is consistent with [32, 31, 30].
Classification. For the MIT67, CIFAR10/100 and Tiny ImageNet evaluation, we use classification accuracy.
Object Detection. For completeness we also evaluate object detection on VOC07, using mAP (mean average precision) as the evaluation metric [4, 18]. Additional detection results on MSCOCO are in the supplementary material.
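The recall metric used for relationship proposals can be illustrated with a short sketch. It simplifies the evaluation by identifying each relation with a pair of object indices, whereas the benchmark evaluation matches predicted and ground truth boxes; the function name `recall_at_k` is hypothetical.

```python
def recall_at_k(ranked_pairs, gt_pairs, k):
    """Fraction of ground truth relations covered by the top-k predicted pairs."""
    if not gt_pairs:
        return 0.0
    top = set(ranked_pairs[:k])
    return len(top & set(gt_pairs)) / len(gt_pairs)

ranked = [(0, 1), (2, 3), (1, 0), (4, 5)]  # predictions sorted by decreasing weight
ground_truth = [(0, 1), (4, 5)]
r2 = recall_at_k(ranked, ground_truth, 2)
r4 = recall_at_k(ranked, ground_truth, 4)
```

Here half of the ground truth relations are recovered within the top 2 predictions and all of them within the top 4.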

6.4 Ablation Study on Loss Functions

We first carry out ablation studies to examine different loss functions for optimizing the target affinity mass $M$ as well as varying focal terms $\gamma$, as introduced in Section 3.3. The results in Table 1 show that focal loss is in general better than smooth L1 and L2 losses, when supervising the target mass. In our experiments on visual attention networks, we therefore apply focal loss with $\gamma = 2$, which empirically gives the best performance in terms of recovering relationships while still maintaining a good performance in the detection task. The results in Table 1 serve solely to determine the best loss configuration. Here we do not claim improvement on detection tasks. The results of additional tests using ablated models will be provided in the supplementary material.

6.5 Relationship Proposal Task

Figure 4 compares the relationships recovered on the Visual Genome dataset, by a default visual attention network “baseline” model (similar to [7]), our affinity supervised network with affinity targets built using only object class labels “aff-sup-obj” (see Section 4.1), and an affinity target built from human annotated ground truth relation labels “aff-sup-rel”. We also include the reported recall metric from Relationship Proposal Networks [32], which is a state-of-the-art level one-stage relationship learning network with strong supervision, using ground truth relationship annotations. Notably, our proposed affinity mass loss does not require potentially costly human annotated relationship labels for learning (only object class labels were used) and yet it achieves the same level of performance as the present state-of-the-art [32] (the blue curve in Figure 4). When supervised with a target built from the ground truth relation labels instead of the object labels, we considerably outperform relation proposal networks (by 25% in relative terms for all K thresholds) with this recall metric (the red curve).

Figure 4: We show the percentage of the true relations that are in the top K retrieved relations, with varying K, in a relation proposal task. We compare a baseline network (black), Relation Proposal Networks [32] (blue), our affinity supervision using object class labels (but no explicit relations) (orange) and our affinity supervision with ground truth relation labels (red). We match the state of the art with no ground truth relation labels used (the overlapping blue and orange curves). We outperform the state of the art by a large margin (25% in relative terms) when ground truth relations are used.

Figure 5: An ablation study on mini-batch affinity supervision, with the evaluation metric on a test set over epochs (horizontal axis), with the best result highlighted with a red dashed box. Left Plots: classification error rates and target mass with varying focal loss parameter $\gamma$. Right Plots: error rates and target mass with varying loss balancing factor $\lambda$ (defined in Section 3.4).

Figure 6: Left: t-SNE plot of learned feature representations for a baseline ResNet20 network on the CIFAR10 dataset. Right: t-SNE plot for the affinity supervised ResNet20 network.

| CIFAR-10 | ResNet 20 | ResNet 56 | ResNet 110 |
|---|---|---|---|
| base CNN | 91.34 ± 0.27 | 92.24 ± 0.48 | 92.64 ± 0.59 |
| Affinity Sup | 92.03 ± 0.21 | 92.90 ± 0.35 | 93.42 ± 0.38 |

| CIFAR-100 | ResNet 20 | ResNet 56 | ResNet 110 |
|---|---|---|---|
| base CNN | 66.51 ± 0.46 | 68.36 ± 0.68 | 69.12 ± 0.63 |
| Affinity Sup | 67.27 ± 0.31 | 69.79 ± 0.59 | 70.5 ± 0.60 |

| Tiny Imagenet | ResNet 18 | ResNet 50 | ResNet 101 |
|---|---|---|---|
| base CNN | 48.35 ± 0.27 | 49.86 ± 0.80 | 50.72 ± 0.82 |
| Affinity Sup | 49.30 ± 0.21 | 51.04 ± 0.68 | 51.82 ± 0.71 |

Table 3: Batch Affinity Supervision results. Numbers are classification accuracy in percentages. CIFAR results are reported over 10 runs and tiny ImageNet over 5 runs.

6.6 Scene Categorization Task

For scene categorization we adopt the base visual attention network (Figure 2, part A), and add an additional scene task branch (Figure 2, part C) to fine tune it on MIT67, as discussed in Section 4.3. Table 2 shows the results of applying this model to the MIT67 dataset. We refer to the baseline CNN as “CNN” (first column), which is an ImageNet pretrained ResNet101 model directly applied to an image classification task. In the second column, we first acquire a COCO pretrained visual attention network (Figure 2, part A), and fine tune it using only the scene level feature $F_s$ (Figure 2, part C). In the third column, for the same COCO pretrained visual attention network, we concatenate the object proposals’ ROI pooling features with $F_s$ to serve as a meta scene level descriptor. In the fourth and fifth columns, we apply the full scene architecture in Figure 2 part C, but with a visual attention network that is pretrained without and with (supervised) target affinity loss, respectively. The affinity supervised case (fifth column) demonstrates a non-trivial improvement over the baseline (first to third columns) and also significantly outperforms the unsupervised case (fourth column). This demonstrates that the attention weights learned solely by minimizing detection loss do not generalize well to a scene task, whereas those learned by affinity supervision can.

6.7 Mini-Batch Affinity Supervision

We conducted a model ablation study on the $\gamma$ and $\lambda$ parameters, introduced in Section 3, as summarized in Figure 5. In the subsequent experiments we chose the values of $\gamma$ and $\lambda$ based on the ablation plots for error rates in Figure 5.

Convergence of Target Mass.

We plot results showing convergence of the target affinity mass during learning in Figure 5. There is a drastic improvement over the baseline target mass convergence, when affinity supervision is enabled. The chosen $\lambda$ empirically gives sufficient convergence of the target mass (right-most plot in Figure 5).

Per-class feature separation.

A comparison of t-SNE [19] plots on learned feature representations from 1) baseline CNN and 2) CNN supervised with affinity mass loss is presented in Figure 6. Note that the feature separation between different classes is better in our case.

Results.

We now summarize the results for mini-batch affinity learning on CIFAR10, CIFAR100 and TinyImageNet in Table 3. Overall, we have a consistent improvement over the baseline, when using the affinity supervision in mini-batch training. In particular, for datasets with a large number of categories, such as CIFAR100 (100 classes) and tiny ImageNet (200 classes), the gain in accuracy ranges from roughly 0.8 to 1.4 percentage points. Another advantage of affinity supervision is that we do not introduce any additional network layers or parameters, except for the need for computing the affinity matrix and its loss. Therefore, we found the training run-time of affinity supervision to be very close to that of the baseline CNN.

7 Conclusion

In this paper we have addressed an overlooked problem in the computer vision literature, which is the direct supervision of affinity graphs applied in deep models. Our main methodological contribution is the introduction of a novel target affinity mass, and its optimization using an affinity mass loss. These novel components lead to demonstrable improvements in relationship retrieval. In turn, we have shown that the improved recovery of relationships between objects boosts scene categorization performance. We have also explored a more general problem, which is the supervision of affinity in mini-batches. Here, in diverse visual recognition problems, we see improvements once again. Given that our affinity supervision approach introduces no additional parameters or layers in the neural network, it adds little computational overhead to the baseline architecture. Hence it can be adopted by the community for affinity based training in other computer vision applications as well.

Acknowledgments

We thank the Natural Sciences and Engineering Research Council of Canada (NSERC) and Adobe Research, for research funding.

Supplementary Material

Appendix A Runtime Comparisons

VOC07-Res101               1-epoch Runtime   GPU Memory
Baseline                   79.7 minutes      3137 MB
Baseline + affinity loss   82.3 minutes      3141 MB

CIFAR10-Res110             1-epoch Runtime   GPU Memory
Baseline                   34.7 seconds      2117 MB
Baseline + affinity loss   35.9 seconds      2139 MB

Table 4: Training time (runtime) and GPU memory consumption for the baseline and our affinity supervised network (denoted as “Baseline + affinity loss”). For VOC07 we apply RelNet [7] as the baseline. Its affinity supervised version is discussed in Section 4.2 of our article. For CIFAR-10, the baseline is a ResNet-110 network [6] and its affinity supervised version is discussed in Section 5.3 of our article.

We provide an efficiency analysis for visual attention networks as well as mini-batch training, with and without the affinity loss. The results are summarized in Table 4, with a ResNet101 structure trained on the VOC07 dataset for attention networks and a ResNet110 structure trained on CIFAR10 for mini-batch training. For all experiments reported in Table 4, we used a machine with a single Titan XP GPU, an Intel Core i9 CPU and 64 GB of RAM. The affinity supervision did introduce a small percentage increase in runtime and GPU memory, but the benefits are appreciable, as reflected in the multiple experiments reported in our article.
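To make the "small percentage" concrete, the following sketch recomputes the relative overheads from the numbers in Table 4. The helper name `relative_increase` and the dictionary layout are illustrative choices, not part of the original experiments.

```python
# Numbers copied from Table 4; names and layout are illustrative only.
def relative_increase(baseline, supervised):
    """Return the percentage increase of `supervised` over `baseline`."""
    return 100.0 * (supervised - baseline) / baseline

table4 = {
    "VOC07-Res101 runtime (min)": (79.7, 82.3),
    "VOC07-Res101 memory (MB)": (3137, 3141),
    "CIFAR10-Res110 runtime (s)": (34.7, 35.9),
    "CIFAR10-Res110 memory (MB)": (2117, 2139),
}

# Percentage overhead of the affinity supervised network for each row.
overheads = {name: relative_increase(b, s) for name, (b, s) in table4.items()}
for name, pct in overheads.items():
    print(f"{name}: +{pct:.2f}%")
```

With these values, all four overheads come out below 4%.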

[Figure 7 panels. Left: cases with improved results. Right: cases with similar results. Each case shows RelNet [7] next to its affinity supervised version.]
Figure 7: Additional comparisons of recovered relationships on test images, including cases with a clear improvement over the baseline (left) and cases where the results are comparable (right). The affinity supervision applied to acquire these results does not use human annotated relationship labels. See the text in Figure 1 of the main article for details about the representation. Zoom in on the PDF file to see the attention weight values.

Appendix B Object Detection Results

As promised in our article, we present the object detection results of affinity supervised attention networks in Table 5. We report results on VOC07 and the COCO split we applied in our article. In both cases we improve upon the baseline and slightly outperform the unsupervised case (similar to Relation Networks [7]). This suggests that relation weights learned using affinity supervision are at least as good as those from the unsupervised case [7], in terms of object detection performance.

VOC07          base F-RCNN   RelNet [7]    RelNet + affinity loss
avg mAP (%)    47.0          47.7 ± 0.1    48.2 ± 0.1
mAP@0.5 (%)    78.2          79.3 ± 0.2    79.9 ± 0.2

mini COCO      base F-RCNN   RelNet [7]    RelNet + affinity loss
avg mAP (%)    26.8          27.5          27.9
mAP@0.5 (%)    46.6          47.4          47.8

Table 5: Object detection results. mAP@0.5: average precision at a bounding box overlap threshold of 0.5. avg mAP: mAP averaged over multiple bounding box overlap thresholds. VOC07 experiments are reported over 3 runs, demonstrating stability. The detection task loss is as defined in [23]; “affinity loss” stands for the target affinity mass loss defined in Section 3.3 of the main article.

Convergence of Target Mass

We also provide results showing the convergence of target mass in the context of visual attention networks, in Table 6. It is evident that the affinity mass loss succeeds in optimizing the target mass, when compared with a baseline Relation Network model [7]. This is also indirectly supported by the dramatic improvement in the recall metric, reported in Table 1 of the main article and Table 8, when comparing the baseline visual attention network with its affinity supervised version.

COCO                     Training   Testing
RelNet [7]               0.020      0.013
RelNet + affinity loss   0.747      0.459

Table 6: We compare target mass values for a visual attention network supervised with (RelNet + affinity loss) and without (RelNet) the affinity mass loss, using the target constructed in Section 4.1 of our main article. The values reported are evaluated on the COCO split that is described in the experiment section.

Appendix C Additional Illustrations

Additional examples of visual relationships, recovered using the baseline Relation Network [7] and its affinity supervised version (discussed in Section 4.2 of our article), are provided in Figure 7. Here we allow all regions with different object class labels to have a potential relation with one another. No human annotated relationship labels are used during training. We include both examples showing improvement and ones where the results are comparable. Whereas the baseline method can be effective at times (Figure 7, right half), the affinity supervision improves its consistency. This claim is also supported by the relationship proposal results on Visual Genome reported in Section 6.5 of the main article.

Figure 8: A detailed illustration of the structure inside the “attention module” shown in Figure 2 of the main article. An explanation of each step in this figure is provided at the beginning of Section 4 in the article.

Appendix D Structure inside Attention Module

A structural overview of the visual attention module used throughout the main article is presented in Figure 8.

Appendix E Supervision Targets

The proposed loss requires a task-specific design of a supervision target matrix. The flexibility in choosing this target is one of the core advantages of our affinity supervision: it can be any user-designed matrix. In fact, in the experiments reported in the main article, we construct this matrix automatically using only object class labels, i.e., no labeled relations are required. In the context of mini-batch training, the target design is straightforward. Here we aim to reduce the within-class feature distances between batch images. Thus, a same-category target is adopted. This target increases the similarity metric between same-class connections in the weight matrix and, because of the matrix-wise softmax activation, suppresses connections between instances from different classes.
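The mini-batch case can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the target is taken to be binary, and self-connections on the diagonal are excluded, which the text does not specify.

```python
import math

def same_category_target(labels):
    """Binary target: 1 where two distinct batch items share a class label.
    Excluding the diagonal is an assumption made for this sketch."""
    n = len(labels)
    return [[1 if i != j and labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

def matrix_softmax(weights):
    """Softmax over all entries of the matrix at once, so all entries sum to 1."""
    peak = max(max(row) for row in weights)
    exps = [[math.exp(w - peak) for w in row] for row in weights]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def target_mass(weights, target):
    """Sum of matrix-softmaxed affinities over the entries selected by the target."""
    soft = matrix_softmax(weights)
    return sum(s * t for s_row, t_row in zip(soft, target)
               for s, t in zip(s_row, t_row))

labels = [0, 0, 1]                      # toy batch: two images of class 0, one of class 1
target = same_category_target(labels)
uniform = [[0.0] * 3 for _ in range(3)]             # no preference between pairs
boosted = [[0.0, 3.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # same-class pairs raised
mass_uniform = target_mass(uniform, target)
mass_boosted = target_mass(boosted, target)
```

With uniform weights the target mass is 2/9, since two of the nine entries are selected. Raising the same-class affinities increases the mass, and because the softmax is taken over the whole matrix, the share left for different-class pairs shrinks accordingly.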

For the case of visual attention networks, various supervision targets can be applied to adapt the method for different downstream applications. In the main draft, our goal is to improve visual recognition using contextual features aggregated by the attention module, with improved object-wise relationship recovery. Thus, the supervision target emphasizes relationships between instances from different categories. However, given a distinct vision task, such as learning human-to-human interaction or human-to-X interaction, the target could also be constructed by only selecting human-to-human or human-to-X instance connections, while suppressing other possibilities.

To support the idea that the target is adaptive, we provide an exemplar ablation study on the VOC07 detection task. We first consider the supervision target proposed in the main draft, which we refer to as different-category supervision. We then consider the case where attention between distinct objects belonging to the same category is also of interest, leading to same-category connections in the target matrix. We refer to this as different-instance supervision. We provide a visual example of these supervision targets in Figure 9. In Table 7, we provide object detection results on VOC07 when supervising the affinity graph using different targets.

[Figure 9 panels. Left: Different-Category. Right: Different-Instance.]
Figure 9: Visualization of supervision targets for attention networks. The blue box indicates a fixed reference object and the orange boxes indicate the objects that have a ground truth relationship with it, for which we assign a positive entry in the target matrix. Left: different-category supervision. Note that the sheep in the blue box is not related to the other sheep in the image. Right: different-instance supervision.

VOC07, varying target   Diff-Instance   Diff-Category
avg mAP (%)             47.6 ± 0.1      48.2 ± 0.2
mAP@0.5 (%)             79.5 ± 0.2      79.9 ± 0.2

Table 7: Detection results on the VOC07 dataset when varying supervision targets, where we show mean accuracy over 3 runs.
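The two targets compared in Figure 9 and Table 7 can be sketched as follows. This is one reading of the description above, with hypothetical function names and a binary target assumed: different-category supervision connects objects whose class labels differ, while different-instance supervision connects every pair of distinct objects.

```python
def different_category_target(classes):
    """1 where two objects have different class labels, else 0."""
    n = len(classes)
    return [[1 if classes[a] != classes[b] else 0 for b in range(n)]
            for a in range(n)]

def different_instance_target(classes):
    """1 for every pair of distinct objects, regardless of class, else 0."""
    n = len(classes)
    return [[1 if a != b else 0 for b in range(n)] for a in range(n)]

# Toy scene with made-up detections: two sheep and one person.
classes = ["sheep", "sheep", "person"]
diff_category = different_category_target(classes)
diff_instance = different_instance_target(classes)
```

For this toy scene the first sheep is related only to the person under different-category supervision, but to both the other sheep and the person under different-instance supervision.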

In summary, the affinity supervision can be adapted to different targets, to achieve various goals or to handle distinct downstream tasks. However, the successful construction of such a target is task dependent.

Table 8: Evaluating different target mass definitions. The results reported are relationship recall with varying top K, using the VOC07 test set. The supervision target is constructed following Section 4.1 of the main draft.

Appendix F Target Mass Definition

The definition of target mass, as in Section 3.2 of the main article, could have slightly different variations. We defined it as a summation over selected entries, at the scale of the whole matrix. However, it is entirely possible to define such a summation over a single row of the affinity matrix, when the softmax activation is applied as a row-wise operation. That is, we only consider one row of the matrix during the softmax:

(12)

We would build the target as we originally proposed, but compute the target mass in a row-wise manner and apply the affinity mass loss over the row-aggregated mass. For a given row, we define its target mass as the sum of the row-softmaxed weights over that row's target entries, and thus the affinity mass loss can be written as

(13)
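As a hedged sketch of the row-wise formulation described above, Equations (12) and (13) could take the following form. The notation is assumed rather than taken from the text: $\omega_{ij}$ is a raw affinity weight, $\tilde{\omega}_{ij}$ its row-wise softmax, $T_{ij}$ an entry of the target matrix and $M_i$ the target mass of row $i$. Whether the loss sums or averages over rows is also an assumption.

```latex
\tilde{\omega}_{ij} = \frac{\exp(\omega_{ij})}{\sum_{k} \exp(\omega_{ik})},
\qquad
M_i = \sum_{j} T_{ij}\, \tilde{\omega}_{ij},
\qquad
L_{\mathrm{row}} = -\sum_{i} \log M_i .
```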

To justify the matrix-wise formulation that we proposed in the main draft, we provide the following ablation study on the relationship recovery recall metric, using the VOC07 dataset. In the results reported here, we refer to the aforementioned row-wise target mass formulation as “row” and to the matrix-wise version used in the main draft as “mat”, but supervised with only the log-loss of the target mass. Lastly, to demonstrate the benefit of focal terms in the final loss form, which is

(14)

we define the matrix-wise supervision with focal term as “mat-focal”. The recall measurement results are summarized in Table 8. We emphasize that the recall reported here is always based on ranking the affinity weights after the matrix-wise softmax, ensuring fairness of the comparisons.
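The difference between the “mat” and “mat-focal” variants can be illustrated with a small sketch. It assumes that the focal term follows the form of the focal loss in [17] applied to the target mass, with an illustrative exponent `gamma = 2.0`; the function names and the exact form are assumptions, not taken from the text above.

```python
import math

def log_mass_loss(mass):
    """Plain log-loss on a target mass in (0, 1]: -log(M)."""
    return -math.log(mass)

def focal_mass_loss(mass, gamma=2.0):
    """Assumed focal variant: -(1 - M)^gamma * log(M).
    Setting gamma = 0 recovers the plain log-loss."""
    return -((1.0 - mass) ** gamma) * math.log(mass)

# Target mass values taken from Table 6, used here only as sample inputs.
for mass in (0.020, 0.459, 0.747):
    print(f"M={mass:.3f}  log={log_mass_loss(mass):.3f}  "
          f"focal={focal_mass_loss(mass):.3f}")
```

Under this form the focal factor leaves the loss almost unchanged when the target mass is small and down-weights it as the mass approaches 1, so training effort concentrates on graphs whose target mass is still low.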

The results suggest that the affinity weights, when supervised using the affinity mass loss regardless of its form, are better than the unsupervised case (similar to Relation Networks [7]). Between different variations of target mass and loss forms, the choice of matrix-wise formulation with focal term gives the best results.

One can further simplify the definition of target mass to a single entry of the affinity matrix, and use a binary cross entropy loss over the sigmoid activation of that entry. The loss can be written as

(15)

where the target term simply stands for the corresponding entry of the target matrix. Across multiple trials with a wide range of choices for the hyperparameter defined in Section 3.4 of the main article, we found that this single-entry formulation does not converge to a sufficiently large target mass value, and the recall metric is very close to the baseline unsupervised case. These results are therefore inferior to those of the earlier formulations.
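A minimal sketch of this single-entry alternative is given below. It uses the standard numerically stable form of binary cross entropy on logits; averaging over entries (rather than summing) and the function name are assumptions made for illustration.

```python
import math

def entrywise_bce(weights, target):
    """Mean binary cross entropy between sigmoid(weights[i][j]) and target[i][j].
    Uses the stable identity
    BCE(w, t) = max(w, 0) - w * t + log(1 + exp(-|w|))."""
    total, count = 0.0, 0
    for w_row, t_row in zip(weights, target):
        for w, t in zip(w_row, t_row):
            total += max(w, 0.0) - w * t + math.log1p(math.exp(-abs(w)))
            count += 1
    return total / count

target = [[0, 1], [1, 0]]
undecided = [[0.0, 0.0], [0.0, 0.0]]     # sigmoid gives 0.5 everywhere
aligned = [[-5.0, 5.0], [5.0, -5.0]]     # logits agree with the target
loss_undecided = entrywise_bce(undecided, target)
loss_aligned = entrywise_bce(aligned, target)
```

Each entry is scored independently here, so, unlike the softmax-based mass, raising one affinity does not by itself suppress the others.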

In our loss design, the distinction between the matrix softmax in the affinity loss and the row softmax in feature aggregation is essential (see Figure 8). In affinity learning we care about accurately representing the strength of node-to-node connections. For instance, if a node has weak connections to all its neighbors, its edges should have relatively small weights. Following related work [11, 27, 26], a row-wise softmax is applied in feature aggregation. This ensures a unified scaling of the aggregation result, so that a node with low affinity weights is not suppressed, and one with high weights is not dominant.
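The contrast between the two normalizations can be seen on a toy example. The sketch below uses made-up affinity values and hypothetical function names: node 0 has strong raw affinities and node 1 has uniformly weak ones.

```python
import math

def row_softmax(weights):
    """Normalize each row separately, so every row sums to 1."""
    out = []
    for row in weights:
        peak = max(row)
        exps = [math.exp(w - peak) for w in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def matrix_softmax(weights):
    """Normalize over all entries at once, so the whole matrix sums to 1."""
    peak = max(max(row) for row in weights)
    exps = [[math.exp(w - peak) for w in row] for row in weights]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

# Made-up raw affinities: row 0 is strongly connected, row 1 weakly connected.
weights = [[4.0, 3.0], [-2.0, -3.0]]
by_row = row_softmax(weights)
by_matrix = matrix_softmax(weights)
row_sums = [sum(r) for r in by_row]            # per-node mass under row softmax
matrix_row_mass = [sum(r) for r in by_matrix]  # per-node mass under matrix softmax
```

Under the row softmax both nodes end up with the same normalized weights, because only differences within a row matter, which gives the unified scaling wanted for aggregation. Under the matrix softmax almost all of the mass goes to node 0, so the weak connections of node 1 stay small, which is the behaviour wanted when supervising affinity strength.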

Footnotes

  1. More elaborate distance metrics could also be considered, but that is beyond the focus of this article.
  2. The network architectures are exactly the same as those in the original ResNet paper [6].
  3. For the CIFAR datasets, we also applied 4-pixel padding, followed by random cropping after horizontal flipping, following [6].

References

  1. M. Defferrard, X. Bresson and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. NIPS.
  2. J. Deng, W. Dong, R. Socher, L. Li, K. Li and L. Fei-Fei (2009) Imagenet: A large-scale hierarchical image database. CVPR.
  3. J. Duchi, E. Hazan and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.
  4. M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn and A. Zisserman (2010) The pascal visual object classes (voc) challenge. IJCV.
  5. W. Hamilton, Z. Ying and J. Leskovec (2017) Inductive representation learning on large graphs. pp. 1025–1035.
  6. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition.
  7. H. Hu, J. Gu, Z. Zhang, J. Dai and Y. Wei (2018) Relation networks for object detection. CVPR.
  8. J. Hu, L. Shen, S. Albanie, G. Sun and A. Vedaldi (2018) Gather-excite: exploiting feature context in convolutional neural networks. NIPS.
  9. J. Hu, L. Shen and G. Sun (2018) Squeeze-and-excitation networks. CVPR.
  10. S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  11. B. Jiang, Z. Zhang, D. Lin, J. Tang and B. Luo (2019) Semi-supervised learning with graph learning-convolutional networks. CVPR.
  12. D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. ICLR.
  13. T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. ICLR.
  14. R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li and D. A. Shamma (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV.
  15. A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
  16. R. Li, S. Wang, F. Zhu and J. Huang (2018) Adaptive graph convolutional neural networks. AAAI.
  17. T. Lin, P. Goyal, R. Girshick, K. He and P. Dollár (2017) Focal loss for dense object detection. CVPR.
  18. T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick (2014) Microsoft coco: common objects in context. ECCV.
  19. L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of Machine Learning Research.
  20. F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda and M. M. Bronstein (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. CVPR.
  21. N. Qian (1999) On the momentum term in gradient descent learning algorithms. Neural Networks.
  22. A. Quattoni and A. Torralba (2009) Recognizing indoor scenes. CVPR.
  23. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. NIPS.
  24. RMSprop optimizer. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf. Accessed: 2019-11-11.
  25. Tiny imagenet visual recognition challenge. https://tiny-imagenet.herokuapp.com/. Accessed: 2019-11-11.
  26. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin (2017) Attention is all you need. NIPS.
  27. P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.
  28. C. Wang, B. Samari and K. Siddiqi (2018) Local spectral graph convolution for point set feature learning. ECCV.
  29. X. Wang, R. Girshick, A. Gupta and K. He (2018) Non-local neural networks. CVPR.
  30. D. Xu, Y. Zhu, C. Choy and L. Fei-Fei (2017) Scene graph generation by iterative message passing. CVPR.
  31. R. Zellers, M. Yatskar, S. Thomson and Y. Choi (2018) Neural motifs: scene graph parsing with global context. CVPR.
  32. J. Zhang, M. Elhoseiny, S. Cohen, W. Chang and A. Elgammal (2017) Relationship proposal networks. CVPR.
  33. H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin and J. Jia (2018) Psanet: point-wise spatial attention network for scene parsing. ECCV.