Efficient Coarse-to-Fine Non-Local Module for the
Detection of Small Objects
An image is not just a collection of objects, but rather a graph where each object is related to other objects through spatial and semantic relations. Using relational reasoning modules, which allow message passing between objects, can therefore improve object detection. Current schemes apply such dedicated modules either on a specific layer of the bottom-up stream, or between already-detected objects. We show that the relational process can be better modeled in a coarse-to-fine manner and present a novel framework, applying a non-local module sequentially to increasing resolution feature-maps along the top-down stream. In this way, the inner relational process can naturally pass information from larger objects to smaller related ones. Applying the modules to fine feature-maps also allows message passing between the small objects themselves, exploiting repetitions of instances of the same class. In practice, due to the expensive memory utilization of the non-local module, it is infeasible to apply the module as currently used to high-resolution feature-maps. We efficiently redesigned the non-local module, improving it in terms of memory and number of operations and allowing it to be placed anywhere along the network. We also incorporated relative spatial information into the module, in a manner that is compatible with our efficient implementation. We show the effectiveness of our scheme by improving the results of detecting small objects on COCO by 1.5 AP over Faster RCNN and by 1 AP over using a non-local module on the bottom-up stream.
Scene understanding has shown an impressive improvement in the last few years. Since the revival of deep neural networks, there has been a significant increase in performance on a range of relevant tasks, including classification, object detection, segmentation, part localization etc.
Early works relied heavily on the hierarchical structure of bottom-up classification networks to perform additional tasks such as detection [14, 13, 41, 20], by using the last network layer to predict object locations. A next significant step, partly motivated by the human vision system, incorporated context into the detection scheme by using a bottom-up top-down architecture [30, 19, 10, 42]. This architecture combines high level contextual data from the last layers with highly localized fine-grained information expressed in lower layers. The next challenge, which is an active research area, is to incorporate relational reasoning into detection systems [2, 7, 40]. With relational reasoning, an image is treated not just as a collection of unrelated objects, but rather as a "scene graph" of entities (nodes, objects) connected by edges (relations, predicates).
In this line of development, the detection of small objects remains a difficult task. This task was shown to benefit from the use of context [9, 11], and the current work applies the use of relations to the detection of small objects.
Consider for example the images in figure 1. The repetition of instances from the same class in the image, as well as the existence of larger instances from related classes, serves as a semantic clue. It enables the detection of the tiny people in the sea (figure 1(a)), partly based on the existence of the larger people on the shore. It similarly localizes the small sports ball, partly based on the throwing man and the waiting glove (figure 1(b)).
Exploiting this information, specifically for small object detection, requires propagating information over large distances in high resolution feature-maps according to the data in a specific image. This is difficult to achieve by convolutional layers, since they transmit information in the same manner for all images, based on learning, and over short distances rather than the entire image.
Recently, a Non-Local module has been formulated and integrated into CNNs for various tasks [50, 45]. Its formalism is simple: for each output pixel in the feature-map, the scheme aggregates information from all of the input pixels, based on their similarity to the corresponding input pixel. This building block is capable of passing information between distant pixels according to their appearance, and is applicable to our current task. Using it sequentially, in a coarse-to-fine manner, enables passing semantic information from larger, easy-to-detect objects to smaller ones. The existence of the non-local (NL) module in shallower layers allows information jumps between the small objects themselves. Evidence for the above can be seen in Figures 6 and 7.
For the current needs, there are two disadvantages in the original design of the NL module. The first is its expensive computational and memory budget: the cost of the module grows quadratically with the number of pixels in the feature-map. In preceding works this block was integrated into layers of the bottom-up stream, but in our task it is integrated into lower-level, higher-resolution layers, where its memory demands become infeasible. Furthermore, in detection networks it is common practice to enlarge the input image, making the problem even worse.
The second disadvantage is the lack of relative position encoding in the NL module. Objects in the image are placed relative to one another on a 2D grid. Discarding this information source, especially in high resolution feature-maps, is not effective.
We modified the NL module to deal with the above difficulties. A simple modification, based on the associative law of matrix multiplication and exploiting the existing factorization of the similarity matrix, enables us to create a comparable building block whose cost grows only linearly with the number of pixels. Relative position encoding was added to the similarity information and gives the network the opportunity to use relative spatial information in an efficient manner. The resulting scheme still aggregates information across the entire image, but not uniformly. We named this module ENL: Efficient Non Local module.
In this paper, we use the ENL module as a reasoning module that passes information between related pixels, applying it sequentially along the top-down stream. Since it is also applied to high resolution feature maps, an efficient re-implementation of the module is essential.
Unlike other approaches, which place a relational module on the BU stream or establish relations between already detected objects, our framework can apply pairwise reasoning in a coarse-to-fine manner, guiding the detection of small objects. Applying the relational module to finer layers also enables the small objects themselves to exchange information with each other.
To summarize our contributions:
1. We efficiently redesigned the NL module (ENL), improving it in terms of memory and number of operations and allowing it to be placed anywhere along the network.
2. We incorporated relative spatial information into the NL module reasoning process, in a novel approach that keeps the efficient design of the ENL.
3. We applied the new module sequentially to increasing resolution feature-maps along the top-down stream, obtaining relational reasoning in a coarse-to-fine manner.
The improvements presented in this work go beyond the specific detection application: tasks such as semantic segmentation, fine-grained localization, image restoration and image generation, as well as other tasks in the image domain that use an encoder-decoder framework and depend on fine image details, are natural candidates for using the proposed framework.
2 Related Work
The current work combines two approaches used in the field of object detection: (a) modelling context through top down modulation and (b) using non local interactions in a deep learning framework. We briefly review related work in these domains.
Bottom Up Top Down Networks
In detection tasks, one of the major challenges is to simultaneously detect both large and small objects and parts. Early works used a pure bottom-up (BU) architecture for the task, and predictions were made only from the coarsest (topmost) feature map [14, 13, 41, 20]. Later works tried to exploit the inherent hierarchical structure of neural networks to create a multi-scale detection architecture. Some of these works performed detection using combined features from multiple layers [3, 18, 25], while others performed detection in parallel from individual layers [35, 6, 34, 47].
Recent methods incorporate context (from the last BU layer) with low level layers by adding skip connections in a bottom-up top-down (BUTD) architecture. Some schemes [49, 44, 36] used only the last layer of the top down (TD) network for prediction, while others [30, 19, 10, 42] performed prediction from several layers along the TD stream.
The latter architecture yields enhanced results, especially for small object detection, and was adopted in various detection schemes (e.g. one-stage or two-stage detection pipelines). It is assumed to successfully combine multi-scale BU data with semantic context from higher layers, serving as an elegant built-in context module.
In the current work we further enhance the representation created in the layers along the TD stream, using the pairwise information supplied by the NL module, which was already shown to be complementary to the CNN information. We show that sequentially applying this complementary source of information, in a coarse-to-fine manner, helps detection, especially of small objects.
It is well known that context modelling plays an important role in detection both in humans [1, 4] and in AI systems [9, 11]. There is also evidence that the use of context for detection and recognition is guided by a TD process [38, 37, 4].
Contextual information includes both local context (the immediate local environment of a given object) and global context (the full scene category, or relationships between objects in the scene).
Examples of using local context in recent detection networks include [12, 53, 54, 27, 56]. These works model the local context by extracting local data about the proposed RoI (the output of the RPN), by adding larger surrounding windows, or by using nearby, automatically located, contextual regions.
Global scene categorization can also help detection [3, 39, 48] and segmentation. Along this line, the role of TD guidance in modelling and using context has also been stressed. Context and detection in fact interact in both directions, since object detection can also help global scene categorization [29, 24], and these two-way interactions have been modeled by an iterative feedback scheme [48, 28].
Modern Relational Reasoning
Relational reasoning and message passing between explicitly detected or implicitly represented objects in the image is an active and growing research area. Recent work in this area has been applied to scene understanding tasks (e.g. recognition, detection [51, 22, 40], segmentation) and to image generation tasks (GANs, restoration).
For scene understanding tasks, two general approaches exist. The first approach can be called 'object-centric', as it models the relations between existing objects, previously detected by a detection framework. In this case, a natural structure for formalizing relations is a Graph Neural Network [16, 46]; [2, 5] summarize and generalize many aspects of this growing field.
The second approach applies relational interactions directly to CNN feature-maps (in which objects are implicitly represented). In this case, a dedicated block (sometimes named non-local module, relational module or self-attention module) is integrated into the network without additional supervision, in an end-to-end learnable manner. In detection and segmentation frameworks, this block can be integrated into the network's backbone, preceding the RPN, to supply additional information for both recognition and localization tasks (example in figure 2).
We first briefly review the implementation details of the Non Local (NL) module as originally described, then present our proposed efficient ENL module, specifying our modifications in detail.
3.1 Preliminaries: the Non Local module
The formulation of the NL module is:

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \tag{1}$$

Here $x$ is the input tensor and $y$ is the output tensor. $x$ and $y$ share the same dimensions, $H \times W \times D$, where $D$ is the channels dimension, and are reshaped to the form $HW \times D$. $i$ is the current pixel under consideration, and $j$ runs over all spatial locations of the input tensor. $f(x_i, x_j)$ summarizes the similarity between every two pixels in the input tensor ($f(x_i, x_j)$ is a scalar) and $g(x_j)$ is the representation of the $j$'th spatial pixel (a vector with one entry per channel of the pixel's representation). The module sums information from all the pixels in the input tensor, weighted by their similarity to the $i$'th pixel. The similarity function $f$ can be chosen in different ways; one of the popular design choices is:

$$f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)} \tag{2}$$

where $\theta$ and $\phi$ are learned embeddings of the input pixels. In this case the normalization factor $C(x)$ takes the form of the softmax operation. A block scheme of this straightforward implementation is illustrated in figure 2(a).
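The output of the described NL module goes through a convolution and is combined with a residual connection to take the form of:

$$z_i = W_z\, y_i + x_i \tag{3}$$

where $W_z$ denotes the weights of this convolution.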
Two drawbacks of this basic implementation are its extremely expensive memory utilization and the lack of position encoding. Both of these issues are addressed next.
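The basic module of section 3.1 can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the feature map is flattened to one row per pixel, the embeddings are modeled as plain matrices standing in for 1x1 convolutions, and all names (`nl_module`, `w_theta`, `w_phi`, `w_g`, `w_z`) and sizes are hypothetical. It makes the memory issue visible: the similarity matrix `f` has one entry per pair of pixels.

```python
import numpy as np


def softmax(a, axis=-1):
    """Numerically stable softmax along one axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def nl_module(x, w_theta, w_phi, w_g, w_z):
    """Basic non-local block on a flattened feature map.

    x: (N, D) array, N = H * W pixels with D channels each.
    w_theta, w_phi, w_g: (D, Dp) matrices standing in for 1x1 convolutions.
    w_z: (Dp, D) matrix for the output convolution before the residual sum.
    """
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g   # each (N, Dp)
    f = softmax(theta @ phi.T, axis=1)                # (N, N) pairwise similarities
    y = f @ g                                         # aggregate over all pixels j
    return x + y @ w_z                                # residual connection


rng = np.random.default_rng(0)
H, W, D, Dp = 8, 8, 16, 8                             # hypothetical sizes
x = rng.standard_normal((H * W, D))
w_theta, w_phi, w_g = (0.1 * rng.standard_normal((D, Dp)) for _ in range(3))
w_z = 0.1 * rng.standard_normal((Dp, D))
z = nl_module(x, w_theta, w_phi, w_g, w_z)
print(z.shape, "similarity matrix entries:", (H * W) ** 2)
```

Even for this toy 8x8 map the intermediate `f` holds 4096 entries, and the count grows with the square of the number of pixels.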
3.2 ENL: Memory Effective Implementation
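Let us consider the case of another design choice of $f$:

$$f(x_i, x_j) = \theta(x_i)^T \phi(x_j) \tag{4}$$

In this case the full similarity matrix is created by a multiplication of two matrices, $\theta(x)$ and $\phi(x)^T$, where $\theta(x)$, $\phi(x)$ and $g(x)$ stack the per-pixel embeddings as rows. Since this matrix multiplication is immediately followed by another matrix multiplication with $g(x)$, one can simply use the associative rule to change the order of the calculation (the normalization factor is omitted here for brevity):

$$y = \left(\theta(x)\, \phi(x)^T\right) g(x) = \theta(x) \left(\phi(x)^T g(x)\right) \tag{5}$$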
Panels: (a) NL module; (b) ENL module.
This re-ordering results in a large saving in terms of memory and operations used. Consider a detection framework with a typical input image size, and a module placed on the second stage (stride 4). The inner result of sequentially multiplying the matrices (original NL, figure 2(a)) has one entry for every pair of pixels in the feature-map, whereas the reordered multiplication (ENL module) gives an inner result whose size depends only on the number of channels (a reduction of at least 4 orders of magnitude in memory utilization inside the block). The reduction in the number of operations is determined in a similar manner, see table 1. An illustration of the memory effective implementation is shown in figure 2(b).
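The effect of the re-ordering can be checked numerically. The sketch below is illustrative only: the matrix names, the scaling by the number of pixels, and the concrete sizes (an 800x1000 input, stride 4, 256 channels) are assumptions chosen for the example and are not taken from the paper's tables.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Dp = 400, 16                      # hypothetical: 400 pixels, 16 channels
theta = rng.standard_normal((N, Dp))
phi = rng.standard_normal((N, Dp))
g = rng.standard_normal((N, Dp))

# Original order: build the full (N, N) similarity matrix first.
y_nl = (theta @ phi.T) @ g / N
# Re-ordered: the inner result is only (Dp, Dp).
y_enl = theta @ (phi.T @ g) / N

# Associativity guarantees the same output up to rounding error.
print(np.allclose(y_nl, y_enl))


def inner_sizes(height, width, stride, channels):
    """Entries of the inner result for the NL order and for the ENL order."""
    n = (height // stride) * (width // stride)
    return n * n, channels * channels


# Assumed example: 800x1000 image, stride 4, 256 channels.
nl_entries, enl_entries = inner_sizes(800, 1000, 4, 256)
print(nl_entries, enl_entries, nl_entries / enl_entries)
```

Under these assumed sizes the ratio between the two inner results exceeds four orders of magnitude, in line with the estimate in the text. Note that the equality only holds when no elementwise nonlinearity (such as the exponent of equation 2) sits between the two multiplications.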
3.3 Adding Relative Position Encoding
We next consider two versions of adding position encoding to our scheme. The first is based on the norm of the relative position followed by an exponent and is applicable only when the full similarity matrix is used as an inner variable. In this case:
where the combining operator denotes either elementwise multiplication or addition. This version is illustrated in figure 4(a). The second version of position encoding is also applicable to the efficient implementation. In this case, the formulation is given by:
For the position term we use:
In this case we can again change the order of matrix multiplication, and equation 7 takes the form:
Note that adding a spatial filter in general is straightforward, but here we want to add a spatial filter in a manner that will keep the low-rank properties of the ENL module.
To construct the filter, consider for each location i the coefficients of the 2D cosine transform of a one-hot image input (one in the i'th location and zero everywhere else), arranged as a row vector, and stack these row vectors as the rows of a matrix. It follows from the orthogonality of the cosine transform that the product of this matrix with its transpose is a diagonal matrix. Truncating the coefficient vectors (taking a subset of the columns of the matrix, corresponding to the lower frequencies of the cosine transform) meets two goals:
a. the truncated matrix yields a low-rank filter that can be elegantly integrated into the ENL design (see equation 9), and
b. the resulting filter serves as a low pass filter.
A block scheme of the implementation is detailed in figure 4(b).
We used the first columns corresponding to the lowest frequencies of the 2D cos transform. Optimizing the choice of the columns can be added but is out of the scope of this paper. The resulting (sinc-like) filter is almost invariant to the spatial position; its general structure is kept although fluctuations in its height exist. An example is demonstrated in figure 4.
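The construction described in this section can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes an orthonormal DCT-II as the "cos transform", assumes the position term is added to the dot-product similarity, and uses hypothetical names (`dct_matrix`, `P`, `F`) and sizes, including the number of retained frequencies.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal 1D DCT-II matrix; row k holds the k'th cosine basis vector."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c


H, W, Dp = 8, 8, 4                  # hypothetical feature-map size and channels
ku, kv = 3, 3                       # hypothetical number of kept frequencies per axis

# Separable 2D transform on row-major flattened images; rows index frequency (u, v).
K = np.kron(dct_matrix(H), dct_matrix(W))
M = K.T                             # row i = DCT coefficients of a one-hot image at i

# Keep only the columns of M that belong to the lowest frequencies.
u, v = np.divmod(np.arange(H * W), W)
low = (u < ku) & (v < kv)
P = M[:, low]                       # (N, ku * kv) truncated coefficients
F = P @ P.T                         # (N, N) low-rank, low-pass spatial filter

rng = np.random.default_rng(0)
theta, phi, g = (rng.standard_normal((H * W, Dp)) for _ in range(3))

# Full form: explicit (N, N) similarity plus the spatial filter.
y_full = (theta @ phi.T + F) @ g
# Efficient form: both terms re-ordered, no (N, N) matrix is ever built.
y_eff = theta @ (phi.T @ g) + P @ (P.T @ g)
print(np.allclose(y_full, y_eff), np.linalg.matrix_rank(F))
```

Because the filter factors as a product of a tall matrix with its transpose, the same associativity trick used for the similarity term applies to the position term, which is what keeps the low-rank structure intact.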
We performed our experiments on the Faster R-CNN detection framework, using FPN with ResNet-50, pretrained on ImageNet, as its backbone. We implemented our models in Caffe2 using the Detectron framework. We used the standard running protocols of Faster R-CNN and adjusted our learning rates as suggested in previous work. The images were normalized to 800 pixels on their shorter axis. All the models were trained on COCO train2017 (118K images) and were evaluated on COCO val2017 (5K images).
We trained our models for 360000 iterations using a base learning rate of 0.005, reducing it by a factor of 10 after 240000 and 320000 iterations. We used SGD optimization with momentum 0.9 and a weight decay of 0.0001. We froze the BN layers in the backbone and replaced them with an affine operation, as is the common strategy when fine-tuning with a small number of images per GPU.
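The step schedule described above can be written down as a small function. This is a sketch under assumptions: the function name and signature are hypothetical, the drop is assumed to take effect at the stated iteration itself, and any warm-up phase that a training framework may apply is not modeled.

```python
def learning_rate(iteration, base_lr=0.005, steps=(240_000, 320_000), gamma=0.1):
    """Step schedule: multiply the base rate by gamma at every step already passed."""
    drops = sum(iteration >= step for step in steps)
    return base_lr * gamma ** drops


# Print the rate at a few representative iterations of the 360000-iteration run.
for it in (0, 239_999, 240_000, 320_000, 359_999):
    print(it, learning_rate(it))
```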
During inference we followed the common practice of [43, 19]. We report our results based on the standard metrics of COCO, using AP (mean average precision) and APsmall (mean average precision for small objects, with area below 32×32 pixels) as our main criteria for comparison. Further explanations of the metrics can be found in the COCO detection evaluation documentation.
Non Local Block
We placed the NL modules along the top down stream. We used three instances of the NL module in total, and located them in each stage, just before the spatial interpolation (in parallel to res5, res4 and res3 layers).
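The placement described above can be sketched as a toy top-down pass. This is an illustrative sketch only: it assumes lateral features that already share one channel count, nearest-neighbor 2x interpolation, and a simplified dot-product ENL block scaled by the number of pixels and without position encoding. The names `enl`, `upsample2x` and `top_down` are hypothetical, and details of the real FPN, such as its output convolutions, are omitted.

```python
import numpy as np


def enl(x, w_theta, w_phi, w_g, w_z):
    """Simplified efficient non-local block on an (H, W, D) feature map."""
    h, w, d = x.shape
    flat = x.reshape(h * w, d)
    theta, phi, g = flat @ w_theta, flat @ w_phi, flat @ w_g
    y = theta @ (phi.T @ g) / (h * w)        # re-ordered product, no (N, N) matrix
    return (flat + y @ w_z).reshape(h, w, d)


def upsample2x(x):
    """Nearest-neighbor spatial interpolation by a factor of 2."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def top_down(laterals, params):
    """laterals: coarsest-first list of maps; params: one weight tuple per ENL block."""
    outputs = []
    p = laterals[0]
    for lateral, weights in zip(laterals[1:], params):
        p = enl(p, *weights)                 # reasoning just before the interpolation
        outputs.append(p)
        p = upsample2x(p) + lateral          # merge with the next, finer stage
    outputs.append(p)                        # finest map, no block after it
    return outputs


rng = np.random.default_rng(0)
D, Dp = 8, 4                                 # hypothetical channel counts
laterals = [rng.standard_normal((s, s, D)) for s in (2, 4, 8, 16)]
params = [
    tuple(0.1 * rng.standard_normal(shape) for shape in [(D, Dp)] * 3 + [(Dp, D)])
    for _ in range(3)
]
outputs = top_down(laterals, params)
print([o.shape for o in outputs])
```

With four stages and three blocks, information aggregated at a coarse stage is carried by the interpolation into the next, finer stage, where the following block operates on it again.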
We initialized the blocks' weights with random Gaussian weights. We did not use additional BN layers inside the NL module (due to the relatively small minibatch), or affine layers (since no initialization is available).
5 Experiments & Results
We evaluated our framework on the task of object detection, comparing it to Faster RCNN as a baseline and demonstrating an improvement in performance.
5.1 Comparison with state of the art results
Table 2 compares the detection results of the proposed scheme (+3NL, TD, using Faster RCNN with three additional ENL modules sequentially located along the TD stream) to the baseline (Faster RCNN) and to the previously suggested variant (+1NL, BU). Adding three non-local modules along the TD stream in a coarse-to-fine manner leads to an improvement of almost 1.5 APsmall over the baseline and of almost 1 APsmall over adding a non-local module in the BU stream.
The improvement over the baseline emphasizes the potential of adding non-local modules in general: they exploit the data in the network in a way that is complementary to the convolutional layers. The improvement over +1NL, BU (a non-local module in the BU stream) can be explained by the existence of non-local modules in the shallower layers of the network, by the coarse-to-fine guidance through the TD stream and by the relative position encoding added to the scheme. We saw in the ablation studies that adding position encoding to the non-local modules on the TD stream improves the detection results. Interestingly, adding relative position encoding to the non-local module in the BU stream (third line in the table) did not improve the results.
Examples of interest are shown in Figure 6. The examples illustrate the detection of small objects that cannot be detected on their own, and are detected either through the presence of other instances of the same class (a, b) or through larger instances of related classes: (c) the man is holding two remotes in his hands, (d) the woman is holding and watching her cellphone and (e) the driver is detected inside the truck. These objects, marked in red, were not detected by Faster RCNN or by Faster RCNN with a non-local module on the BU stream (using the same threshold). In figure 7 we present the attention maps extracted from the non-local modules in response to the two images presented in figure 1. The left column shows the attention maps of the NL module on the BU stream. Here the attention is spread across the spatial grid, especially around large objects (the people on the shore and the glove). The right column shows the attention maps of the ENL module in the finest feature-map. The small objects are clearly seen, along with the local response of the low pass filter.
5.2 Ablation studies
We performed control and ablation experiments to study some aspects of the design of the ENL modules. Unless specified otherwise, all experiments were carried out using Faster RCNN with three additional non-local modules (original modules or their efficient version, with or without position encoding). Due to the relatively long training time of Faster RCNN we used smaller images, normalized to 600 pixels on their shorter axis. The rest of the training details follow section 4.
Efficient implementation normalization strategies
Referring to the formulation of the original NL module in equation 1, the normalization factor is:

$$C(x) = \sum_{\forall j} f(x_i, x_j)$$

This straightforward normalization strategy is possible only if the similarity $f$ is calculated explicitly. In the ENL design, special care must be taken in performing normalization.
Table 5 compares different normalization strategies inside the ENL module (Section 3.2). Performing no normalization at all results in a quick divergence of the training loss. For comparison, the results of the original NL module, normalized with softmax or with the alternative normalization used for the design choice of equation 4, are presented in the first two lines of the table, respectively.
From the test summarized in table 5, it appears that using softmax for normalization was sufficient, and achieved results on par with the original NL module under the design choice given by equation 4. In the rest of the paper we used this normalization strategy by default. Normalizing the original NL module with softmax yields slightly better results, in agreement with previous findings.
Adding relative position encoding
Table 4 compares four different ways to encode the relative position information (using the norm or the cos transform, with addition or multiplication; see section 3.3). The results of the same network without position encoding are presented in the first row of the table.
Table 4 shows that additive spatial attention, at least in our case, gives better results than multiplicative spatial attention. A possible explanation is that multiplicative spatial attention completely suppresses the influence of distant pixels, which goes against the purpose of the NL module.
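The possible explanation above can be illustrated with a toy example. All numbers below are hypothetical, and the Gaussian position filter is only a stand-in for the filters studied in the paper. A distant pixel with high appearance similarity is wiped out when the position term multiplies the similarity, but keeps its influence when the position term is added.

```python
import numpy as np

# Hypothetical distances of four pixels from the query pixel.
dist = np.array([0.0, 1.0, 5.0, 20.0])
# Hypothetical appearance similarities; the most distant pixel is the most similar.
similarity = np.array([0.2, 0.1, 0.3, 0.9])
# Stand-in position filter: a Gaussian with sigma = 2 (an assumption).
position = np.exp(-dist ** 2 / (2 * 2.0 ** 2))

multiplicative = similarity * position   # distant pixel is suppressed to ~0
additive = similarity + position         # distant pixel keeps its similarity

print(multiplicative)
print(additive)
```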
Table 4 also demonstrates that a spatial filter based on the separable cos transform performs slightly better than the counterpart version, based on the norm. Conveniently, this is also the spatial filter that can be integrated into the more compact implementation of the ENL module.
Adding the NL block on the TD stream
Table 4 compares several variants of the proposed network (+3NL, TD, last 3 lines) to the baselines.
The table shows that the efficient module (lines 5,6) is an attractive alternative to the original non-local module (line 4), and that position encoding (line 6) further improves it.
The improvement over the baseline is even higher than the improvement shown in table 2, possibly because the object-area statistics in the dataset are different for smaller images. The improvement is not just a matter of more parameters: Table 4 shows that adding 5x5 convolution layers along the TD stream, in the same positions, does not yield a distinct improvement in results.
Figure 7 panels: (a) NL, BU; (b) ENL, TD3.
We examined the possible use of several non-local modules, arranged hierarchically along the top-down stream, to exploit the effects of context and relations among objects. We compared our method with the previous use of a non-local module placed on the bottom-up network, and showed a 1 AP improvement in small object detection. We suggest that this improvement is enabled by the coarse-to-fine use of pairwise location information and show visual evidence in support of this possibility.
In practice, applying the non-local module to large feature maps is a memory demanding operation. We dealt with this difficulty and introduced ENL, an attractive alternative to the Non Local block, which is efficient in terms of memory and operations, and which integrates the use of relative spatial information. The ENL allows the use of a non-local module in a general encoder-decoder framework and, consequently, might contribute in future work to a wide range of applications (segmentation, image generation etc.).
- [1] M. Bar. Visual objects in context. Nature Reviews Neuroscience, 5(8):617, 2004.
- [2] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
- [3] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2874–2883, 2016.
- [4] I. Biederman. On the semantics of a glance at a scene. In Perceptual organization, pages 213–253. Routledge, 2017.
- [5] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
- [6] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision, pages 354–370. Springer, 2016.
- [7] X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta. Iterative visual reasoning beyond convolutions. arXiv preprint arXiv:1803.11189, 2018.
- [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
- [9] S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and M. Hebert. An empirical study of context in object detection. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1271–1278. IEEE, 2009.
- [10] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
- [11] C. Galleguillos and S. Belongie. Context based object categorization: A critical survey. Computer vision and image understanding, 114(6):712–722, 2010.
- [12] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware cnn model. In Proceedings of the IEEE International Conference on Computer Vision, pages 1134–1142, 2015.
- [13] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
- [14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
- [15] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
- [16] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In Neural Networks, 2005. IJCNN'05. Proceedings. 2005 IEEE International Joint Conference on, volume 2, pages 729–734. IEEE, 2005.
- [17] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- [18] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Object instance segmentation and fine-grained localization using hypercolumns. IEEE Transactions on Pattern Analysis & Machine Intelligence, (4):627–639, 2017.
- [19] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
- [20] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European conference on computer vision, pages 346–361. Springer, 2014.
- [21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [22] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. Relation networks for object detection. arXiv preprint arXiv:1711.11575, 8, 2017.
- [23] W.-C. Hung, Y.-H. Tsai, X. Shen, Z. L. Lin, K. Sunkavalli, X. Lu, and M.-H. Yang. Scene parsing with global context embedding. In ICCV, pages 2650–2658, 2017.
- [24] S. A. Javed and A. K. Nelakanti. Object-level context modeling for scene classification with context-cnn. arXiv preprint arXiv:1705.04358, 2017.
- [25] T. Kong, A. Yao, Y. Chen, and F. Sun. Hypernet: Towards accurate region proposal generation and joint object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 845–853, 2016.
- [26] B. Li, T. Wu, L. Zhang, and R. Chu. Auto-context r-cnn. arXiv preprint arXiv:1807.02842, 2018.
- [27] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and S. Yan. Attentive contexts for object detection. IEEE Transactions on Multimedia, 19(5):944–954, 2017.
- [28] K. Li, B. Hariharan, and J. Malik. Iterative instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3659–3667, 2016.
- [29] Y. Liao, S. Kodagoda, Y. Wang, L. Shi, and Y. Liu. Understand scene categories by objects: A semantic regularized scene classifier using convolutional neural networks. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 2318–2325. IEEE, 2016.
- [30] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 4, 2017.
- [31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Detection evaluation.
- [32] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
- [33] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang. Non-local recurrent network for image restoration. arXiv preprint arXiv:1806.02919, 2018.
- [34] S. Liu, D. Huang, and Y. Wang. Receptive field block net for accurate and fast object detection. arXiv preprint arXiv:1711.07767, 2017.
- [35] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
- [36] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
- [37] A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in brain research, 155:23–36, 2006.
- [38] A. Oliva, A. Torralba, M. S. Castelhano, and J. M. Henderson. Top-down control of visual attention in object detection. In Image processing, 2003. icip 2003. proceedings. 2003 international conference on, volume 1, pages I–253. IEEE, 2003.
- [39] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C.-C. Loy, et al. Deepid-net: Deformable deep convolutional neural networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2403–2412, 2015.
- [40] D. Raposo, A. Santoro, D. Barrett, R. Pascanu, T. Lillicrap, and P. Battaglia. Discovering objects and their relations from entangled scene representations. arXiv preprint arXiv:1702.05068, 2017.
- [41] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
- [42] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
- [43] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):1137–1149, 2017.
- [44] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
- [45] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4967–4976, 2017.
- [46] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
- [47] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue. Dsod: Learning deeply supervised object detectors from scratch. In The IEEE International Conference on Computer Vision (ICCV), volume 3, page 7, 2017.
- [48] A. Shrivastava and A. Gupta. Contextual priming and feedback for faster r-cnn. In European Conference on Computer Vision, pages 330–348. Springer, 2016.
- [49] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851, 2016.
- [50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
- [51] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. arXiv preprint arXiv:1711.07971, 10, 2017.
- [52] Y. Yuan and J. Wang. Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916, 2018.
- [53] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár. A multipath network for object detection. arXiv preprint arXiv:1604.02135, 2016.
- [54] X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou, B. Yang, Z. Wang, et al. Crafting gbd-net for object detection. IEEE transactions on pattern analysis and machine intelligence, 40(9):2109–2123, 2018.
- [55] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
- [56] Y. Zhu, C. Zhao, J. Wang, X. Zhao, Y. Wu, H. Lu, et al. Couplenet: Coupling global structure with local parts for object detection. In Proc. of Int'l Conf. on Computer Vision (ICCV), volume 2, 2017.