NAS-FCOS: Fast Neural Architecture Search for Object Detection

NW, YG, HC contributed to this work equally.


The success of deep neural networks relies on significant architecture engineering. Recently, neural architecture search (NAS) has emerged as a promising way to greatly reduce manual effort in network design by automatically searching for optimal architectures, although such algorithms typically need an excessive amount of computational resources, e.g., a few thousand GPU-days. To date, on challenging vision tasks such as object detection, NAS, especially fast versions of NAS, is less studied. Here we propose to search for the decoder structure of object detectors with search efficiency being taken into consideration. To be more specific, we aim to efficiently search for the feature pyramid network (FPN) as well as the prediction head of a simple anchor-free object detector, namely FCOS [24], using a tailored reinforcement learning paradigm. With a carefully designed search space, search algorithms and strategies for evaluating network quality, we are able to efficiently search a top-performing detection architecture within days using V100 GPUs. The discovered architecture surpasses state-of-the-art object detection models (such as Faster R-CNN, RetinaNet and FCOS) by to points in AP on the COCO dataset, with comparable computation complexity and memory footprint, demonstrating the efficacy of the proposed NAS for object detection.

1 Introduction

Object detection is one of the fundamental tasks in computer vision, and has been researched extensively. In the past few years, state-of-the-art methods for this task have been based on deep convolutional neural networks (such as Faster R-CNN [20], RetinaNet [11]), due to their impressive performance. Typically, the designs of object detection networks are much more complex than those for image classification, because the former need to localize and classify multiple objects in an image simultaneously while the latter only need to output image-level labels. Due to their complex structure and numerous hyper-parameters, designing effective object detection networks is more challenging and usually needs much manual effort.

On the other hand, Neural Architecture Search (NAS) approaches [4, 17, 32] have been showing impressive results on automatically discovering top-performing neural network architectures in large-scale search spaces. Compared to manual designs, NAS methods are data-driven instead of experience-driven, and hence need much less human intervention. As defined in [3], the workflow of NAS can be divided into the following three processes: 1) sampling an architecture from a search space following some search strategy; 2) evaluating the performance of the sampled architecture; and 3) updating the parameters based on the performance.
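The three-step workflow can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy search space, the stand-in `evaluate` function (which replaces real proxy-task training) and the REINFORCE-style update with a moving-average baseline, which is simpler than the PPO update the paper uses later.

```python
import math
import random

# Hypothetical toy search space: one categorical choice per decision slot.
SEARCH_SPACE = [["conv3", "conv5", "skip"], ["sum", "concat"]]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_architecture(logits, rng):
    """Step 1: sample one choice index per slot from the current policy."""
    return [rng.choices(range(len(slot)), weights=softmax(slot))[0] for slot in logits]


def evaluate(arch):
    """Step 2: stand-in for proxy-task training; returns a scalar reward."""
    # Toy reward that prefers index 1 in every slot (purely illustrative).
    return sum(1.0 for idx in arch if idx == 1)


def update(logits, arch, reward, baseline, lr=0.1):
    """Step 3: REINFORCE-style update of the policy logits (in place)."""
    advantage = reward - baseline
    for slot, chosen in zip(logits, arch):
        probs = softmax(slot)
        for i in range(len(slot)):
            grad = (1.0 if i == chosen else 0.0) - probs[i]
            slot[i] += lr * advantage * grad


def search(num_rounds=300, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * len(slot) for slot in SEARCH_SPACE]
    baseline = 0.0
    for _ in range(num_rounds):
        arch = sample_architecture(logits, rng)
        reward = evaluate(arch)
        update(logits, arch, reward, baseline)
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
    return logits
```

In a real search, `evaluate` dominates the cost, which is why the remainder of the paper concentrates on making that step cheap.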

One of the main problems prohibiting NAS from being used in more realistic applications is its search efficiency. The evaluation process is the most time-consuming part because it involves a full training procedure of a neural network. To reduce the evaluation time, in practice a proxy task is often used as a lower-cost substitute. In the proxy task, the input, network parameters and training iterations are often scaled down to speed up the evaluation. However, there is often a performance gap for samples between the proxy tasks and target tasks, which makes the evaluation process biased. Designing proxy tasks that are both accurate and efficient for specific problems is challenging. Another solution to improve search efficiency is constructing a supernet that covers the complete search space and training candidate architectures with shared parameters [14, 18]. However, this solution leads to significantly increased memory consumption and restricts itself to small-to-moderate sized search spaces.

To our knowledge, efficient and accurate NAS approaches for object detection networks have rarely been studied, despite their significant importance. To this end, we present a fast and memory-saving NAS method for object detection networks, which is capable of discovering top-performing architectures within significantly reduced search time. Our overall detection architecture is based on FCOS [24], a simple anchor-free one-stage object detection framework, in which the feature pyramid network and prediction head are searched using our proposed NAS method.

Our main contributions are summarized as follows.

  • In this work, we propose a fast and memory-efficient NAS method for searching both FPN and head architectures, with carefully designed proxy tasks, search space and evaluation strategies, which is able to find top-performing architectures over architectures using GPU-days only.

    Specifically, this high efficiency is enabled by the following designs:

    Developing a fast proxy task training scheme by skipping the backbone finetuning stage;

    Adopting a progressive search strategy to reduce the time cost incurred by the extended search space;

    Using a more discriminative criterion for evaluation of searched architectures;

    Employing an efficient anchor-free one-stage detection framework with simple post-processing.

  • Using NAS, we explore the workload relationship between FPN and head, and demonstrate the importance of weight sharing in the head.

  • We show that the overall structure of NAS-FCOS is general and flexible in that it can be equipped with various backbones including MobileNetV, ResNet-, ResNet- and ResNeXt-, and surpasses state-of-the-art object detection algorithms using comparable computation complexity and memory footprint. More specifically, our model can improve the AP by points on all the above models compared with their FCOS counterparts.

2 Related Work

2.1 Object Detection

The frameworks of deep neural networks for object detection can be roughly categorized into two types: one-stage detectors [12] and two-stage detectors [6, 20].

Two-stage detection frameworks first generate class-independent region proposals using a region proposal network (RPN), and then classify and refine them using extra detection heads. In spite of achieving top performance, the two-stage methods have noticeable drawbacks: they are computationally expensive and have many hyper-parameters that need to be tuned to fit a specific dataset.

In comparison, the structures of one-stage detectors are much simpler. They directly predict object categories and bounding boxes at each location of feature maps generated by a single CNN backbone.

Note that most state-of-the-art object detectors (including both one-stage detectors [12, 16, 19] and two-stage detectors [20]) make predictions based on anchor boxes of different scales and aspect ratios at each convolutional feature map location. However, the use of anchor boxes may lead to a high imbalance between object and non-object examples and introduces extra hyper-parameters. More recently, anchor-free one-stage detectors [9, 10, 24, 29, 30] have attracted increasing research interest, due to their simple fully convolutional architectures and reduced consumption of computational resources.

2.2 Neural Architecture Search

NAS is usually time consuming. We have seen great improvements from GPU-days [32] to GPU-day [28]. The trick is to first construct a supernet containing the complete search space and train the candidates all at once with bi-level optimization and efficient weight sharing [13, 14]. But the large memory allocation and difficulties in approximated optimization prohibit the search for more complex structures.

Recently, researchers [1, 5, 23] have proposed to apply single-path training to reduce the bias introduced by approximation and model simplification of the supernet. DetNAS [2] follows this idea to search for an efficient object detection architecture. One limitation of the single-path approach is that the search space is restricted to a sequential structure. Single-path sampling and straight-through estimation of the weight gradients introduce large variance into the optimization process and prohibit the search for more complex structures under this framework. Within this very simple search space, NAS algorithms can only make trivial decisions such as kernel sizes for manually designed modules.

Object detection models differ from single-path image classification networks in the way they merge multi-level features and distribute the task to parallel prediction heads. Feature pyramid networks (FPNs) [4, 8, 11, 15, 27], designed to handle this job, play an important role in modern object detection models. NAS-FPN [4] targets searching for an FPN alternative based on the one-stage framework RetinaNet [12]. Feature pyramid architectures are sampled with a recurrent neural network (RNN) controller, which is trained with reinforcement learning (RL). However, the search is very time-consuming even though a proxy task with a ResNet-10 backbone is trained to evaluate each architecture.

Since all three lines of work ([2, 4] and ours) focus on object detection frameworks, we summarize the differences among them: DetNAS [2] aims to search for better backbone designs, NAS-FPN [4] searches the FPN structure, and our search space contains both the FPN and the head structure.

To speed up reward evaluation of RL-based NAS, the work of [17] proposes to use progressive tasks and other training acceleration methods. By caching the encoder features, they are able to train semantic segmentation decoders with very large batch sizes very efficiently. In the sequel of this paper, we refer to this technique as fast decoder adaptation. However, directly applying this technique to object detection tasks does not yield a similar speed boost, because existing detectors either do not use a fully convolutional model [11] or require complicated post-processing that does not scale with the batch size [12].

To reduce the post-processing overhead, we resort to a recently introduced anchor-free one-stage framework, namely FCOS [24], which significantly improves the search efficiency by eliminating the processing time of anchor-box matching in RetinaNet.

Compared to its anchor-based counterpart, FCOS significantly reduces the training memory footprint while being able to improve the performance.

3 Our Approach

In our work, we search for anchor-free fully convolutional detection models with fast decoder adaptation. Thus, NAS methods can be easily applied.

3.1 Problem Formulation

We base our search algorithm upon the one-stage framework FCOS due to its simplicity. Our training tuples $\{(x, Y)\}$ consist of input image tensors $x$ of size $(3 \times H \times W)$ and FCOS output targets $Y$ in a pyramid representation, which is a list of tensors $y_l$ each of size $((K + 4 + 1) \times H_l \times W_l)$, where $H_l \times W_l$ is the feature map size on level $l$ of the pyramid. $(K + 4 + 1)$ is the number of output channels of FCOS; the three terms are the length-$K$ one-hot classification labels, the $4$ bounding box regression targets and the $1$ centerness factor, respectively.
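As a rough illustration of the pyramid target layout, the sketch below computes the per-level tensor shapes. It assumes the usual FCOS convention of one class-score channel per category, four box regression channels and one centerness channel, pyramid levels 3 to 7 with stride $2^l$ on level $l$, and ceiling rounding for sizes that are not divisible by the stride. The function name and the rounding rule are illustrative assumptions, not taken from the paper.

```python
import math


def fcos_target_shapes(height, width, num_classes, levels=(3, 4, 5, 6, 7)):
    """Return the (channels, H_l, W_l) shape of the target tensor on each level."""
    channels = num_classes + 4 + 1  # class labels + box offsets + centerness
    shapes = {}
    for level in levels:
        stride = 2 ** level  # assumed stride of pyramid level `level`
        shapes[level] = (
            channels,
            math.ceil(height / stride),
            math.ceil(width / stride),
        )
    return shapes
```

For example, with a 512 x 512 input and a 20-class dataset, level 3 would hold a 25 x 64 x 64 tensor and level 7 a 25 x 4 x 4 tensor under these assumptions.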

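The network $g: x \rightarrow \hat{Y}$ in the original FCOS consists of three parts: a backbone $b$, an FPN $f$ and multi-level subnets that we call prediction heads $h$ in this paper. First, the backbone $b: x \rightarrow C$ maps the input tensor to a set of intermediate-level features $C = \{c_3, c_4, c_5\}$, with resolution $(H_i \times W_i) = (H/2^i \times W/2^i)$. Then the FPN $f: C \rightarrow P$ maps the features to a feature pyramid $P = \{p_3, p_4, p_5, p_6, p_7\}$. Finally, the prediction head $h: p \rightarrow y$ is applied to each level of $P$ and the results are collected to create the final prediction. To avoid overfitting, the same $h$ is often applied to all instances in $P$.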

Since objects of different scales require different effective receptive fields, the mechanism to select and merge intermediate-level features is particularly important in object detection network design. Thus, most research [16, 20] is carried out on designing the FPN $f$ and the head $h$ while using widely adopted backbone structures such as ResNet [7]. Following this principle, our search goal is to decide when to choose which features from the backbone feature set $C$ and how to merge them.

To improve efficiency, we reuse the parameters of the backbone $b$ pretrained on the target dataset and search for the optimal structures after that. For convenience in the following discussion, we call the network components to search for, namely the FPN $f$ and the head $h$, together the decoder structure of the object detection network.

$f$ and $h$ take care of different parts of the detection job. The FPN $f$ extracts features targeting different object scales in the pyramid representation $P$, while the head $h$ is a unified mapping applied to each feature in $P$ to avoid overfitting. In practice, people seldom discuss the possibility of using a more diversified $f$ to extract features at different levels or how many layers in $h$ need to be shared across the levels. In this work, we use NAS as an automatic method to test these possibilities.

3.2 Search Space

Considering the different functions of the FPN $f$ and the head $h$, we apply a separate search space to each. Given the particularity of the FPN structure, a basic block with a new overall connection and output design for $f$ is built for it. For simplicity, a sequential space is applied for the $h$ part.

We replace the cell structure with atomic operations to provide even more flexibility. To construct one basic block, we first choose two layers $x_1$, $x_2$ from the sampling pool $X$ at id1, id2, then two operations op1, op2 are applied to each of them and an aggregation operation agg merges the two outputs into one feature. To build a deep decoder structure, we apply multiple basic blocks with their outputs added to the sampling pool. Our basic block $bb_t: X_{t-1} \rightarrow X_t$ at time step $t$ transforms the sampling pool $X_{t-1}$ to $X_t = X_{t-1} \cup \{x_t\}$, where $x_t$ is the output of $bb_t$.

ID Description
0 separable conv
1 separable conv with dilation rate
2 separable conv with dilation rate
3 skip-connection
4 deformable convolution
Table 1: Unary operations used in the search process.

The candidate operations are listed in Table 1. We include only separable/depth-wise convolutions so that the decoder can be efficient. In order to enable the decoder to apply convolutional filters on irregular grids, here we have also included deformable convolutions [31]. For the aggregation operations, we include element-wise sum and concatenation followed by a convolution.
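The basic block configuration (id1, id2, op1, op2, agg) and the growing sampling pool can be sketched as a small data structure. This is a hypothetical encoding for illustration only: the class and function names are made up, the operation labels abbreviate the five rows of Table 1, and uniform random sampling stands in for the learned controller.

```python
import random
from dataclasses import dataclass

# Unary op IDs follow Table 1; the short labels are illustrative.
UNARY_OPS = {
    0: "sep_conv",
    1: "sep_conv_dilated_a",
    2: "sep_conv_dilated_b",
    3: "skip",
    4: "deform_conv",
}
AGG_OPS = ("sum", "concat")


@dataclass(frozen=True)
class BasicBlock:
    id1: int  # index of the first input in the sampling pool
    id2: int  # index of the second input in the sampling pool
    op1: int  # unary op ID applied to the first input
    op2: int  # unary op ID applied to the second input
    agg: str  # aggregation op merging the two results


def sample_decoder(num_inputs, num_blocks, rng):
    """Sample a chain of basic blocks; each output joins the sampling pool."""
    blocks = []
    pool_size = num_inputs
    for _ in range(num_blocks):
        blocks.append(
            BasicBlock(
                id1=rng.randrange(pool_size),
                id2=rng.randrange(pool_size),
                op1=rng.choice(list(UNARY_OPS)),
                op2=rng.choice(list(UNARY_OPS)),
                agg=rng.choice(AGG_OPS),
            )
        )
        pool_size += 1  # the new block output becomes selectable
    return blocks
```

Because each block may only read from indices that already exist in the pool, any sampled sequence describes a valid directed acyclic graph.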

The decoder configuration can be represented by a sequence with three components, FPN configuration, head configuration and weight sharing stages. We provide detailed descriptions to each of them in the following sections. The complete diagram of our decoder structure is shown in Fig. 1.

Figure 1: A conceptual example of our NAS-FCOS decoder. It consists of two sub-networks, an FPN and a set of prediction heads which have shared structures. One notable difference from other FPN-based one-stage detectors is that our heads have partially shared weights. Only the last several layers of the prediction heads (marked in yellow) are tied by their weights. The number of layers to share is decided automatically by the search algorithm. Note that both FPN and head are in our actual search space and have more layers than shown in this figure. The figure is for illustration only.

FPN Search Space

As mentioned above, the FPN $f$ maps the convolutional features $C$ to $P$. First, we initialize the sampling pool as $X_0 = C$. Our FPN is defined by applying the basic block $T$ times to the sampling pool, $f := bb_1^f \circ bb_2^f \circ \cdots \circ bb_T^f$. To yield the pyramid features $P$, we collect the last three basic block outputs $\{x_{T-2}, x_{T-1}, x_T\}$ as $\{p_3, p_4, p_5\}$.

To allow shared information across all layers, we use a simple rule to create global features. If there is some dangling layer $x_t$ which is neither sampled by later blocks $\{bb_i^f \mid i > t\}$ nor among the last three layers, we use element-wise addition to merge it into all output features:

\[ p_i^* = p_i + x_t, \quad i \in \{3, 4, 5\}. \qquad (1) \]

As with the aggregation operations, if the features have different resolutions, the smaller one is upsampled with bilinear interpolation.

To be consistent with FCOS, $p_6$ and $p_7$ are obtained via a stride-2 convolution on $p_5$ and $p_6$, respectively.
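The dangling-layer rule can be sketched as follows. The indexing scheme and both function names are assumptions for illustration: backbone features occupy the first pool indices, block $t$ writes to the next free index, and features are plain lists of equal length, so the resolution matching by bilinear upsampling is left out.

```python
def find_dangling_blocks(block_inputs, num_inputs, num_outputs=3):
    """Return 1-based steps of blocks whose output is never consumed.

    block_inputs[t-1] holds the two pool indices (id1, id2) read by block t.
    Pool indices 0..num_inputs-1 are backbone features; block t writes to
    pool index num_inputs + t - 1. The last `num_outputs` blocks are the
    pyramid outputs and are never treated as dangling.
    """
    num_blocks = len(block_inputs)
    used = set()
    for id1, id2 in block_inputs:
        used.update((id1, id2))
    dangling = []
    for t in range(1, num_blocks - num_outputs + 1):
        if num_inputs + t - 1 not in used:
            dangling.append(t)
    return dangling


def merge_dangling(outputs, dangling_features):
    """Element-wise add every dangling feature to every output feature."""
    merged = []
    for out in outputs:
        acc = list(out)
        for feat in dangling_features:
            acc = [a + b for a, b in zip(acc, feat)]
        merged.append(acc)
    return merged
```

A block can only be read by blocks that come after it, so collecting all consumed indices in one pass is enough to detect outputs that no later block samples.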

Prediction Head Search Space

The prediction head $h$ maps each feature in the pyramid $P$ to the corresponding output $y$; in FCOS and RetinaNet, it consists of four convolutions. To explore the potential of the head, we therefore extend a sequential search space for its generation. Specifically, our head is defined as a sequence of six basic operations. Compared with the candidate operations in the FPN structures, the head search space has two slight differences. First, we add standard convolution modules (including convx and convx) to the head sampling pool for better comparison. Second, we follow the design of FCOS by replacing all the Batch Normalization (BN) layers with Group Normalization (GN) [25] in the operation sampling pool of the head, considering that the head needs to share weights between different levels, which makes BN invalid. The final output of the head is the output of the last (sixth) layer.

Searching for Head Weight Sharing

To add even more flexibility and understand the effect of weight sharing in prediction heads, we further add an index $i$ as the location where the prediction head starts to share weights. For every layer before stage $i$, the head $h$ creates an independent set of weights for each FPN output level; from stage $i$ onward, it uses global weights shared across levels.

By viewing the independent part of the heads as an extension of the FPN branches and the shared part as a head of adaptive length, we can further balance the workload between each individual FPN branch, which extracts level-specific features, and the prediction head shared across all levels.
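The effect of the sharing index can be made concrete with a short sketch that assigns a weight-set name to every (level, layer) pair. The naming scheme and function names are hypothetical; only the rule itself (independent weights before the index, global weights from the index onward) comes from the text.

```python
def head_weight_keys(num_layers, share_from, levels):
    """Map (level, layer) to the name of the weight set it uses.

    Layers with index < share_from get one weight set per pyramid level;
    layers with index >= share_from reuse a single global weight set.
    """
    keys = {}
    for level in levels:
        for layer in range(num_layers):
            if layer < share_from:
                keys[(level, layer)] = f"level{level}/layer{layer}"
            else:
                keys[(level, layer)] = f"shared/layer{layer}"
    return keys


def count_weight_sets(num_layers, share_from, levels):
    """Number of distinct weight sets the head has to store."""
    return len(set(head_weight_keys(num_layers, share_from, levels).values()))
```

With a six-layer head over five pyramid levels, a sharing index of 0 gives 6 weight sets (fully shared), an index of 6 gives 30 (fully independent), and an index of 2 gives 14.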

3.3 Search Strategy

An RL-based strategy is applied to the search process. We rely on an LSTM-based controller to predict the full configuration. We use a progressive search strategy rather than a joint search over both the FPN structure and the prediction head, since the former requires fewer computing resources and less time than the latter. The training dataset is randomly split into a meta-train and a meta-val subset. To speed up training, we fix the backbone network and cache the pre-computed backbone output $C$. This makes the cost of training a single architecture independent of the depth of the backbone network. Taking advantage of this, we can apply much more complex backbone structures and utilize high-quality multilevel features as our decoder's input. We find that the process of backbone finetuning can be skipped if the cached features are powerful enough. Speedup techniques such as Polyak weight averaging are also applied during training.
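The idea of caching the frozen backbone output can be sketched with a simple memoising wrapper. This is only an illustration under stated assumptions: the class name and interface are hypothetical, `backbone_fn` stands for any frozen feature extractor, and each image ID is assumed to map to one fixed input, so random augmentation of the cached features is not modelled.

```python
class CachedBackbone:
    """Wrap a frozen feature extractor and memoise its output per image ID."""

    def __init__(self, backbone_fn):
        self.backbone_fn = backbone_fn
        self.cache = {}
        self.num_forward_passes = 0

    def features(self, image_id, image):
        # Run the expensive backbone at most once per image.
        if image_id not in self.cache:
            self.cache[image_id] = self.backbone_fn(image)
            self.num_forward_passes += 1
        return self.cache[image_id]
```

Every sampled decoder then trains on the cached features, so the backbone cost is paid once per image rather than once per image per architecture.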

The most widely used detection metric is average precision (AP). However, due to the difficulty of the object detection task, AP at the early stages is too low to tell the good architectures from the bad ones, which makes the controller take much more time to converge. To make the architecture evaluation process easier even at the early stages of training, we therefore use the negative loss sum as the reward for a sampled architecture $a$ instead of average precision:

\[ R(a) = -(L_{cls} + L_{reg} + L_{ctr}), \qquad (2) \]

where $L_{cls}$, $L_{reg}$, $L_{ctr}$ are the three loss terms in FCOS (classification, bounding box regression and centerness). The gradient of the controller is estimated via proximal policy optimization (PPO) [22].
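A minimal sketch of the reward and of the standard PPO clipped surrogate for a single sample is given below. The function names are illustrative, and the clipping range of 0.2 is the common default from the PPO literature, assumed here because the text does not state the controller's hyper-parameters.

```python
def reward(loss_cls, loss_reg, loss_ctr):
    """Negative sum of the three FCOS loss terms measured on meta-val."""
    return -(loss_cls + loss_reg + loss_ctr)


def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample.

    ratio is pi_new(a) / pi_old(a) for the sampled architecture a, and
    advantage is its reward minus a baseline.
    """
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

Unlike AP, this reward is informative even when a partially trained decoder detects almost nothing, since the losses still vary smoothly between architectures.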

4 Experiments

4.1 Implementation Details

Searching Phase

We design a fast proxy task for evaluating the decoder architectures sampled in the searching phase. PASCAL VOC is selected as the proxy dataset, which contains training images with object bounding box annotations of classes. Because the search and full training phases use different datasets, the results also illustrate the transfer capacity of the searched structures. The VOC training set is randomly split into a meta-train set with images and a meta-val set with images. For each sampled architecture, we train it on meta-train and compute the reward (2) on meta-val. Input images are resized to short size and then randomly cropped to . Target object sizes of interest are scaled correspondingly. We use Adam optimizer with learning rate e and batch size . Polyak averaging is applied with the decay rates of . The decoder is evaluated after iterations. As we use fast decoder adaptation, the backbone features are fixed and cached during the search phase. To enhance the cached backbone features, we first initialize the backbone with pre-trained weights provided by the open-source implementation of FCOS and then finetune it on VOC using the training strategies of FCOS. Note that the above finetuning process is only performed once at the beginning of the search phase.
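Polyak averaging keeps an exponential moving average of the decoder weights and evaluates with the averaged copy. The sketch below uses flat lists of floats as parameters; the class name is hypothetical and the decay value in the usage note is a placeholder, not the value used in the paper.

```python
class PolyakAverager:
    """Exponential moving average of a flat list of parameters."""

    def __init__(self, params, decay):
        self.decay = decay
        self.average = list(params)

    def update(self, params):
        # average <- decay * average + (1 - decay) * current parameters
        d = self.decay
        self.average = [d * a + (1.0 - d) * p for a, p in zip(self.average, params)]
        return self.average
```

For instance, starting from `[0.0, 0.0]` with a hypothetical decay of 0.9, one update with parameters `[1.0, 2.0]` moves the average to roughly `[0.1, 0.2]`.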

A progressive strategy is used for the search of the FPN $f$ and the head $h$. We first search for the FPN part and retain the original head. All operations in the FPN structure have output channels. The decoder inputs are resized to fit output channel width of FPN via convolutions. After this step, the searched FPN structure is fixed and the second stage, searching for the head, is started based on it. Most parameters for searching the head are identical to those for searching the FPN structure, with the exception that the output channel width is adjusted from to to deliver more information.

For the FPN search part, the controller model nearly converged after searching over K architectures on the proxy task as shown in Fig. 2. Then, the top- best performing architectures on the proxy task are selected for the next full training phase. For the head search part, we choose the best searched FPN among the top- architectures and pre-fetch its features. It takes about rounds for the controller to nearly converge, which is much faster than that for searching FPN architectures. After that, we select for full training the top- heads that achieve best performance on the proxy task. In total, the whole search phase can be finished within days using V100 GPUs.

Decoder Backbone FLOPs (G) Params (M) AP
FPN-RetinaNet @ MobileNetV
FPN-FCOS @ MobileNetV
NAS-FCOS (ours) @ MobileNetV
NAS-FCOS (ours) @- MobileNetV
NAS-FCOS (ours) @ MobileNetV
FPN-RetinaNet @ R-
NAS-FCOS (ours) @ R-
NAS-FCOS (ours) @- R-
NAS-FCOS (ours) @ R-

FPN-RetinaNet @ R-
NAS-FCOS (ours) @ R-
FPN-FCOS @ X-xd-
NAS-FCOS (ours) @- X-xd-
FPN-FCOS @ w/improvements X-xd-
NAS-FCOS (ours) @- w/improvements X-xd-
Table 2: Results on test-dev set of MS COCO after full training. R- and R- represents ResNet backbones and X-xd- represents ResNeXt- (d). All networks share the same input image resolution. FLOPs and parameters are being measured on , which is the median of the input size on COCO. For RetinaNet and FCOS, we use official models provided by the authors. For our NAS-FCOS, @ and @ means that the decoder channel width is and respectively. @- is the decoder with FPN width and head width. The same improving tricks used on the newest FCOS version are used in our model for fair comparison.
Figure 2: Performance of reward during the proxy task, which has been growing throughout the process, indicating that the model of reinforcement learning works.

Full Training Phase

In this phase, we fully train the searched models on the MS COCO training dataset, and select the best one by evaluating them on MS COCO validation images. Note that our training configurations are exactly the same as those in FCOS for fair comparison. Input images are resized to short size and the maximum long side is set to be . The models are trained using V100 GPUs with batch size for K iterations. The initial learning rate is and reduces to one tenth at the K-th and K-th iterations. The improving tricks are applied only on the final model (w/improv).
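The step schedule described above (dividing the learning rate by ten at two milestones) can be written as a small function. The name `step_lr` and the numbers in the usage note are hypothetical placeholders, since the concrete values are not reproduced here.

```python
def step_lr(iteration, base_lr, milestones, gamma=0.1):
    """Learning rate after applying one decay of `gamma` per passed milestone."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * (gamma ** drops)
```

With a placeholder base rate of 0.01 and milestones at iterations 60 and 80, the rate would be 0.01 before iteration 60, 0.001 from iteration 60, and 0.0001 from iteration 80.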

Figure 3: Our discovered FPN structure. is omitted from this figure since it is not chosen by this particular structure during the search process.

4.2 Search Results

The best FPN structure is illustrated in Fig. 3. The controller identifies that deformable convolution and concatenation are the best performing operations for unary and aggregation respectively. From Fig. 4, we can see that the controller chooses to use operations (with two skip connections), rather than the maximum allowed operations. Note that the discovered “dconv + x conv” structure achieves a good trade-off between accuracy and FLOPs. Compared with the original head, our searched head has fewer FLOPs/Params (FLOPs G vs. G, Params M vs. M) and significantly better performance (AP vs. ).

Figure 4: Our discovered Head structure.

We use the searched decoder together with either light-weight backbones such as MobileNet-V2 [21] or more powerful backbones such as ResNet- [7] and ResNeXt- [26]. To balance the performance and efficiency, we implement three decoders with different computation budgets: one with feature dimension of (@), one with (@) and another with FPN channel width and prediction head (@-). The results on the COCO test-dev with short side being are shown in Table 2. The searched decoder with feature dimension of (@) surpasses its FCOS counterpart by to points in AP under different backbones. The one with channels (@) has significantly reduced parameters and calculation, making it more suitable for resource-constrained environments. In particular, our searched model with channels and MobileNetV2 backbone surpasses the original FCOS with the same backbone by AP points with only FLOPS. The third type of decoder (@-) achieves a good balance between accuracy and parameters. Note that our searched model outperforms the strongest FCOS variant by AP points ( vs. ) with slightly smaller FLOPs and Params. The comparison of FLOPs and number of parameters with other models is illustrated in Fig. 7 and Fig. 8 respectively.

Figure 5: Trend graph of head weight sharing during search. The coordinates in the horizontal axis represent the number of the statistical period. A period consists of head structures. The vertical axis represents the proportion of heads that fully share weights in structures.

In order to understand the importance of weight sharing in head, we add the number of layers shared by weights as an object of the search. Fig. 5 shows a trend graph of head weight sharing during search. We set structures as a statistical cycle. As the search deepens, the proportion of fully shared structures increases, indicating that on the multi-scale detection model, head weight sharing is a necessity.
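The statistic plotted in Fig. 5 can be computed with a few lines. The sketch assumes that each sampled head is summarized by its sharing index and that an index of 0 means all layers share weights; the function name and the input encoding are illustrative assumptions.

```python
def sharing_trend(share_indices, period):
    """Fraction of fully shared heads (share index 0) in each full window."""
    trend = []
    for start in range(0, len(share_indices) - period + 1, period):
        window = share_indices[start:start + period]
        trend.append(sum(1 for s in window if s == 0) / period)
    return trend
```

A rising sequence of fractions over successive windows corresponds to the controller increasingly preferring fully shared heads.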

Arch FLOPs (G) Search Cost (GPU-day) Searched Archs AP
NAS-FPN @ R- > #TPUs <
DetNAS-FPN-Faster -
DetNAS-RetinaNet -
NAS-FCOS (ours) @ R-
NAS-FCOS (ours) @- X-xd-

Table 3: Comparison with other NAS methods. For NAS-FPN, the input size is and the search cost should be multiplied by the number of TPUs used to train each architecture. Note that the FLOPs and AP of NAS-FPN @ here are from Figure in NAS-FPN [4], and NAS-FPN @ stacks the searched FPN structure times. The input images are resized such that their shorter side is 800 pixels in DetNASNet [2] and our models.
Figure 6: Correlation between the search reward obtained on the VOC meta-val dataset and the AP evaluated on COCO-val.

We also demonstrate the comparison with other NAS methods for object detection in Table 3. Our method is able to search for twice more architectures than DetNAS [2] per GPU-day. Note that the AP of NAS-FPN [4] is achieved by stacking the searched FPN times, while we do not stack our searched FPN. Our model with ResNeXt-101 (xd) as backbone outperforms NAS-FPN by AP points while using only FLOPs and less calculation cost.

Figure 7: Diagram of the relationship between FLOPs and AP with different backbones. Points of different shapes represent different backbones. NAS-FCOS@ has a slight increase in precision which also gains the advantage of computation quantity. One with channels obtains the highest precision with more computation complexity. Using FPN channel width and prediction head (@-) offers a trade-off.
Figure 8: Diagram of the relationship between parameters and AP with different backbones. Adjusting the number of channels in the FPN structure and head helps to achieve a balance between accuracy and parameters.

We further measure the correlation between the rewards obtained during the search process on the proxy dataset and the APs attained by the same architectures trained on COCO. Specifically, we randomly sample architectures from all the searched structures and train them on COCO with batch size . Since full training on COCO is time-consuming, we reduce the iterations to K. Each model is then evaluated on the COCO validation set. As shown in Fig. 6, there is a strong correlation between search rewards and APs obtained on COCO. Poor- and well-performing architectures can be distinguished very well by the rewards on the proxy task.
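One way to quantify such a relationship is a rank correlation between proxy rewards and full-training APs. The text does not say which correlation measure was used, so the Spearman coefficient below is an assumption chosen for illustration; it is computed with the standard library only.

```python
def rank(values):
    """Return 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(rank(xs), rank(ys))
```

A coefficient close to 1 would mean that ordering architectures by proxy reward nearly reproduces their ordering by AP, which is the property a proxy task needs for selection.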

Figure 9: Comparison of two different RL reward designs. The vertical axis represents AP obtained from the proxy task on the validation dataset.

4.3 Ablation Study

Design of Reinforcement Learning Reward

As discussed above, it is common to use widely accepted indicators as rewards for specific tasks in the search, such as mIOU for segmentation and AP for object detection. However, we found that using AP as the reward did not show a clear upward trend in short-term search rounds (blue curve in Fig. 9). A possible reason is that the controller tries to learn a mapping from the decoder to the reward, while the calculation of AP itself is complicated, which makes it difficult to learn this mapping within a limited number of iterations. In comparison, we clearly see an increase in AP when the validation loss is used as the RL reward (red curve in Fig. 9).

Decoder Search Space AP
NAS-FCOS @ only
NAS-FCOS @ only
Table 4: Comparisons between APs obtained under different search space with ResNet-50 backbone.

Effectiveness of Search Space

To further discuss the impact of the search spaces of the FPN $f$ and the head $h$, we design three experiments for verification. One is to search $f$ with the original head being fixed, one is to search $h$ with the original FPN being fixed, and another is to search the entire decoder ($f$+$h$). As shown in Table 4, it turns out that searching brings slightly more benefits than searching only. Our progressive search, which combines both $f$ and $h$, achieves a better result.

Impact of Deformable Convolution

As mentioned above, deformable convolutions, which are able to adapt to the geometric variations of objects, are included in the set of candidate operations for both the FPN $f$ and the head $h$. For fair comparison, we also replace all standard convolutions with deformable convolutions in the FPN structure of the original FCOS and repeat them twice, making the FLOPs and parameters nearly equal to those of our searched model. We call the new model DeformFPN-FCOS. It turns out that our NAS-FCOS model still achieves better performance (AP with FPN search only, and AP with both FPN and Head searched) than the DeformFPN-FCOS model (AP ) under this circumstance.

5 Conclusion

In this paper, we have proposed to use Neural Architecture Search to further optimize the process of designing object detection networks. It is shown in this work that top-performing detectors can be efficiently searched using carefully designed proxy tasks, search strategies and model evaluation metrics. The experiments on COCO demonstrate the efficiency of our discovered model NAS-FCOS and its flexibility to be used with various backbone architectures.




  1. H. Cai, L. Zhu and S. Han (2018) ProxylessNAS: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. Cited by: §2.2.
  2. Y. Chen, T. Yang, X. Zhang, G. Meng, C. Pan and J. Sun (2019) DetNAS: neural architecture search on object detection. arXiv preprint arXiv:1903.10979. Cited by: §2.2, §2.2, §4.2, Table 3.
  3. T. Elsken, J. H. Metzen and F. Hutter (2018) Neural architecture search: a survey. arXiv preprint arXiv:1808.05377. Cited by: §1.
  4. G. Ghiasi, T. Lin, R. Pang and Q. V. Le (2019) NAS-FPN: learning scalable feature pyramid architecture for object detection. Proc. IEEE Conf. Comp. Vis. Patt. Recogn.. Cited by: §1, §2.2, §2.2, §4.2, Table 3.
  5. Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei and J. Sun (2019) Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420. Cited by: §2.2.
  6. K. He, G. Gkioxari, P. Dollár and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §2.1.
  7. K. He, X. Zhang, S. Ren and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §3.1, §4.2.
  8. A. Kirillov, R. Girshick, K. He and P. Dollár (2019) Panoptic feature pyramid networks. arXiv: Computer Vision and Pattern Recognition. Cited by: §2.2.
  9. T. Kong, F. Sun, H. Liu, Y. Jiang and J. Shi (2019) FoveaBox: beyond anchor-based object detector. arXiv preprint arXiv:1904.03797. Cited by: §2.1.
  10. H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750. Cited by: §2.1.
  11. T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan and S. Belongie (2017) Feature pyramid networks for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2117–2125. Cited by: §1, §2.2, §2.2.
  12. T. Lin, P. Goyal, R. Girshick, K. He and P. Dollár (2017) Focal loss for dense object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2980–2988. Cited by: §2.1, §2.1, §2.2, §2.2.
  13. C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. Yuille and L. Fei-Fei (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. Proc. IEEE Conf. Comp. Vis. Patt. Recogn.. Cited by: §2.2.
  14. H. Liu, K. Simonyan and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §1, §2.2.
  15. H. Liu, C. Peng, C. Yu, J. Wang, X. Liu, G. Yu and W. Jiang (2019) An end-to-end network for panoptic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.. Cited by: §2.2.
  16. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu and A. Berg (2016) SSD: single shot multibox detector. In Proc. Eur. Conf. Comp. Vis., pp. 21–37. Cited by: §2.1, §3.1.
  17. V. Nekrasov, H. Chen, C. Shen and I. Reid (2019) Fast neural architecture search of compact semantic segmentation models via auxiliary cells. Proc. IEEE Conf. Comp. Vis. Patt. Recogn.. Cited by: §1, §2.2.
  18. H. Pham, M. Y. Guan, B. Zoph, Q. V. Le and J. Dean (2018) Efficient neural architecture search via parameter sharing. In ICML. Cited by: §1.
  19. J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. arXiv. Cited by: §2.1.
  20. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Proc. Advances in Neural Inf. Process. Syst., pp. 91–99. Cited by: §1, §2.1, §2.1, §3.1.
  21. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4510–4520. Cited by: §4.2.
  22. J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov (2017) Proximal policy optimization algorithms. arXiv: Comp. Res. Repository. Cited by: §3.3.
  23. D. Stamoulis, R. Ding, D. Wang, D. Lymberopoulos, B. Priyantha, J. Liu and D. Marculescu (2019) Single-path NAS: designing hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877. Cited by: §2.2.
  24. Z. Tian, C. Shen, H. Chen and T. He (2019) FCOS: fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355. Cited by: NAS-FCOS: Fast Neural Architecture Search for Object Detection, §1, §2.1, §2.2.
  25. Y. Wu and K. He (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §3.2.2.
  26. S. Xie, R. Girshick, P. Dollár, Z. Tu and K. He (2016) Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431. Cited by: §4.2.
  27. T. Zhao and X. Wu (2019) Pyramid feature attention network for saliency detection. arXiv: Computer Vision and Pattern Recognition. Cited by: §2.2.
  28. H. Zhou, M. Yang, J. Wang and W. Pan (2019) BayesNAS: a bayesian approach for neural architecture search. arXiv preprint arXiv:1905.04919. Cited by: §2.2.
  29. X. Zhou, D. Wang and P. Krähenbühl (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §2.1.
  30. C. Zhu, Y. He and M. Savvides (2019) Feature selective anchor-free module for single-shot object detection. arXiv preprint arXiv:1903.00621. Cited by: §2.1.
  31. X. Zhu, H. Hu, S. Lin and J. Dai (2018) Deformable convnets v2: more deformable, better results. arXiv preprint arXiv:1811.11168. Cited by: §3.2.
  32. B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §1, §2.2.