Black-Box Optimization of Object Detector Scales

Abstract

Object detectors have improved considerably in recent years through the use of advanced CNN architectures. However, many detector hyper-parameters are still tuned manually, or simply left at the values set by the detector authors. Automatic hyper-parameter optimization has not been explored for improving CNN-based object detectors. In this work, we propose the use of black-box optimization methods to tune the prior/default box scales of Faster R-CNN and SSD, using Bayesian Optimization, SMAC, and CMA-ES.

We show that by tuning the input image size and prior box anchor scales, mAP increases by 2% on PASCAL VOC 2007 with Faster R-CNN, and by 3% with SSD. On the COCO dataset with SSD, mAP improves for medium and large objects, but decreases by 1% for small objects. We also perform a regression analysis to find the most significant hyper-parameters to tune.

Keywords:
Object Detection, Scale Tuning, Black-Box Optimization, Hyper-parameter Tuning

1 Introduction

Object detection deals with classifying and localizing objects of interest in a given image. In recent years, a great deal of research has been devoted to object detection, which has many application domains such as autonomous cars, anomaly detection, medical image analysis, and video surveillance. Advances in Convolutional Neural Networks (CNNs) have taken deep-learning-based object detection a step forward, as they have proved to outperform traditional computer vision methods on benchmark datasets like MS-COCO [19] and PASCAL VOC [6].

The ability of CNN architectures to represent high-level image features is one of the reasons for the remarkable performance of state-of-the-art object detectors [35]. However, performance depends heavily on the selection of various hyper-parameters that guide and control the learning process. Every object detection method has several hyper-parameters, such as the input image dimensions, the sizes and scales of prior/default anchor boxes, multi-task loss weights, and the number of output proposals, in addition to conventional neural network hyper-parameters like the learning rate, momentum, and decay rate. The right choice of hyper-parameters is essential because it plays a significant role in the model's performance [7].

Hyper-parameter tuning is challenging because one needs to efficiently choose the right settings from a high-dimensional search space. Domain experts have insight into setting hyper-parameters, but they typically conduct many experiments and choose values after many trial-and-error runs. Furthermore, hyper-parameters depend on the dataset: values that work well for one dataset may not provide the same performance on a different one [13]. Hyper-parameter diversity is also problematic, as hyper-parameters can be binary, categorical, continuous, or conditional [13].

The current growth of the machine learning field has created a need to automate this laborious process and avoid human intervention. Automated machine learning (AutoML) [14] is an emerging field which aims to automate the entire machine learning pipeline. Besides AutoML, black-box optimization methods can also be applied to hyper-parameter optimization. The most basic methods are grid search [28] and random search [1]; however, both take a considerable amount of time and are computationally expensive.

Guided search can reduce the computational cost and the time taken to find the right set of hyper-parameters. Bayesian optimization (BO) [27] is a guided approach, as it exploits prior information. It has been shown to achieve better results with fewer evaluations than grid [28] and random search [1] in image classification, speech recognition, and natural language modeling [32][33][25][5]. Population-based approaches, namely Genetic Algorithms, Genetic Programming, Evolution Strategies, Evolutionary Programming, and Particle Swarm Optimization, have also shown remarkable results [4].

However, these methods have not been applied to tuning object detection hyper-parameters. These are generally tuned manually by trial and error, or used without tuning at the values defined by the detector authors, which might be sub-optimal for a specific dataset.

This work studies the applicability of black-box optimization methods such as Bayesian Optimization [27] and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [11] for tuning object detection hyper-parameters, in particular the anchor/default box scales. This allows the detector to be tuned specifically to a particular dataset, instead of relying on hand-tuned values that may produce sub-optimal performance.

To validate the proposed approach, black-box hyper-parameter optimization methods are used to optimize the Single-Shot MultiBox Detector (SSD) [21] and Faster R-CNN [30] on a variety of datasets. We find that black-box optimization is able to improve mAP and achieves better results than the hand-tuned configurations in most cases.

The contributions of this work are: we propose the use of black-box optimization methods to tune the prior boxes of Faster R-CNN, and the default boxes of SSD. We show that by using these methods, performance in terms of mAP on PASCAL VOC and MS-COCO increases by around 1-3%. We also show that the scales learned with black-box optimization transfer from PASCAL VOC 2007 to VOC 2012, with an mAP improvement as well, and we perform a regression analysis to find out which are the most important hyper-parameters to tune.

It should be noted that the objective of this work is not to improve or beat state-of-the-art detectors across many datasets, but to show the importance of automatic hyper-parameter tuning for object detection, in particular tuning the scales and input image size automatically, without a manual process. This is important for real use cases that apply state-of-the-art object detectors to novel datasets, especially for non-experts in computer vision.

2 Related Work

2.1 Black-Box Optimization in Deep Learning

Black-box optimization methods are widely used for tuning the hyper-parameters of deep learning algorithms. Bayesian optimization with Gaussian processes (BOGP) was first used to optimize the hyper-parameters of an image recognition deep learning architecture on CIFAR-10 [16], achieving a 3% improvement over the state of the art in 2012 [32]. A newer approach in Bayesian optimization, called Deep Networks for Global Optimization (DNGO) [33], uses neural networks instead of Gaussian processes (GPs) as the surrogate model for fitting distributions over the objective function. Bayesian optimization with DNGO also supports parallel hyper-parameter optimization, and was used to tune the hyper-parameters of various deep learning problems such as image classification and image caption generation.

In the work presented in [24], CMA-ES is compared with Bayesian optimization for hyper-parameter optimization of deep convolutional neural networks on the MNIST classification problem. The Particle Swarm Optimization (PSO) algorithm has also proved beneficial for hyper-parameter optimization: in [22] and [23], PSO is used to optimize the hyper-parameters of Deep Neural Networks (DNNs) in parallel and quickly on the CIFAR-10 dataset. [7] discusses various notable strategies in hyper-parameter optimization. However, these methods have not been explored much for tuning the hyper-parameters specific to deep-learning-based object detectors. In this work, we focus on the automatic tuning of object detector hyper-parameters using black-box optimization methods.

2.2 Object Detection Hyper-parameters Selection

The design of prior/default anchor boxes is a major challenge in object detection, as it depends completely on the sizes and aspect ratios of the objects to be detected in a particular dataset. [36] proposed an approach to dynamically adapt the design of the anchor boxes using the gradients of the loss function. This anchor box optimization method was integrated into YOLO v2 [29] and obtained a 1% mAP gain on the MS-COCO [19] and PASCAL VOC [6] datasets.

In YOLOv2 [29], k-means clustering is used to determine the prior anchor boxes. The ground truth bounding boxes in the training set are clustered based on Intersection Over Union (IOU) scores instead of the conventional Euclidean distance, since the latter produces larger errors for big boxes than for small ones. The cluster centroids are then used to design the anchor boxes. Our approach has an advantage over these methods, as it is not constrained to anchor boxes only and can include other object detection hyper-parameters.
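For concreteness, the following is a minimal sketch of this IOU-based k-means clustering, not the YOLOv2 implementation itself: ground-truth widths and heights are clustered with the distance d = 1 - IOU, where the IOU is computed as if all boxes shared the same corner. The box data here is randomly generated purely for illustration.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) pairs, treating all boxes as if anchored at the same corner."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs with distance 1 - IOU, as described above."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # nearest = highest IOU
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# Illustration with synthetic normalized widths/heights in [0, 1]:
boxes = np.random.default_rng(1).uniform(0.05, 0.9, size=(1000, 2))
print(kmeans_anchors(boxes, k=5).round(3))
```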

3 Proposed Approach

In our experiments we use Bayesian optimization and CMA-ES to tune the object detection hyper-parameters. In this section we briefly discuss Bayesian Optimization [27] and CMA-ES [11].

3.1 Bayesian Optimization

Bayesian optimization (BO) [27] is an efficient and effective method for global optimization, and over the years it has become a prominent solution for hyper-parameter optimization. In Bayesian optimization, a random set of hyper-parameter configurations is evaluated first, and a probabilistic surrogate model is fit to this data D. An acquisition function uses the surrogate model to compute a utility score for new sets of hyper-parameters. The hyper-parameters with the best score are then evaluated on the actual objective function, and the result is used to update the probabilistic surrogate model. The process iterates until a computational budget is exhausted.

In a nutshell, the two main components of Bayesian optimization are the probabilistic surrogate model and the acquisition function. The probabilistic surrogate model is a Bayesian approximation of the actual objective function that can be sampled efficiently. The acquisition function manages the trade-off between exploration and exploitation. We chose Expected Improvement (EI) [26] over other acquisition functions such as Probability of Improvement (PI) [17] and the Gaussian process upper confidence bound (GP-UCB) [15], as EI has been shown to have the best convergence rates [2].
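The sketch below illustrates this loop under simple assumptions; it is not the SMAC3 code used in our experiments. A Gaussian process surrogate is refit after every evaluation, EI is maximized over a random candidate pool, and `objective` stands in for training the detector with a given configuration and returning its validation mAP.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(X_cand, gp, f_best, xi=0.01):
    """EI for maximization: expected improvement over the best mAP seen so far."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(objective, bounds, n_init=5, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    X = rng.uniform(lo, hi, size=(n_init, len(bounds)))       # random initial design
    y = np.array([objective(x) for x in X])                   # expensive evaluations (mAP)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)                                          # update the surrogate
        cand = rng.uniform(lo, hi, size=(2000, len(bounds)))  # cheap candidate pool
        x_next = cand[np.argmax(expected_improvement(cand, gp, y.max()))]
        X, y = np.vstack([X, x_next]), np.append(y, objective(x_next))
    return X[np.argmax(y)], y.max()

# Toy usage; in our setting objective(x) would train and evaluate a detector.
best_x, best_y = bayes_opt(lambda x: -np.sum((x - 0.3) ** 2), bounds=[(0, 1)] * 3)
```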

In our experiments, we use Gaussian Process (GP) based BO, and Random Forest based BO, known as Sequential Model-based Algorithm Configuration (SMAC) [12].

Figure 1(a) shows the flowchart of the BO pipeline for object detection used in our experiments. We use the SMAC3 implementation [20] (footnote 1), which contains both BO and SMAC.

(a) Bayesian Optimization
(b) CMA-ES
Figure 1: Hyperparameter Optimization Pipeline for Object Detection

3.2 Covariance Matrix Adaptation Evolution Strategy

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [11] is also used for continuous hyper-parameter tuning. CMA-ES is an evolutionary algorithm for derivative-free optimization of non-linear, non-convex continuous objective functions. Its main advantage over other evolutionary algorithms is that it behaves well even with small population sizes. In a nutshell, CMA-ES generates samples from a multivariate normal distribution and evaluates these samples on the objective function to obtain a fitness value for each one. Based on the fitness of the samples, the multivariate normal distribution is adjusted to generate new samples in the next generation (iteration). This process is repeated until a sufficiently good sample is found or a computation budget is met. Figure 1(b) shows the CMA-ES pipeline flowchart for object detection used in our experiments. We use the pycma implementation of CMA-ES [10] (footnote 2).
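As an illustration of the ask/tell loop that pycma exposes, the following sketch runs CMA-ES over the seven SSD scales using the population size, budget, and step size from Table 10. The function `train_and_eval_map` is a placeholder for training SSD with the given scales and returning validation mAP; here it is faked with a smooth function so the example runs end to end.

```python
import numpy as np
import cma

def train_and_eval_map(scales):
    """Placeholder for training SSD with these prior-box scales and returning mAP.
    Faked here with a smooth surrogate purely so the sketch is runnable."""
    target = np.array([0.09, 0.19, 0.33, 0.52, 0.69, 0.84, 0.89])
    return 0.715 - float(np.sum((np.asarray(scales) - target) ** 2))

# Initial scales from Equation 1 (the defaults in Table 6) and step size 0.3 (Table 10).
x0 = [0.1, 0.2, 0.37, 0.54, 0.71, 0.88, 1.05]
es = cma.CMAEvolutionStrategy(x0, 0.3, {'popsize': 9, 'maxfevals': 225})

while not es.stop():
    scales = es.ask()                                   # sample one generation
    fitness = [-train_and_eval_map(s) for s in scales]  # pycma minimizes, so negate mAP
    es.tell(scales, fitness)                            # update the search distribution

print(es.result.xbest)                                  # best scale vector found
```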

3.3 Hyper-Parameter Space Design

In this section we describe the hyper-parameter space design for two selected object detection models.

We chose these two detectors because they perform detection using different approaches, which helps validate the generality of our method. Specifically, we chose SSD and Faster R-CNN, which are respectively one-stage and two-stage detectors. There are more recent detectors based on these designs, but we evaluate the original versions in order to separate the effect of hyper-parameter tuning from that of other architectural improvements.

Faster R-CNN [30] [31] uses a Region Proposal Network (RPN) to identify regions of interest, replacing the selective search algorithm of R-CNN [8] and Fast R-CNN [9]. The RPN is more computationally efficient and also has better detection performance. In Faster R-CNN, the image is fed into a pre-trained CNN to produce a convolutional feature map. This feature map is then used by the RPN to compute region proposals. As in Fast R-CNN, these region proposals are processed by RoI pooling and fed into fully connected layers for object classification.

Region proposals are obtained by sliding a window over the convolutional feature map. A set of k anchor boxes with different scales and ratios is used at each feature map location, so in total there are W × H × k anchors for a feature map of width W and height H. The original implementation of this model uses nine anchor boxes per location, with three ratios (0.5, 1, 2) and three scales (128, 256, and 512 pixels). Figure 4(a) shows the nine anchor boxes at a single location of an example image. The authors of Faster R-CNN do not mention how the anchor box parameters should be set; however, they perform an ablation study that varies the number of scales and ratios, indicating that nine anchor boxes seems to be optimal [31]. The performance of the detector is shown to be sensitive to the input image size and to the scales and ratios of the anchor boxes. The input image size is defined as the pixel size of an image's shortest side.
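As a small illustration (not the reference implementation, whose box-generation conventions may differ slightly), the nine anchors at one location can be generated by combining each scale with each ratio so that the box keeps an area of scale squared while its height/width follow the ratio:

```python
import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """The 3 x 3 = 9 anchor boxes centred at (cx, cy), as (x1, y1, x2, y2)."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)            # keep w * h = s**2
            h = s * np.sqrt(r)            # and h / w = r
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

print(anchors_at(300, 300).round(1))      # nine boxes around the point (300, 300)
```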

For this detector we consider three groups of hyper-parameters: the input image size, the anchor box scales, and the anchor box ratios. In order to reduce the number of assumptions, we treat all hyper-parameters as continuous, with a fixed transformation into the original parameters of Faster R-CNN, as specified in Table 1.

Hyper-param | Scaled Value | Type | Pixel Value
Input Image Size | 0.6 | C | 600
Anchor Ratio 1 | 0.25 | C | 0.5
Anchor Ratio 2 | 0.5 | C | 1
Anchor Ratio 3 | 1 | C | 2
Anchor Scale 1 | 0.25 | C | 128
Anchor Scale 2 | 0.5 | C | 256
Anchor Scale 3 | 1 | C | 512
Table 1: Hyper-parameter space for Faster R-CNN. Type C denotes a continuous hyper-parameter; the Scaled Value column shows the default configuration in the continuous search space, and the Pixel Value column its equivalent in the original detector parameters.
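The exact transform used to map the continuous search values to detector parameters was not preserved above; the sketch below shows one linear mapping that is consistent with the scaled/pixel value pairs in Table 1 (0.6 to 600, 0.25 to 128, 1 to 2, and so on). The multipliers are therefore our own inference, not necessarily the authors' exact transform.

```python
def to_detector_params(x):
    """x = [img_size, ratio1, ratio2, ratio3, scale1, scale2, scale3], all in the
    continuous search space. Multipliers 1000 / 2 / 512 are inferred from Table 1."""
    return {
        "input_image_size": int(round(x[0] * 1000)),              # 0.6  -> 600 px
        "anchor_ratios": [xi * 2 for xi in x[1:4]],                # 0.25 -> 0.5, etc.
        "anchor_scales": [int(round(xi * 512)) for xi in x[4:7]],  # 0.25 -> 128, etc.
    }

# The default configuration of Table 1 maps back to the original Faster R-CNN values:
print(to_detector_params([0.6, 0.25, 0.5, 1.0, 0.25, 0.5, 1.0]))
# {'input_image_size': 600, 'anchor_ratios': [0.5, 1.0, 2.0], 'anchor_scales': [128, 256, 512]}
```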
Figure 2: SSD Architecture showing prior boxes at each layer. Image adapted from [21]

The Single Shot MultiBox Detector (SSD) [21] is a single-stage object detector. The architecture adds extra convolutional layers on top of a VGG backbone and predicts detections from feature maps at several depths, each associated with prior anchor boxes of different scales and ratios. Figure 2 shows the prior anchor boxes with varying scales and ratios on each convolutional layer.

Each feature map location predicts the offsets relative to multiple prior anchor boxes, along with the confidences for each class. The loss function is computed by matching the prior anchor boxes with the ground truth boxes using the IOU score. Hence, designing the prior box scales and ratios well is essential so that the prior boxes at each feature map position align with the objects to be detected, and this design depends on the sizes and ratios of the objects in the dataset. In SSD, the scale sizes are determined using Equation 1:

s_k = s_min + ((s_max - s_min) / (m - 1)) (k - 1),  k in [1, m]    (1)

where s_k represents the scale at the k-th feature map and m denotes the number of feature maps used for prediction. The minimum scale size is s_min = 0.2 and the maximum scale size is s_max = 0.9, while the conv4_3 prior box scale is fixed to 0.1. There are four non-square anchor ratios a_r in {2, 3, 1/2, 1/3} and one square ratio (a_r = 1) for each scale. There is also an extra square box with anchor ratio 1 whose scale is computed as s'_k = sqrt(s_k * s_{k+1}). The width and height of the default boxes are computed as:

w_k = s_k * sqrt(a_r),  h_k = s_k / sqrt(a_r)    (2)

Some feature maps skip the anchor ratios 3 and 1/3. While training SSD on the MS-COCO dataset, the authors reduce the minimum scale to 0.15 and the conv4_3 prior box scale to 0.07 to tackle the smaller objects in MS-COCO.
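The following sketch computes the default scales from Equation 1 and the per-box widths and heights from Equation 2. One assumption here is the rounding to integer percentages, which is how common SSD implementations arrive at the default values 0.2, 0.37, 0.54, 0.71, 0.88 shown later in Table 6; the exact rounding in the implementation we use may differ slightly.

```python
import math

def ssd_scales(s_min=0.2, s_max=0.9, m=5):
    """Equation 1 for the m feature maps after conv4_3, working in integer percentages."""
    step = math.floor((s_max * 100 - s_min * 100) / (m - 1))          # 17
    scales = [round(s_min * 100 + step * k) / 100 for k in range(m)]  # 0.2 ... 0.88
    return [0.1] + scales                                             # conv4_3 fixed at 0.1

def default_boxes(s_k, s_next, ratios=(1, 2, 3, 1/2, 1/3)):
    """Equation 2: (width, height) of each default box for one feature map,
    plus the extra square box of scale sqrt(s_k * s_next)."""
    boxes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in ratios]
    boxes.append((math.sqrt(s_k * s_next),) * 2)
    return boxes

print(ssd_scales())  # [0.1, 0.2, 0.37, 0.54, 0.71, 0.88]
```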

For this detector, we decided to tune the scales and treat them as continuous hyper-parameters, as shown in Table 2. All other hyper-parameters’ values are kept from the original implementation.

Hyper-param | Type | Feature map | Box width | Box height
Scale 0 (s_0) | C | Conv 4_3 | s_0 * sqrt(a_r) | s_0 / sqrt(a_r)
Scale 1 (s_1) | C | Conv 7 (fc7) | s_1 * sqrt(a_r) | s_1 / sqrt(a_r)
Scale 2 (s_2) | C | Conv 8_2 | s_2 * sqrt(a_r) | s_2 / sqrt(a_r)
Scale 3 (s_3) | C | Conv 9_2 | s_3 * sqrt(a_r) | s_3 / sqrt(a_r)
Scale 4 (s_4) | C | Conv 10_2 | s_4 * sqrt(a_r) | s_4 / sqrt(a_r)
Scale 5 (s_5) | C | Conv 11_2 | s_5 * sqrt(a_r) | s_5 / sqrt(a_r)
Scale 6 (s_6) | C | NA | NA | NA
Table 2: Hyper-parameter space for the Single-Shot MultiBox Detector (SSD). Type C denotes a continuous hyper-parameter and a_r denotes the anchor ratio; box widths and heights follow Equation 2, and Scale 6 only enters through the extra square box of the last feature map.

In the following two sections we show our main experimental results. We evaluate Faster R-CNN and SSD on the PASCAL VOC 2007/2012 and MS COCO datasets, and learn hyper-parameters using CMA-ES and Bayesian Optimization.

4 Experiments on Faster R-CNN

The training and validation splits of PASCAL VOC 2007 are used for training, and the fitness score (mAP) is computed on the PASCAL VOC 2007 test split. Table 9 in the appendix shows the CMA-ES parameters used in this experiment. Training and evaluation of Faster R-CNN for all hyper-parameter configurations of a particular generation were done in parallel. We use the public PyTorch implementation of Faster R-CNN [34] (footnote 3).

Figure 3 shows the performance of Faster R-CNN optimized using CMA-ES for 25 generations (225 evaluations) on PASCAL VOC 2007 dataset. This setup achieves an mAP of 71.78 % which is 1.79% more than the default hyper-parameters (69.9% mAP). Figure 4 shows the comparison of prior boxes in the original implementation of Faster R-CNN with the prior boxes found using CMA-ES. Table 4 shows an overall comparison of all our results along with the associated anchor scales and aspect ratios.

Figure 3: Faster R-CNN optimization by CMA-ES on PASCAL VOC 2007

4.1 Regression Analysis

An important question is to determine which are the best scales to tune, as some scales might be more important than others. To answer this question, we performed regression analysis between the object detector performance, as measured by mAP, and the individual hyper-parameters. Regression analysis can explain the relationship between a dependent variable and many independent variables, and can also explain the significance of the independent variables.

For this we normalize the mAP and all hyper-parameters to the [0, 1] range and train a linear regression model. The coefficients of this model can then be interpreted as importance scores, telling us which hyper-parameters matter most. We measure the goodness of fit using the coefficient of determination R², which indicates how much of the variance in the dependent variable (the mAP) is explained by the hyper-parameters.
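A minimal sketch of this analysis is shown below, using scikit-learn and synthetic data in place of the real (configuration, mAP) pairs collected during optimization; the variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the evaluated configurations: one row per evaluation,
# seven columns for the Faster R-CNN hyper-parameters, y is the mAP obtained.
rng = np.random.default_rng(0)
X = rng.uniform(size=(150, 7))
y = 0.6 * X[:, 0] + 0.25 * X[:, 1] + rng.normal(0, 0.05, size=150)

X_n = MinMaxScaler().fit_transform(X)                          # hyper-parameters -> [0, 1]
y_n = MinMaxScaler().fit_transform(y.reshape(-1, 1)).ravel()   # mAP -> [0, 1]

reg = LinearRegression().fit(X_n, y_n)
importance = np.abs(reg.coef_)        # |coefficient| as an importance score
r2 = reg.score(X_n, y_n)              # coefficient of determination R^2
print(importance.round(2), round(r2, 2))
```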

Table 3 shows the results of the regression analysis for the Faster R-CNN hyper-parameters. The highest coefficient is associated with the input image size, indicating that it has the largest effect on mAP and that tuning it is very important. Anchor scales one and two also have significant coefficients, indicating their importance, which we believe makes sense as they have the biggest receptive field size. Adjusting the ratios seems to have little impact on the overall mAP.

(a) Original prior anchor boxes. Image scale = 600
(b) Prior anchor boxes found using CMA-ES. Image scale = 699
Figure 4: Faster R-CNN prior anchor boxes comparison for PASCAL VOC 2007
Method | R² | Input Size | Scale 1 | Scale 2 | Scale 3 | Ratio 1 | Ratio 2 | Ratio 3
CMA-ES | 0.52 | 0.67 | 0.25 | 0.19 | 0.02 | 0.02 | 0.01 | 0.07
CMA-ES (without input size) | 0.21 | - | 0.32 | 0.15 | 0.01 | 0.01 | 0.02 | 0.08
Table 3: Regression Analysis of Faster R-CNN Hyper-parameters
Settings Anchor Scales Aspect Ratios mAP(%)
Default 0.5, 1, 2 69.9
BOGP 0.941 , 1.155, 2.015 71.37
SMAC 0.4, 0.5, 1 71.56
CMA-ES 0.259, 0.964, 1.741 71.78
Table 4: Summary of our Faster R-CNN results on PASCAL VOC test 2007.

5 Experiments on SSD

5.1 Prior Boxes Optimization using CMA-ES

The quality of CMA-ES optimization depends strongly on proper initial parameters. We set the population size for PASCAL VOC 2007 based on [11], and for MS-COCO we use a smaller population size based on our GPU resource availability. The CMA-ES parameters we use are shown in Table 10 in the appendix. We set the initial scales according to Equation 1, i.e. [0.1, 0.2, 0.37, 0.54, 0.71, 0.88, 1.05]. Mean average precision (mAP) is used as the fitness score for the individuals. We use a high-quality PyTorch implementation of SSD [18] (footnote 4).

(a) PASCAL VOC 2007
(b) MS COCO
Figure 5: Optimizing SSD hyperparameters using CMA-ES.

Anchor Scales on PASCAL VOC 2007. We use the training and validation splits of PASCAL VOC 2007 for training, and the fitness score (mAP) is computed on the test split. We evaluate hyper-parameter configurations by training SSD on nine different Nvidia Titan Xp GPUs. Training and evaluating a single hyper-parameter configuration of SSD takes around 24.5 hours, so one generation can be evaluated in roughly a day when done in parallel.
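A sketch of how such a generation can be evaluated in parallel is shown below, with one configuration per GPU. The script name `train_eval.py` and its command-line interface are hypothetical placeholders for the actual training and evaluation code.

```python
import os
import itertools
import subprocess
from concurrent.futures import ThreadPoolExecutor

def evaluate_config(args):
    """Train/evaluate one scale configuration on one GPU and parse the resulting mAP.
    Assumes a (hypothetical) train_eval.py that prints the final mAP on its last line."""
    gpu_id, scales = args
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
    cmd = ["python", "train_eval.py", "--scales", ",".join(f"{s:.4f}" for s in scales)]
    out = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    return float(out.stdout.strip().splitlines()[-1])

def evaluate_generation(population, n_gpus=9):
    """Evaluate one CMA-ES generation, one configuration per GPU."""
    with ThreadPoolExecutor(max_workers=n_gpus) as pool:
        return list(pool.map(evaluate_config,
                             zip(itertools.cycle(range(n_gpus)), population)))
```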

Figure 5(a) shows the performance of SSD optimized using CMA-ES for 25 generations (225 evaluations). We can clearly see performance improvements over the generations, and the median mAP also increases with each generation as the anchor scales are tuned automatically. Figure 6 shows a box plot of each anchor scale value over the generations, illustrating the convergence of each anchor scale towards its best value. The best anchor scale configuration found using CMA-ES achieves a mean average precision of 71.55%, which is 3.55% higher than the default scales (68.0% mAP).

Figure 6: Box plot of SSD Anchor Scale values over generations on PASCAL VOC 2007

Anchor Scales on MS COCO. We also evaluate CMA-ES on MS COCO. For this dataset, the authors of SSD reduce the minimum scale and the conv4_3 prior box scale (to 0.15 and 0.07, respectively) to tackle the smaller objects in MS-COCO. Instead of using these modified scales as the initial scales for CMA-ES, we use the regularly placed scales according to Equation 1 and a conv4_3 scale of 0.1, in order to analyze how well CMA-ES adapts to a dataset with many small objects. We use the trainval35k split of MS COCO 2014 for training, and the fitness score (mAP) is computed on the minival split. Training and evaluating a single hyper-parameter configuration took around 44 hours.

Figure 5(b) shows the performance of SSD optimized using CMA-ES for 23 generations (92 evaluations). We clearly see the fitness increasing over the generations as the anchor scales are tuned by CMA-ES. The anchor scales found using CMA-ES adapted well, finding smaller scales appropriate for a dataset with many small objects. However, the anchor scales found by CMA-ES achieve 24.6% mAP, which is 0.5% lower than the original anchor scales. We attribute this to the small population size we used due to GPU resource constraints; with a larger population size, CMA-ES should be able to find better anchor scales, given that it kept improving the scales over the generations even with a small population. Table 5 shows the detailed mAP results, with improvements in AP for medium and large object sizes.

Settings | AP, IOU 0.5:0.95 | AP, IOU 0.5 | AP, IOU 0.75 | AP, small | AP, medium | AP, large
Default | 25.1 | 43.1 | 25.8 | 6.6 | 25.9 | 41.4
CMA-ES | 24.6 | 42.8 | 25.1 | 4.8 | 26.3 | 42.9
Table 5: Comparison of SSD Scales on MS COCO - CMA-ES vs Default Scales

5.2 Prior Boxes Optimization using Bayesian Optimization

We also experiment with Bayesian Optimization to tune the prior anchor box scales of SSD for PASCAL VOC 2007, using the same hyper-parameter space as shown in Table 2. Expected Improvement (EI) is used as the acquisition function in both BOGP and SMAC. The experiment was carried out sequentially with a computation budget of 75 function evaluations. Our results are presented in Figure 7. Although BOGP and SMAC achieve almost the same best mAP, BOGP required fewer function evaluations than both SMAC and CMA-ES. However, after finding the best scales, many of the anchor scales generated by BOGP perform poorly, which is visible as scatter in the plot. This is likely because the optimization algorithm increases its exploration in search of better hyper-parameters.

A summary of our results is shown in Table 6, including the associated anchor scales. We include a baseline where we obtained anchor scales using k-means on the training set, similar to YOLOv2 [29]. There was only a small improvement by using k-means scales, and it is outperformed by all hyper-parameter tuning methods.

Figure 7: Optimizing SSD hyper-parameters using Bayesian Optimization and SMAC on PASCAL VOC 2007
Method | Scale 0 | Scale 1 | Scale 2 | Scale 3 | Scale 4 | Scale 5 | Scale 6 | mAP
Default | 0.1 | 0.2 | 0.37 | 0.54 | 0.71 | 0.88 | 1.05 | 68.0
BOGP | 0.0973 | 0.2132 | 0.3689 | 0.5224 | 0.5882 | 1.038 | 1.051 | 71.55
SMAC | 0.1037 | 0.1935 | 0.3707 | 0.5542 | 0.6934 | 0.9481 | 1.0481 | 71.53
CMA-ES | 0.0908 | 0.1892 | 0.3276 | 0.5240 | 0.6865 | 0.8390 | 0.8894 | 71.50
k-Means | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 68.35
Table 6: Summary of our SSD results on PASCAL VOC test 2007.

5.3 Generalization Analysis

So far we have optimized the object detector scales based on their evaluation score on the PASCAL VOC 2007 test set. Hyper-parameter optimization may overfit to the evaluation dataset [3]. To verify the generalization of the learned scales, we tested the scales learned on PASCAL VOC 2007 on a different dataset, namely the PASCAL VOC 2012 validation set. As seen in Table 7, the optimized models are still able to perform better than the hand-tuned scales on the PASCAL VOC 2012 validation set, and SMAC shows the best generalization performance among the compared methods.

Dataset Default BO-GP SMAC CMA-ES k-Means
VOC 2007 Test 68.0 71.55 71.53 71.50 68.35
VOC 2012 Val 61.74 63.54 64.43 63.63 61.91
Table 7: Performance comparison of optimized SSD scales on PASCAL VOC test 2007 and PASCAL VOC validation 2012

5.4 Regression Analysis

Similarly to Section 4.1, we performed a regression analysis to find the importance of each anchor scale, using the absolute value of each coefficient as the importance score. Table 8 shows our results. For BOGP and SMAC, anchor scales zero and one have the largest coefficients. Moreover, both BOGP and SMAC agree that anchor scale 0 is a very important hyper-parameter, and the high R² score also indicates a good fit, suggesting that this scale is probably the most important one to tune.

The higher R² score of the regression on the SMAC results indicates that the anchor scales explain the mAP well. For CMA-ES, anchor scales two, three, and four have the largest coefficients, but the low R² score makes this relation not significant.

It is interesting that the explainability of the mAP given the scales varies considerably with the hyper-parameter optimization methods. One would imagine that there should be no such relation, but each method guides sampling on the hyper-parameter space in a different way, which might introduce certain biases.

Method | R² | Scale 0 | Scale 1 | Scale 2 | Scale 3 | Scale 4 | Scale 5 | Scale 6
CMA-ES | 0.15 | 0.15 | 0.04 | 0.38 | 0.50 | 0.28 | 0.03 | 0.05
BOGP | 0.66 | 0.54 | 0.31 | 0.12 | 0.04 | 0.02 | 0.00 | 0.22
SMAC | 0.82 | 0.82 | 0.17 | 0.02 | 0.06 | 0.12 | 0.09 | 0.08
Table 8: Regression analysis with SSD hyper-parameters on PASCAL VOC 2007

6 Conclusions and Future Work

In this paper, we demonstrated the performance of black-box optimization for object detection hyper-parameters, in particular the default/anchor box scales, on PASCAL VOC 2007 and MS COCO. From our experimental results, we conclude that black-box optimization produces an improvement in mAP by adjusting the anchor/prior box scales of Faster R-CNN and SSD on the PASCAL VOC 2007 and MS-COCO datasets, and generally achieves better results than the hand-tuned configurations. On MS-COCO we observed decreased performance for small objects, with only medium and large objects seeing an improvement in AP.

GP-based Bayesian Optimization obtains good performance with fewer function evaluations, while the CMA-ES results clearly show the performance improving with an increasing number of generations. BOGP and SMAC could be studied further with different initial designs.

We also evaluated the transferability of the scales learned with SSD on PASCAL VOC 2007 by evaluating them on the PASCAL VOC 2012 validation set, which also shows an improvement in mAP. We believe this shows that learning scales with these methods is a good alternative to manually designing them. It is possible that the scales currently used by object detectors are sub-optimal, in the sense that each dataset probably has a different set of optimal scales. Automatic tuning methods should help practitioners tune an object detector for a particular dataset with minimal effort.

Finally, for each combination we performed a simple regression analysis to find out the importance of each hyper-parameter to the overall mAP. We find that for Faster R-CNN, the biggest factor is the input image size, while for SSD the first scales contribute more to increasing mAP. This information is valuable for future research, as efforts can be concentrated on the most important hyper-parameters, decreasing search time and computational costs.

Broader research is needed on larger hyper-parameter spaces that include other object detection hyper-parameters not used in our experiments. For example, it would be very interesting to tune not only the actual scales but also the number of scales, which may vary with the dataset, in particular on MS COCO as it contains many small objects. The multi-task loss weights are also generally not tuned and could be a further source of improvement. Finally, multi-objective optimization can be used to choose hyper-parameters that minimize prediction time while maximizing the task performance (mAP) of the object detector.

Appendix A Parameters of Optimization Methods

In this section we provide details of the parameters of CMA-ES, as shown in Table 9 and Table 10.

Dataset | Step Control | Population Size | Maximum Evaluations | Number of Genes | Number of Generations
PASCAL VOC 2007 | 0.3 | 6 | 150 | 7 | 25
Table 9: CMA-ES parameters used in the experiments with Faster R-CNN. The initial vector is [0.6, 0.25, 0.5, 1.0, 0.25, 0.5, 1.0].
Dataset | Step Control | Population Size | Maximum Evaluations | Number of Genes | Number of Generations
PASCAL VOC 2007 | 0.3 | 9 | 225 | 7 | 25
MS-COCO | 0.3 | 4 | 92 | 7 | 23
Table 10: CMA-ES parameters used for SSD experiments. The initial vector is set to [0.1, 0.2, 0.37, 0.54, 0.71, 0.88, 1.05].

Appendix B Additional Details of PASCAL VOC Results

In this section we provide details on the per-class average precision for SSD and Faster R-CNN with different hyper-parameter optimization methods. These results are available in Table 11 for SSD, and in Table 12 for Faster R-CNN.

In both cases we see modest improvements in class AP over the default hyper-parameters, up to 4% in absolute, for example, for the horse class in Faster R-CNN.

Object / Method | Default | BOGP | SMAC | CMA-ES | k-Means
aero 73.4 75.51 76.03 74.4 76.41
bike 77.5 80.17 79.19 78.86 75.65
bird 64.1 67.24 69.13 67.24 63.92
boat 59.0 62.69 66.49 63.80 62.46
bottle 38.9 41.24 41.88 41.45 39.13
bus 75.2 82.72 80.02 79.20 76.19
car 80.8 83.51 83.21 83.07 82.03
cat 78.5 82.76 81.56 84.28 78.36
chair 46.0 52.48 53.18 52.13 48.82
cow 67.8 76.78 77.37 76.28 74.36
dining table 69.2 69.50 70.50 69.02 63.32
dog 76.6 78.66 79.31 79.64 77.28
horse 82.1 81.91 82.83 83.74 78.70
motor bike 77.0 80.28 80.17 80.11 76.10
person 72.5 75.37 74.91 74.98 69.83
potted plant 41.2 44.62 43.57 44.47 41.42
sheep 64.2 68.05 66.97 70.84 66.48
sofa 69.1 73.67 71.49 71.45 70.09
train 78.0 83.34 82.37 83.95 77.39
monitor 68.5 70.56 70.45 71.03 69.03
mAP 68.0 71.55 71.37 71.50 68.35
Table 11: Per-class average precision of the SSD object detector optimized by different black-box methods on the PASCAL VOC 2007 test set
Object / Method | Default | BOGP | SMAC
aero 70.00 73.71 71.6
bike 80.6 80.66 79.4
bird 70.1 69.91 68.5
boat 57.3 56.19 56.90
bottle 49.9 57.11 57.06
bus 78.2 76.94 80.88
car 80.4 82.94 84.42
cat 82.0 82.46 83.61
chair 52.2 50.75 50.85
cow 75.3 79.96 78.63
dining table 67.2 70.1 70.47
dog 80.3 79.27 80.65
horse 79.8 83.21 82.56
motor bike 75.0 75.48 76.69
person 76.3 77.27 77.01
potted plant 39.1 43.23 44.38
sheep 68.3 71.33 71.22
sofa 67.3 67.34 66.71
train 81.1 77.17 76.59
monitor 67.6 72.45 73.13
mAP 69.9 71.37 71.56
Table 12: Per-class average precision of the Faster R-CNN object detector optimized by different black-box methods on the PASCAL VOC 2007 test set

We also provide details on the anchor scales that we obtained using automatic machine learning.

Figures 8, 9, and 10 show a comparison of the learned anchor boxes with the original anchor boxes in SSD. The changes are very subtle, giving evidence that the original anchor boxes might be good enough for most use cases, but still there are large changes in the biggest scales (Figure 10).

(a) Prior anchor boxes for feature map at conv 4_3
(b) Prior anchor boxes for feature map at conv 7
(c) Prior anchor boxes for feature map at conv 8_2
(d) Prior anchor boxes for feature map at conv 9_2
(e) Prior anchor boxes for feature map at conv 10_2
(f) Prior anchor boxes for feature map at conv 11_2
Figure 8: Prior anchor boxes used in the original implementation of SSD on PASCAL VOC 2007

BOGP

SMAC

CMA-ES

Figure 9: Prior anchor boxes comparison for layers conv 4_3, fc7 and conv 8_2 in SSD on PASCAL VOC 2007. First column: BOGP, second column: SMAC, third column: CMA-ES.

BOGP

SMAC

CMA-ES

Figure 10: Prior anchor boxes comparison for layers conv 9_2, conv 10_2 and conv 11_2 in SSD on PASCAL VOC 2007. First column: BOGP, second column: SMAC, third column: CMA-ES.

Appendix C Additional Results on the Marine Debris Dataset

An important motivation behind this work is to apply hyper-parameter optimization to real use cases that train object detectors in novel datasets. Considering this, we applied CMA-ES for tuning SSD anchor scales on a new marine debris dataset that we captured.

This dataset was created for the purpose of detecting and capturing debris with a marine surface vehicle. It is used to train object detectors, with preference for a detector with high mAP that runs on a resource-constrained computer inside the surface vehicle. The object classes required for this task are not well represented in most object detection datasets, especially for marine scenes, so we decided to build our own dataset. This dataset is not public yet; it will be released in a future publication that we are preparing.

The images were collected and extracted from various online resources such as TACO, the Open Images Dataset, and Flickr Creative Commons images, and some were captured using a surface vehicle in a water tank. Annotation labels from TACO and Open Images are re-used and modified to suit our use case, and we manually annotated the images coming from Flickr. Figures 11 and 12 show some of the images from the Marine Debris Dataset.

The dataset consists of 2849 images for training, and 1079 images for evaluation. We annotated the following classes with bounding boxes:

  • Marine Vehicles: Ship, Boat, One person vehicle

  • Humans: Swimmer

  • Marine Structures: Pier, Dock, Bridge, Miscellaneous Structure

  • Marine Animals: Marine Mammal

  • Debris: Bottle, Package, Miscellaneous Debris, Wooden Debris, Debris Patch

Dataset | Step Control | Population Size | Maximum Evaluations | Number of Genes | Number of Generations
Marine Debris Dataset | 0.3 | 8 | 112 | 7 | 14
Table 13: CMA-ES initial parameter settings on the marine debris dataset
Figure 11: Sample Images from the Marine Debris Dataset
Figure 12: Sample Images from the Marine Debris Dataset

C.1 Optimization of SSD Hyper-parameters using CMA-ES on the Marine Debris Dataset

In this experiment, the prior anchor scales of SSD are optimized using CMA-ES. We use the same initial parameters and hyper-parameter space as in the SSD experiments on PASCAL VOC. Table 13 shows the CMA-ES parameters used in this experiment.

The hyper-parameter configurations generated in each generation are evaluated in parallel by training SSD on eight different Nvidia Titan Xp GPUs. The model is first trained for 20,000 steps with an initial learning rate; training then continues with two further learning rates for another 5,000 steps each. The batch size per training step is set to 32.

Figure 13: Optimization of SSD Hyper-parameters using CMA-ES on marine debris dataset (15 generations). The blue line indicates the median of mAP scores, while the orange line indicates the maximum of mAP scores of hyper-parameters evaluated in that particular generation. The increase in median over generations shows the evolution of distribution towards the better performing region. The red point shows the best hyper-parameter’s mAP score.
Method | Scale 0 | Scale 1 | Scale 2 | Scale 3 | Scale 4 | Scale 5 | Scale 6 | mAP
Default | 0.1 | 0.2 | 0.37 | 0.54 | 0.71 | 0.88 | 1.05 | 51.47
CMA-ES | 0.04811 | 0.2257 | 0.3536 | 0.5765 | 0.7667 | 0.8600 | 1.03711 | 54.39
Table 14: Performance comparison of SSD scales optimized by CMA-ES on the marine debris dataset, demonstrating that the scales tuned by CMA-ES achieve better results than the default configuration.

Figure 13 shows the optimization of SSD anchor scales over generations. The best anchor scales found using CMA-ES achieves an mAP of 54.39%, which is 2.92% greater than the default scales. It is evident from Figure 14 that the anchor scale 0 tuned by CMA-ES is smaller than the default anchor scale 0. Also there is a significant change in the anchor scale 4.

Moreover, the per-class average precision shown in Table 15 confirms that tuning the anchor scales improves the performance of all classes except the marine vehicle class.

Method | Marine structure (AP) | Marine mammal (AP) | Debris (AP) | Marine vehicle (AP) | Swimmer (AP) | mAP
Default | 0.55 | 0.62 | 0.49 | 0.48 | 0.44 | 51.47
CMA-ES | 0.58 | 0.65 | 0.51 | 0.47 | 0.49 | 54.39
Table 15: Comparison of average precision of objects on marine debris dataset - Default vs CMA-ES

Regression analysis. Similarly to previous experiments we perform regression analysis to find the important anchor scales. Table 16 shows that anchor scales zero and six have more impact on mAP when compared to other anchor scales.

Method | R² | Scale 0 | Scale 1 | Scale 2 | Scale 3 | Scale 4 | Scale 5 | Scale 6
CMA-ES | 0.4 | 0.37 | 0.03 | 0.18 | 0.02 | 0.03 | 0.14 | 0.41
Table 16: Regression analysis of SSD hyper-parameters on the Marine Debris Dataset

Scales visualization. In Figures 14 and 15 we present a comparison of the scales learned on this dataset using CMA-ES with the original SSD scales. The differences are small but there are relevant changes, which is consistent with our regression analysis results in Table 16.

Default

CMA-ES

Figure 14: Prior anchor boxes comparison in layers conv 4_3, conv 7 and conv 8_2 of SSD on the Marine Debris Dataset. First column: default implementation, second column: CMA-ES.

Default

CMA-ES

Figure 15: Prior anchor boxes comparison in layers conv 9_2, conv 10_2, and conv 11_2 of SSD on the Marine Debris Dataset. First column: default implementation, second column: CMA-ES.

Footnotes

  1. https://github.com/automl/SMAC3
  2. https://github.com/CMA-ES/pycma
  3. https://github.com/jwyang/faster-rcnn.pytorch
  4. https://github.com/lufficc/SSD

References

  1. J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305. Cited by: §1, §1.
  2. A. D. Bull (2011) Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research 12 (Oct), pp. 2879–2904. Cited by: §3.1.
  3. G. C. Cawley and N. L. Talbot (2010) On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research 11 (Jul), pp. 2079–2107. Cited by: §5.3.
  4. C. A. C. Coello, G. B. Lamont and D. A. Van Veldhuizen (2007) Evolutionary algorithms for solving multi-objective problems. Vol. 5, Springer. Cited by: §1.
  5. G. E. Dahl, T. N. Sainath and G. E. Hinton (2013) Improving deep neural networks for lvcsr using rectified linear units and dropout. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 8609–8613. Cited by: §1.
  6. M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn and A. Zisserman (2010-06) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §1, §2.2.
  7. M. Feurer and F. Hutter (2019) Hyperparameter optimization. See Automatic machine learning: methods, systems, challenges, Hutter et al., pp. 3–38. Cited by: §1, §2.1.
  8. R. Girshick, J. Donahue, T. Darrell and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §3.3.
  9. R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §3.3.
  10. N. Hansen, Y. Akimoto and P. Baudis (2019-02) CMA-ES/pycma on Github. Zenodo, DOI:10.5281/zenodo.2559634. Cited by: §3.2.
  11. N. Hansen (2016) The CMA evolution strategy: a tutorial. arXiv preprint arXiv:1604.00772. Cited by: §1, §3.2, §3, §5.1.
  12. F. Hutter, H. H. Hoos and K. Leyton-Brown (2011) Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pp. 507–523. Cited by: §3.1.
  13. F. Hutter, L. Kotthoff and J. Vanschoren (Eds.) (2018) Automated machine learning: methods, systems, challenges. Springer. Note: In press, available at http://automl.org/book. Cited by: §1.
  14. F. Hutter, L. Kotthoff and J. Vanschoren (Eds.) (2019) Automatic machine learning: methods, systems, challenges. Springer. Cited by: §1, 7.
  15. E. Kaufmann, O. Cappé and A. Garivier (2012) On bayesian upper confidence bounds for bandit problems. In Artificial intelligence and statistics, pp. 592–600. Cited by: §3.1.
  16. A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report . Cited by: §2.1.
  17. H. J. Kushner (1964) A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering 86 (1), pp. 97–106. Cited by: §3.1.
  18. C. Li (2018) High quality, fast, modular reference implementation of SSD in PyTorch. https://github.com/lufficc/SSD. Cited by: §5.1.
  19. T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1, §2.2.
  20. M. Lindauer, K. Eggensperger, M. Feurer, S. Falkner, A. Biedenkapp and F. Hutter (2017) SMAC v3: algorithm configuration in Python. GitHub, https://github.com/automl/SMAC3. Cited by: §3.1.
  21. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1, Figure 2, §3.3.
  22. P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos and J. R. Pastor (2017) Particle swarm optimization for hyper-parameter selection in deep neural networks. In Proceedings of the genetic and evolutionary computation conference, pp. 481–488. Cited by: §2.1.
  23. P. R. Lorenzo, J. Nalepa, L. S. Ramos and J. R. Pastor (2017) Hyper-parameter selection in deep neural networks using parallel particle swarm optimization. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 1864–1871. Cited by: §2.1.
  24. I. Loshchilov and F. Hutter (2016) CMA-es for hyperparameter optimization of deep neural networks. arXiv preprint arXiv:1604.07269. Cited by: §2.1.
  25. G. Melis, C. Dyer and P. Blunsom (2017) On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589. Cited by: §1.
  26. J. Mockus, V. Tiesis and A. Zilinskas (1978) The application of bayesian methods for seeking the extremum. Towards global optimization 2 (117-129), pp. 2. Cited by: §3.1.
  27. J. Močkus (1975) On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pp. 400–404. Cited by: §1, §1, §3.1, §3.
  28. D. C. Montgomery (2001) Design and analysis of experiments. John Wiley & Sons, New York. Cited by: §1, §1.
  29. J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: §2.2, §2.2, §5.2.
  30. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §3.3.
  31. S. Ren, K. He, R. Girshick and J. Sun (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. Cited by: §3.3, §3.3.
  32. J. Snoek, H. Larochelle and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §1, §2.1.
  33. J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat and R. Adams (2015) Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pp. 2171–2180. Cited by: §1, §2.1.
  34. J. Yang, J. Lu, D. Batra and D. Parikh (2017) A faster pytorch implementation of faster r-cnn. https://github.com/jwyang/faster-rcnn.pytorch. Cited by: §4.
  35. Z. Zhao, P. Zheng, S. Xu and X. Wu (2019) Object detection with deep learning: a review. IEEE transactions on neural networks and learning systems. Cited by: §1.
  36. Y. Zhong, J. Wang, J. Peng and L. Zhang (2018) Anchor box optimization for object detection. arXiv preprint arXiv:1812.00469. Cited by: §2.2.