Black-Box Optimization of Object Detector Scales
Abstract
Object detectors have improved considerably in recent years through the use of advanced CNN architectures. However, many detector hyperparameters are generally tuned manually, or left at the values set by the detector authors. Automatic hyperparameter optimization has not been explored as a way to improve the hyperparameters of CNN-based object detectors. In this work, we propose the use of black-box optimization methods, namely Bayesian Optimization, SMAC, and CMA-ES, to tune the prior/default box scales in Faster R-CNN and SSD.
We show that by tuning the input image size and prior box anchor scales, mAP increases by 2% for Faster R-CNN on PASCAL VOC 2007, and by 3% for SSD. On the COCO dataset with SSD, mAP improves for medium and large objects, but decreases by 1% for small objects. We also perform a regression analysis to find the most significant hyperparameters to tune.
Keywords:
Object Detection, Scale Tuning, Black-Box Optimization, Hyperparameter Tuning
1 Introduction
Object detection deals with classifying and localizing objects of interest in a given image. In recent years, a great deal of research has been done in object detection. Moreover, it has multiple application domains such as autonomous cars, anomaly detection, medical image analysis, and video surveillance. Advancements in Convolutional Neural Networks (CNNs) have taken deep-learning-based object detection a step forward, as they have proved to outperform traditional computer vision methods on benchmark datasets like MS COCO [19] and PASCAL VOC [6].
The ability of CNN architectures to represent high-level image features is one of the reasons for the remarkable performance of state-of-the-art object detectors [35]. However, performance depends heavily on the selection of various hyperparameters that guide and control the learning process. Every object detection method has several hyperparameters, such as the input image dimensions, the sizes and scales of the prior/default anchor boxes, the multi-task loss weights, and the number of output proposals, in addition to conventional neural network hyperparameters like the learning rate, momentum, and decay rate. The right choice of hyperparameters is essential because it plays a significant role in the model's performance [7].
Hyperparameter tuning is challenging, as one needs to choose the right settings from a high-dimensional search space efficiently. Domain experts have insight into setting hyperparameters, and they choose values after many trial-and-error experiments. Furthermore, hyperparameters depend on the dataset: values that work well for one dataset may not provide the same performance on a different one [13]. Hyperparameter diversity is also problematic, as hyperparameters can be binary, categorical, continuous, and conditional [13].
The current growth of the machine learning field has created a need to automate this laborious process and avoid human intervention. Automated machine learning (AutoML) [14] is an emerging field which aims to automate the entire machine learning process. Besides AutoML, black-box optimization methods can also be applied to hyperparameter optimization. The most basic methods are grid search [28] and random search [1]. However, both take a considerable amount of time and are computationally expensive.
Guided search can reduce the computational cost and the time taken to find the right set of hyperparameters. Bayesian optimization (BO) [27] is a guided method, as it takes prior information into account. It has been shown to achieve better results with fewer computations than grid [28] and random search [1] in image classification, speech recognition, and natural language modeling [32][33][25][5]. Population-based approaches, namely Genetic Algorithms, Genetic Programming, Evolution Strategies, Evolutionary Programming, and Particle Swarm Optimization, have also shown remarkable results [4].
All the previously discussed methods have not been applied to tuning object detection hyperparameters. These are generally tuned manually by trial and error by the community, or used without tuning with the values defined by their authors, which might be suboptimal for a specific dataset.
This work studies the applicability of black-box optimization methods such as Bayesian Optimization [27] and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [11] for tuning object detection hyperparameters, in particular the anchor/default box scales. This allows the detector to be tuned specifically to a particular dataset, instead of being hand-tuned and producing suboptimal performance.
To validate the proposed approach, black-box hyperparameter optimization methods are used to optimize the Single Shot MultiBox Detector (SSD) [21] and Faster R-CNN [30] on a variety of datasets to achieve the best performance. We find that black-box optimization is able to improve mAP and achieves better results than the hand-tuned configurations in most cases.
The contributions of this work are: we propose the use of black-box optimization methods to tune the prior boxes of Faster R-CNN and the default boxes of SSD. We show that by using these methods, performance in terms of mAP on PASCAL VOC and MS COCO increases by around 1-3%. We also show that the scales learned with black-box optimization transfer from PASCAL VOC 2007 to VOC 2012, again with an mAP improvement, and we perform a regression analysis to find out which hyperparameters are the most important to tune.
It should be noted that the objective of this work is not to improve on or beat state-of-the-art detectors across many datasets, but to show the importance of automatic hyperparameter tuning for object detection, in particular tuning the scales and input image size automatically, without a manual process. This aim is important for real use cases that apply state-of-the-art object detectors to novel datasets, especially for non-experts in computer vision.
2 Related Work
2.1 Black-Box Optimization in Deep Learning
Black-box optimization methods are widely used for tuning the hyperparameters of deep learning algorithms. Bayesian optimization with Gaussian processes (BOGP) was first used to optimize the hyperparameters of an image recognition deep learning architecture on CIFAR-10 [16] and achieved a 3% improvement over the state of the art in 2012 [32]. A newer approach in Bayesian optimization, called Deep Networks for Global Optimization (DNGO) [33], uses neural networks instead of Gaussian processes (GPs) as the surrogate model for fitting distributions over the objective function. Bayesian optimization with DNGO also supports parallel hyperparameter optimization. DNGO was used to tune the hyperparameters of various deep learning problems such as image classification and image caption generation.
In [24], CMA-ES is compared with Bayesian optimization for hyperparameter optimization of deep convolutional neural networks on the MNIST classification problem. The Particle Swarm Optimization (PSO) algorithm has also proved beneficial for hyperparameter optimization: in [22] and [23], PSO is used to optimize the hyperparameters of Deep Neural Networks (DNNs) in parallel and quickly on the CIFAR-10 dataset. [7] discusses various notable strategies in hyperparameter optimization. However, these methods have not been explored much for tuning the hyperparameters specific to deep-learning-based object detectors. In this work, we focus on the automatic tuning of object detector hyperparameters using black-box optimization methods.
2.2 Object Detection Hyperparameter Selection
The design of prior/default anchor boxes is a major challenge in object detection. The design process depends entirely on the sizes and ratios of the objects to be detected in a particular dataset. [36] proposed an approach to dynamically adapt the design of anchor boxes using the gradients of the loss function. This anchor box optimization method was integrated into YOLOv2 [29] and obtained a 1% mAP gain on the MS COCO [19] and PASCAL VOC [6] datasets.
In YOLOv2 [29], k-means clustering is used to determine the prior anchor boxes. The ground-truth bounding boxes in the training set are clustered based on Intersection over Union (IoU) scores instead of the conventional Euclidean distance, since the latter produces larger errors for bigger boxes than for smaller ones. The cluster centroids are then used as the anchor boxes. Our approach has an advantage over these methods as it is not constrained to anchor boxes and can include other object detection hyperparameters.
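The IoU-based clustering described above can be sketched as follows; the (width, height) box representation and the toy usage are our own illustration, not the YOLOv2 code:

```python
import numpy as np

def iou_wh(box, clusters):
    # IoU between one (w, h) box and each (w, h) centroid, with all boxes
    # assumed to share the same top-left corner (only the shape matters).
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_iou(boxes, k, iters=100, seed=0):
    # Standard k-means, but with distance d = 1 - IoU instead of Euclidean.
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each box to the centroid with the highest IoU
        assign = np.array([np.argmax(iou_wh(b, clusters)) for b in boxes])
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else clusters[j] for j in range(k)])
        if np.allclose(new, clusters):
            break
        clusters = new
    return clusters
```

Running `kmeans_iou` on the training-set box shapes yields centroids that can be used directly as anchor priors.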
3 Proposed Approach
In our experiments we use Bayesian optimization and CMA-ES for tuning the object detection hyperparameters. In this section we briefly discuss Bayesian Optimization [27] and CMA-ES [11].
3.1 Bayesian Optimization
Bayesian optimization (BO) [27] is an efficient and effective method for global optimization problems. Over the years, Bayesian optimization has also evolved into a prominent solution for hyperparameter optimization. In Bayesian optimization, a random set of hyperparameter configurations is evaluated initially, and a probabilistic surrogate model is fit to this data D. The surrogate model is used by an acquisition function to compute a utility score for candidate hyperparameter configurations. The configuration with the best score is then evaluated on the actual objective function, and the evaluation result is used to update the surrogate model. The optimization process iterates until a computational budget is exhausted.
In a nutshell, the two main components of Bayesian optimization are the probabilistic surrogate model and the acquisition function. The surrogate model is a Bayesian approximation of the actual objective function, used to derive samples efficiently. The acquisition function encodes a trade-off between exploration and exploitation. We chose Expected Improvement (EI) [26] over other acquisition functions like Probability of Improvement (PI) [17] and the Gaussian process upper confidence bound (GP-UCB) [15], as EI has been shown to have the best convergence rates [2].
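A minimal sketch of this loop, using a GP surrogate and EI over a 1-D toy objective that stands in for the expensive "train the detector and return its negated mAP" evaluation; the kernel choice and the random candidate-pool acquisition maximization are our simplifications, not the pipeline used in the experiments:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Toy stand-in for "train the detector and return -mAP" (minimised).
    return np.sin(3 * x) + 0.5 * x ** 2 - 0.7 * x

def expected_improvement(X, gp, y_best, xi=0.01):
    # EI for minimisation: expected improvement over the incumbent y_best.
    mu, sigma = gp.predict(X, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    imp = y_best - mu - xi
    z = imp / sigma
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(5, 1))            # initial random design
y = np.array([objective(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(20):                            # evaluation budget
    gp.fit(X, y)                               # refit the surrogate
    cand = rng.uniform(-2, 2, size=(256, 1))   # random candidate pool
    ei = expected_improvement(cand, gp, y.min())
    x_next = cand[np.argmax(ei)]               # maximise the acquisition
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

x_best = X[np.argmin(y)][0]                    # incumbent after the budget
```

In the hyperparameter-tuning setting, each call to `objective` would be one full detector training and evaluation run.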
In our experiments, we use Gaussian Process (GP) based BO, and Random-Forest-based BO, which is called Sequential Model-based Algorithm Configuration (SMAC) [12].
Figure 1(a) shows the flowchart of the BO pipeline for object detection used in our experiments. We use the SMAC3 implementation.
3.2 Covariance Matrix Adaptation Evolution Strategy
The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [11] is also used for continuous hyperparameter tuning. CMA-ES is an evolutionary algorithm for derivative-free optimization of non-linear, non-convex continuous objective functions. The main advantage of CMA-ES over other evolutionary algorithms is that it behaves well even with small population sizes.
In a nutshell, CMA-ES generates samples from a multivariate normal distribution and evaluates them on the objective function to obtain a fitness value for every sample. Based on the fitness of the samples, the multivariate normal distribution is adjusted to generate new samples in the next generation (iteration). This process is repeated until a good enough sample is found or the computation budget is exhausted.
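The sample/evaluate/adapt loop can be sketched as below; this is a deliberately simplified evolution strategy with only a rank-mu covariance update (no evolution paths or step-size adaptation), not the full CMA-ES of [11] or the pycma library:

```python
import numpy as np

def cma_es_sketch(fitness, x0, sigma=0.3, popsize=9, generations=25, seed=0):
    # Simplified (mu/mu_w, lambda) ES with a rank-mu covariance update only;
    # evolution paths and step-size control (CSA) are omitted for brevity.
    rng = np.random.default_rng(seed)
    n = len(x0)
    mean = np.array(x0, dtype=float)
    C = np.eye(n)                                  # covariance matrix
    mu = popsize // 2                              # number of parents kept
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                   # recombination weights
    c_cov = 0.3                                    # covariance learning rate
    best_x, best_f = None, np.inf
    for _ in range(generations):
        A = np.linalg.cholesky(C)
        steps = rng.standard_normal((popsize, n)) @ A.T
        pop = mean + sigma * steps                 # sample one generation
        fits = np.array([fitness(x) for x in pop])
        order = np.argsort(fits)                   # minimisation
        if fits[order[0]] < best_f:
            best_f, best_x = fits[order[0]], pop[order[0]]
        sel = steps[order[:mu]]                    # steps of the best parents
        mean = mean + sigma * (w @ sel)            # move the distribution mean
        C = (1 - c_cov) * C + c_cov * (sel.T * w) @ sel  # rank-mu update
    return best_x, best_f
```

In the paper's setting, `fitness` would train the detector with the sampled scales and return the negated mAP, with the `popsize` evaluations of each generation run in parallel.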
Figure 1(b) shows the flowchart of the CMA-ES pipeline for object detection used in our experiments. We use the pycma implementation.
3.3 Hyperparameter Space Design
In this section we describe the hyperparameter space design for the two selected object detection models.
We chose these two detectors since they perform detection using different approaches, which helps validate the generality of our approach. Specifically, we chose SSD and Faster R-CNN, which are one-stage and two-stage detectors, respectively. There are more recent detectors based on these designs, but we evaluate the original versions so as to separate the effect of other improvements from that of hyperparameter tuning.
Faster R-CNN [30][31] uses a Region Proposal Network (RPN) for identifying regions of interest, replacing the selective search algorithm of R-CNN [8] and Fast R-CNN [9]. The RPN is more computationally efficient and also has better detection performance. In Faster R-CNN, the image is fed into a pretrained CNN to produce a convolutional feature map. This feature map is then used by the RPN to compute region proposals. As in Fast R-CNN, these region proposals are processed by RoI pooling and fed into fully connected layers for object classification.
Region proposals are obtained by sliding a window over the convolutional feature map. A set of k anchor boxes with different scales and ratios is used at each feature map location, so in total there are k × H × W anchors for a feature map of height H and width W. The original implementation of this model uses nine anchor boxes, with three ratios (0.5, 1, 2) and three scales (128, 256, and 512 pixels). Figure 3(a) shows the 9 anchor boxes at one feature map location. The authors of Faster R-CNN do not mention how the anchor box parameters should be set. However, they perform an ablation study that varies the number of scales and ratios, indicating that 9 anchor boxes seem to be optimal [31]. The performance of the detector is sensitive to the input image size and to the scales and ratios of the anchor boxes. The input image size is defined as the pixel size of an image's shortest side.
For this detector we consider three sets of hyperparameters: the input image size, and the anchor box scales and ratios. In order to reduce the number of assumptions, we take all hyperparameters to be continuous, with a given transformation into the original parameters of Faster R-CNN, as specified in Table 1.
Hyperparam | Scaled Value | Type | Pixel Value
Input Image Size | 0.6 | C | 600
Anchor Ratio 1 | 0.25 | C | 0.5
Anchor Ratio 2 | 0.5 | C | 1
Anchor Ratio 3 | 1 | C | 2
Anchor Scale 1 | 0.25 | C | 128
Anchor Scale 2 | 0.5 | C | 256
Anchor Scale 3 | 1 | C | 512
(Type C denotes a continuous hyperparameter.)
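The anchor generation implied by Table 1 can be sketched as follows. The scaled-to-pixel transforms (image size ×1000, ratios ×2, scales ×512) are inferred from the scaled/pixel value pairs in the table, and the rounding is illustrative, not the exact arithmetic of the reference implementation:

```python
import math

def to_detector_params(size, ratios, scales):
    # Transforms inferred from Table 1's (scaled value, pixel value) pairs:
    # image size x1000, ratios x2, scales x512.
    return size * 1000, [r * 2 for r in ratios], [s * 512 for s in scales]

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # One (width, height) box per scale/ratio pair at a feature-map location,
    # with area approximately scale**2 and height/width == ratio.
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            anchors.append((round(w), round(w * r)))
    return anchors

size, ratios, scales = to_detector_params(0.6, [0.25, 0.5, 1], [0.25, 0.5, 1])
anchors = make_anchors(tuple(scales), tuple(ratios))
```

With the default scaled values this reproduces the 3 × 3 = 9 anchors of the original configuration.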
The Single Shot MultiBox Detector (SSD) [21] is a single-stage object detector. This architecture uses feature maps from the VGG backbone network together with additional convolutional layers, each associated with different anchor scales and ratios. Figure 7 shows the prior anchor boxes with varying scales and ratios on each convolutional layer.
Each feature map location predicts the offsets relative to multiple prior anchor boxes, along with the confidences for each class. The loss function is computed by matching the prior anchor boxes to the ground-truth boxes using the IoU score. Hence, designing the prior box scales and ratios is essential for a good alignment of the prior boxes at each position of the feature map. The design of the prior boxes depends on the sizes and ratios of the objects in the dataset. In SSD, the scales are determined using Equation 1:

s_k = s_min + ((s_max − s_min) / (m − 1)) (k − 1),  k ∈ [1, m]    (1)

where s_k represents the scale at feature map k and m denotes the number of feature maps. The minimum scale is s_min = 0.2 and the maximum scale is s_max = 0.9. The conv4_3 prior box scale is fixed to 0.1. There are four non-square anchor ratios (2, 3, 1/2, 1/3) and one square ratio (1) for each scale. There is also an additional box with anchor ratio 1 whose scale is computed as s′_k = √(s_k s_{k+1}). The width and height of the default boxes are computed as:

w_k = s_k √(a_r),   h_k = s_k / √(a_r)    (2)

Some feature maps skip the 3 and 1/3 anchor ratios. When training SSD on the MS COCO dataset, the authors reduce the minimum scale and the conv4_3 prior box scale to 0.15 and 0.07, respectively, to tackle the smaller objects in MS COCO.
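As an illustration, the default scales in the Default row of Table 6 can be reproduced from Equation 1 with s_min = 0.2 and s_max = 0.9. The two-decimal flooring of the step mirrors the reference (Caffe) implementation's percent arithmetic and is our assumption about how the 0.17 spacing arises:

```python
import math

def ssd_default_scales(s_min=0.2, s_max=0.9, num_maps=6, conv4_3=0.1):
    # Step between consecutive scales, floored at two decimals as in the
    # reference implementation's percent arithmetic: (90 - 20) / 4 -> 17.
    step = math.floor(100 * (s_max - s_min) / (num_maps - 2)) / 100
    # conv4_3 keeps its own small fixed scale; the extra seventh value feeds
    # the s'_k = sqrt(s_k * s_{k+1}) box of the last feature map.
    return [conv4_3] + [round(s_min + k * step, 2) for k in range(num_maps)]

def default_box_dims(scale, ratio):
    # Equation 2: width and height of a default box for aspect ratio a_r.
    return scale * math.sqrt(ratio), scale / math.sqrt(ratio)

print(ssd_default_scales())  # -> [0.1, 0.2, 0.37, 0.54, 0.71, 0.88, 1.05]
```

These seven values are exactly the hyperparameters that the black-box optimizers tune in our SSD experiments.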
For this detector, we decided to tune the scales and treat them as continuous hyperparameters, as shown in Table 2. All other hyperparameter values are kept as in the original implementation.
Hyperparam | Type | Feature map
Scale 0 (s_0) | C | Conv4_3
Scale 1 (s_1) | C | Conv7 (fc7)
Scale 2 (s_2) | C | Conv8_2
Scale 3 (s_3) | C | Conv9_2
Scale 4 (s_4) | C | Conv10_2
Scale 5 (s_5) | C | Conv11_2
Scale 6 (s_6) | C | N/A
(Type C denotes a continuous hyperparameter.)
In the following two sections we show our main experimental results. We evaluate Faster R-CNN and SSD on the PASCAL VOC 2007/2012 and MS COCO datasets, and learn hyperparameters using CMA-ES and Bayesian optimization.
4 Experiments on Faster R-CNN
The training and validation splits of PASCAL VOC 2007 are used for training, and the fitness score (mAP) is computed on the test split of PASCAL VOC 2007. Table 1 in the supplementary material shows the CMA-ES parameters used in this experiment. Training and evaluation of Faster R-CNN for all hyperparameter configurations of a particular generation were done in parallel. We use a publicly available PyTorch implementation of Faster R-CNN.
Figure 3 shows the performance of Faster R-CNN optimized using CMA-ES for 25 generations (225 evaluations) on the PASCAL VOC 2007 dataset. This setup achieves an mAP of 71.78%, which is 1.79% more than the default hyperparameters (69.9% mAP). Figure 4 compares the prior boxes of the original Faster R-CNN implementation with the prior boxes found using CMA-ES. Table 4 shows an overall comparison of all our results along with the associated anchor scales and aspect ratios.
4.1 Regression Analysis
An important question is to determine which are the best scales to tune, as some scales might be more important than others. To answer this question, we performed regression analysis between the object detector performance, as measured by mAP, and the individual hyperparameters. Regression analysis can explain the relationship between a dependent variable and many independent variables, and can also explain the significance of the independent variables.
For this we normalize the mAP and all hyperparameters to the [0, 1] range and train a linear regression model. The coefficients of this model can then be interpreted as importance scores, telling us which hyperparameters matter most. We measure the goodness of fit using the coefficient of determination R², which indicates the amount of variance in the dependent variable (the mAP) explained by the independent variables.
Table 3 shows the results of the regression analysis for the Faster R-CNN hyperparameters. The highest coefficient is associated with the input image size, indicating that it has the strongest effect on mAP and that tuning it is very important. Anchor scales one and two have significant coefficient values, indicating their importance, which we believe makes sense as they have the biggest receptive field sizes. Adjusting the ratios seems to have little impact on the overall mAP.
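The procedure can be sketched on synthetic data as follows; the coefficients and noise here are invented for illustration, whereas in the paper the (hyperparameters, mAP) pairs come from the optimization runs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical (hyperparameters, mAP) pairs standing in for the
# configurations evaluated during optimisation: 7 hyperparameters in [0, 1],
# where only the first two actually influence the score.
rng = np.random.default_rng(0)
H = rng.uniform(size=(225, 7))
map_scores = 0.6 * H[:, 0] + 0.2 * H[:, 1] + 0.02 * rng.standard_normal(225)

# Min-max normalise the target, fit, and read off importances.
y = (map_scores - map_scores.min()) / (map_scores.max() - map_scores.min())
reg = LinearRegression().fit(H, y)
importance = np.abs(reg.coef_)     # |coefficient| per hyperparameter
r2 = reg.score(H, y)               # goodness of fit (R^2)
```

Here `importance` correctly singles out the first hyperparameter as dominant, mirroring how the tables below are read.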
Method | R² | Image Size | Scale 1 | Scale 2 | Scale 3 | Ratio 1 | Ratio 2 | Ratio 3
CMA-ES | 0.52 | 0.67 | 0.25 | 0.19 | 0.02 | 0.02 | 0.01 | 0.07
 | 0.21 | 0.32 | 0.15 | 0.01 | 0.01 | 0.02 | 0.08 | 
Settings | Aspect Ratios | mAP (%)
Default | 0.5, 1, 2 | 69.9
BOGP | 0.941, 1.155, 2.015 | 71.37
SMAC | 0.4, 0.5, 1 | 71.56
CMA-ES | 0.259, 0.964, 1.741 | 71.78
5 Experiments on SSD
5.1 Prior Box Optimization using CMA-ES
The quality of optimization using CMA-ES depends highly on setting proper initial parameters. We set the population size for PASCAL VOC 2007 based on [11]. For MS COCO, we use a smaller population size due to our GPU resource availability. The CMA-ES parameters we use are shown in Table 2 of the supplementary material. We set the initial scales according to Equation 1. Mean average precision (mAP) is used as the fitness score for the individuals. We use a high-quality implementation [18] of SSD.
Anchor Scales on PASCAL VOC 2007. We use the training and validation splits of PASCAL VOC 2007 for training, and the fitness score (mAP) is computed on the test split. We evaluate hyperparameter configurations in parallel by training SSD on nine different Nvidia Titan XP GPUs. The training and evaluation of a single hyperparameter configuration takes around 24.5 hours; hence, one generation can be completed in a single day when evaluated in parallel.
Figure 4(a) shows the performance of SSD optimized using CMA-ES for 25 generations (225 evaluations). We can clearly see performance improvements over the generations. The median mAP also increases with each generation, reflecting the automatic tuning of the anchor scales. Figure 6 shows a box plot of each anchor scale's values over the generations, illustrating the convergence of each anchor scale towards its best value. The best anchor scale configuration found using CMA-ES achieves a mean average precision of 71.55%, which is 3.55% higher than the default scales (68.0% mAP).
Anchor Scales on MS COCO. We also evaluate CMA-ES on MS COCO. For this dataset, the authors of SSD reduce the minimum scale and the conv4_3 prior box scale to 0.15 and 0.07, respectively, to tackle the smaller objects in MS COCO. Instead of using these modified scales as the initial scales for CMA-ES, we used the regularly spaced scales according to Equation 1 and the default constant box scale, in order to analyze the performance of CMA-ES on a dataset with many smaller objects. We use the trainval35k split of MS COCO 2014 for training, and the fitness score (mAP) is computed on the minival split. The training and evaluation of a single hyperparameter configuration took around 44 hours.
Figure 4(a) shows the performance of SSD optimized using CMA-ES for 23 generations (92 evaluations). We clearly see the increase in fitness over the generations from tuning the anchor scales with CMA-ES. The anchor scales found using CMA-ES adapted well, as it was able to find smaller scales, appropriate for a dataset with many small objects. However, the anchor scales found by CMA-ES achieve 24.6% mAP, which is 0.5% less than the original anchor scales. The apparent reason is the small population size we used due to GPU resource constraints. With a larger population size, CMA-ES should be able to find better anchor scales, as it kept improving the scales over the generations even with a small population size. Table 5 shows the detailed mAP results, with improvements in AP for medium and large object sizes.
Settings | AP (IoU 0.5:0.95) | AP (IoU 0.5) | AP (IoU 0.75) | AP (small) | AP (medium) | AP (large)
Default | 25.1 | 43.1 | 25.8 | 6.6 | 25.9 | 41.4
CMA-ES | 24.6 | 42.8 | 25.1 | 4.8 | 26.3 | 42.9
5.2 Prior Box Optimization using Bayesian Optimization
We also experiment with Bayesian optimization to tune the prior anchor box scales of SSD on PASCAL VOC 2007. We use the same anchor scale hyperparameter space shown in Table 2. Expected Improvement (EI) is used as the acquisition function in both BOGP and SMAC. The experiment was carried out sequentially with a computation budget of 75 function evaluations. Our results are presented in Figure 7. Though both BOGP and SMAC achieve almost the same best mAP, BOGP took fewer function evaluations than both SMAC and CMA-ES. However, after finding the best scales, many of the anchor scale configurations generated by BOGP are not satisfactory, as indicated by the scattering in the plot. This is because the optimization algorithm may increase its exploration of the search space in pursuit of better hyperparameters.
A summary of our results is shown in Table 6, including the associated anchor scales. We include a baseline where we obtained anchor scales using k-means on the training set, similarly to YOLOv2 [29]. There was only a small improvement from using the k-means scales, and they are outperformed by all hyperparameter tuning methods.
Method | Scale 0 | Scale 1 | Scale 2 | Scale 3 | Scale 4 | Scale 5 | Scale 6 | mAP (%)
Default | 0.1 | 0.2 | 0.37 | 0.54 | 0.71 | 0.88 | 1.05 | 68.0
BOGP | 0.0973 | 0.2132 | 0.3689 | 0.5224 | 0.5882 | 1.038 | 1.051 | 71.55
SMAC | 0.1037 | 0.1935 | 0.3707 | 0.5542 | 0.6934 | 0.9481 | 1.0481 | 71.53
CMA-ES | 0.0908 | 0.1892 | 0.3276 | 0.5240 | 0.6865 | 0.8390 | 0.8894 | 71.50
k-Means | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 68.35
5.3 Generalization Analysis
So far we have optimized the object detector scales based on their evaluation score on the PASCAL VOC 2007 test set. Hyperparameter optimization may also overfit the evaluation dataset [3]. To verify the generalization of the learned scales, we tested the scales learned on PASCAL VOC 2007 on a different dataset, namely the PASCAL VOC 2012 validation set. As seen in Table 7, the optimized models are still able to perform better than the hand-tuned scales on the PASCAL VOC 2012 validation set. Table 7 also indicates that SMAC generalizes better than the other methods.
Dataset | Default | BOGP | SMAC | CMA-ES | k-Means
VOC 2007 Test | 68.0 | 71.55 | 71.53 | 71.50 | 68.35
VOC 2012 Val | 61.74 | 63.54 | 64.43 | 63.63 | 61.91
5.4 Regression Analysis
Similarly to Section 4.1, we performed regression analysis to find the importance of each anchor scale, using the absolute value of the coefficients to identify the most important hyperparameters. Table 8 shows our results. For BOGP and SMAC, anchor scales zero and one have the largest coefficients. Moreover, both BOGP and SMAC agree that anchor scale 0 is a very important hyperparameter, and the high R² score also indicates a good fit, validating that this scale is probably the most important to tune.
The higher R² score of the SMAC regression indicates that the anchor scales explain the mAP well. For CMA-ES, anchor scales two, three, and four have the largest coefficients, but the low R² score makes this relation not significant.
It is interesting that how well the scales explain the mAP varies considerably across hyperparameter optimization methods. One would imagine that there should be no such relation, but each method guides sampling over the hyperparameter space in a different way, which might introduce certain biases.
Method | R² | Scale 0 | Scale 1 | Scale 2 | Scale 3 | Scale 4 | Scale 5 | Scale 6
CMA-ES | 0.15 | 0.15 | 0.04 | 0.38 | 0.50 | 0.28 | 0.03 | 0.05
BOGP | 0.66 | 0.54 | 0.31 | 0.12 | 0.04 | 0.02 | 0.00 | 0.22
SMAC | 0.82 | 0.82 | 0.17 | 0.02 | 0.06 | 0.12 | 0.09 | 0.08
6 Conclusions and Future Work
In this paper, we demonstrated the performance of black-box optimization for object detection hyperparameters, in particular the default box/anchor scales, on PASCAL VOC 2007 and MS COCO. From our experimental results, we can conclude that black-box optimization produces an improvement in mAP by adjusting the anchor/prior box scales of Faster R-CNN and SSD on the PASCAL VOC 2007 and MS COCO datasets, and can generally achieve better results than the hand-tuned configurations. On MS COCO we observed decreased performance for small objects, with only medium and large objects seeing an improvement in mAP.
GP-based Bayesian optimization obtains good performance with fewer function evaluations. The CMA-ES results clearly show performance improving with an increasing number of generations. BOGP and SMAC could be studied further with different initial designs.
We also evaluated the transferability of the scales learned with SSD on PASCAL VOC 2007 by evaluating them on the PASCAL VOC 2012 validation set, which also shows an improvement in mAP. We believe this shows that learning scales using these methods is a good alternative to manual scale design by a human. It is possible that the scales currently used by object detectors are suboptimal, in the sense that each dataset probably has a different set of optimal scales. Using automatic tuning methods should help practitioners tune an object detector for a particular dataset with minimal effort.
Finally, for each combination we performed a simple regression analysis to find out the importance of each hyperparameter to the overall mAP. We find that for Faster RCNN, the biggest factor is the input image size, while for SSD the first scales contribute more to increasing mAP. This information is valuable for future research, as efforts can be concentrated on the most important hyperparameters, decreasing search time and computational costs.
Broader research is needed on larger hyperparameter spaces, including other object detection hyperparameters that were not used in our experiments. For example, it would be very interesting to tune not only the scale values but also the number of scales, which may vary with the dataset, in particular for MS COCO as it contains many small objects. The multi-task loss weights are also generally not tuned, which could be a possible source of improvement. Multi-objective optimization can be used to choose hyperparameters that minimize prediction time while maximizing the task performance (higher mAP) of the object detector.
Appendix A Parameters of Optimization Methods
Dataset | Initial σ | Population size | Evaluations | Dimensions | Generations
PASCAL VOC 2007 | 0.3 | 6 | 150 | 7 | 25
Dataset | Initial σ | Population size | Evaluations | Dimensions | Generations
PASCAL VOC 2007 | 0.3 | 9 | 225 | 7 | 25
MS COCO | 0.3 | 4 | 92 | 7 | 23
Appendix B Additional Details of PASCAL VOC Results
In this section we provide the per-class average precision for SSD and Faster R-CNN with the different hyperparameter optimization methods. These results are available in Table 11 for SSD and in Table 12 for Faster R-CNN.
In both cases we see modest improvements in class AP over the default hyperparameters, of up to 4% absolute, for example for the horse class with Faster R-CNN.

Class  Default  BOGP  SMAC  CMA-ES  k-Means
aero  73.4  75.51  76.03  74.4  76.41  
bike  77.5  80.17  79.19  78.86  75.65  
bird  64.1  67.24  69.13  67.24  63.92  
boat  59.0  62.69  66.49  63.80  62.46  
bottle  38.9  41.24  41.88  41.45  39.13  
bus  75.2  82.72  80.02  79.20  76.19  
car  80.8  83.51  83.21  83.07  82.03  
cat  78.5  82.76  81.56  84.28  78.36  
chair  46.0  52.48  53.18  52.13  48.82  
cow  67.8  76.78  77.37  76.28  74.36  
dining table  69.2  69.50  70.50  69.02  63.32  
dog  76.6  78.66  79.31  79.64  77.28  
horse  82.1  81.91  82.83  83.74  78.70  
motor bike  77.0  80.28  80.17  80.11  76.10  
person  72.5  75.37  74.91  74.98  69.83  
potted plant  41.2  44.62  43.57  44.47  41.42  
sheep  64.2  68.05  66.97  70.84  66.48  
sofa  69.1  73.67  71.49  71.45  70.09  
train  78.0  83.34  82.37  83.95  77.39  
monitor  68.5  70.56  70.45  71.03  69.03  
mAP  68.0  71.55  71.37  71.50  68.35 

Class  Default  BOGP  SMAC
aero  70.00  73.71  71.6  
bike  80.6  80.66  79.4  
bird  70.1  69.91  68.5  
boat  57.3  56.19  56.90  
bottle  49.9  57.11  57.06  
bus  78.2  76.94  80.88  
car  80.4  82.94  84.42  
cat  82.0  82.46  83.61  
chair  52.2  50.75  50.85  
cow  75.3  79.96  78.63  
dining table  67.2  70.1  70.47  
dog  80.3  79.27  80.65  
horse  79.8  83.21  82.56  
motor bike  75.0  75.48  76.69  
person  76.3  77.27  77.01  
potted plant  39.1  43.23  44.38  
sheep  68.3  71.33  71.22  
sofa  67.3  67.34  66.71  
train  81.1  77.17  76.59  
monitor  67.6  72.45  73.13  
mAP  69.9  71.37  71.56 
We also provide details of the anchor scales obtained using automatic machine learning.
Figures 8, 9, and 10 show a comparison of the learned anchor boxes with the original anchor boxes in SSD. The changes are mostly subtle, suggesting that the original anchor boxes might be good enough for most use cases, but there are still large changes at the biggest scales (Figure 10).
Appendix C Additional Results on the Marine Debris Dataset
An important motivation behind this work is to apply hyperparameter optimization to real use cases that train object detectors on novel datasets. Considering this, we applied CMA-ES to tune the SSD anchor scales on a new marine debris dataset that we captured.
This dataset was created for the purpose of detecting and capturing debris with a marine surface vehicle. It is used to train object detectors, with a preference for a detector with high mAP that runs on a resource-constrained computer inside the surface vehicle. The object classes required for this task are not well represented in most object detection datasets, especially for marine scenes, so we decided to build our own dataset. The dataset is not yet public; it will be released in a future publication that we are preparing.
The images were collected and extracted from various online resources like TACO, the Open Images Dataset, and Flickr Creative Commons images, and some were captured using a surface vehicle in a water tank. Annotation labels from TACO and Open Images were reused and modified to suit our use case. We manually annotated the other images coming from Flickr. Figures 11 and 12 show some of the images from the Marine Debris Dataset.
The dataset consists of 2849 images for training and 1079 images for evaluation. We annotated the following classes with bounding boxes:
Marine Vehicles: Ship, Boat, One-person Vehicle
Humans: Swimmer
Marine Structures: Pier, Dock, Bridge, Miscellaneous Structure
Marine Animals: Marine Mammal
Debris: Bottle, Package, Miscellaneous Debris, Wooden Debris, Debris Patch
Dataset  Initial Step Size  Population Size  Total Evaluations  Dimensions  Generations
Marine Debris  0.3  8  112  7  14
C.1 Optimization of SSD Hyperparameters using CMAES on the Marine Debris Dataset
In this experiment, the prior anchor scales of SSD are optimized using CMAES. We use the same initial parameters and hyperparameter space as in the SSD experiments on PASCAL VOC. Table 13 shows the CMAES parameters used in this experiment.
The hyperparameter configurations generated in each generation are evaluated in parallel by training SSD on eight different Nvidia Titan XP GPUs. The model is first trained for 20000 steps, after which training continues with reduced learning rates for another 5000 steps. The batch size per training step is set to 32.
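The tuning loop can be sketched as follows. This is a simplified evolution-strategy loop in the spirit of CMAES, without the covariance and step-size adaptation of the real algorithm (for which the pycma library cited in the references would be used in practice). The population size of 8 matches the eight parallel GPU workers; the initial step size of 0.3 and 14 generations are illustrative assumptions, and `evaluate_map` is a synthetic surrogate standing in for training SSD and measuring mAP.

```python
import random

DEFAULT_SCALES = [0.1, 0.2, 0.37, 0.54, 0.71, 0.88, 1.05]
TARGET = [0.05, 0.3, 0.5, 0.4, 0.9, 0.7, 1.2]  # made-up optimum for the surrogate

def evaluate_map(scales):
    # Surrogate objective: higher when scales are close to the synthetic optimum.
    # In the real experiment this would train SSD and return the validation mAP.
    return -sum((s - t) ** 2 for s, t in zip(scales, TARGET))

def tune_scales(x0, sigma=0.3, popsize=8, generations=14, seed=0):
    rng = random.Random(seed)
    mean = list(x0)
    best_x, best_f = list(mean), evaluate_map(mean)
    for _ in range(generations):
        # Sample a population of candidate scale vectors around the current mean.
        pop = [[m + rng.gauss(0.0, sigma) for m in mean] for _ in range(popsize)]
        scored = sorted(pop, key=evaluate_map, reverse=True)
        elite = scored[: popsize // 2]
        # Recombine: move the mean to the average of the best half.
        mean = [sum(xs) / len(elite) for xs in zip(*elite)]
        sigma *= 0.9  # crude step-size decay instead of CMA's adaptation
        if evaluate_map(scored[0]) > best_f:
            best_x, best_f = scored[0], evaluate_map(scored[0])
    return best_x, best_f

best_scales, best_score = tune_scales(DEFAULT_SCALES)
```

With population 8 and 14 generations this performs 8 × 14 = 112 training runs, which is why parallel evaluation across GPUs matters in practice.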
Method  Anchor Scales  mAP
Default  0.1  0.2  0.37  0.54  0.71  0.88  1.05  51.47
CMAES  0.04811  0.2257  0.3536  0.5765  0.7667  0.8600  1.03711  54.39
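Since SSD anchor scales are fractions of the input image size, the scales in the table above can be read directly as anchor box sizes in pixels. Assuming a 300×300 input as in SSD300 (an assumption here; the table itself does not state the resolution), this small sketch converts both scale sets:

```python
# Convert SSD anchor scales (fractions of the input size) to pixel box sizes.
# Assumes a 300x300 input as in SSD300; scale lists are taken from the table above.

INPUT_SIZE = 300
default_scales = [0.1, 0.2, 0.37, 0.54, 0.71, 0.88, 1.05]
cmaes_scales = [0.04811, 0.2257, 0.3536, 0.5765, 0.7667, 0.8600, 1.03711]

def to_pixels(scales, input_size=INPUT_SIZE):
    """Square anchor side length in pixels for each scale."""
    return [round(s * input_size, 1) for s in scales]

print(to_pixels(default_scales))  # smallest default box: 30.0 px
print(to_pixels(cmaes_scales))    # CMAES shrinks the smallest box to ~14.4 px
```

Reading the scales this way makes the effect of tuning tangible: the first CMAES scale roughly halves the smallest anchor, which is consistent with the improvement on small objects discussed below.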
Figure 13 shows the optimization of SSD anchor scales over generations. The best anchor scales found using CMAES achieve an mAP of 54.39%, which is 2.92 points higher than the default scales. It is evident from Figure 14 that anchor scale 0 tuned by CMAES is smaller than the default anchor scale 0. There is also a significant change in anchor scale 4.
Moreover, the per-class average precision shown in Table 15 confirms that tuning the anchor scales improves the performance of all classes except the marine vehicle class.
Method  AP per class  mAP
Default  0.55  0.62  0.49  0.48  0.44  51.47
CMAES  0.58  0.65  0.51  0.47  0.49  54.39
Regression analysis. As in previous experiments, we perform a regression analysis to find the most important anchor scales. Table 16 shows that anchor scales zero and six have more impact on mAP than the other anchor scales.
Method  Coefficients
CMAES  0.4  0.37  0.03  0.18  0.02  0.03  0.14  0.41
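To make the regression analysis concrete: coefficients like those in Table 16 come from an ordinary least-squares fit of mAP against the sampled anchor-scale configurations. A stdlib-only sketch, using hypothetical data in place of the actual CMAES samples:

```python
# Ordinary least-squares fit of mAP against anchor-scale configurations.
# The data below are hypothetical; in the paper, X would hold the anchor scales
# sampled during optimization and y the resulting mAP values.

def ols_coefficients(X, y):
    """Solve the normal equations (A^T A) b = A^T y, with an intercept column,
    by Gaussian elimination with partial pivoting."""
    A = [[1.0] + list(row) for row in X]  # prepend intercept column
    n = len(A[0])
    AtA = [[sum(A[k][i] * A[k][j] for k in range(len(A))) for j in range(n)]
           for i in range(n)]
    Aty = [sum(A[k][i] * y[k] for k in range(len(A))) for i in range(n)]
    M = [row[:] + [b] for row, b in zip(AtA, Aty)]  # augmented system
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    beta = [0.0] * n
    for i in reversed(range(n)):
        beta[i] = (M[i][n] - sum(M[i][j] * beta[j] for j in range(i + 1, n))) / M[i][i]
    return beta  # beta[0] is the intercept; beta[1:] align with the scales

# Hypothetical example with two "scales": mAP depends strongly on the first one.
X = [[0.05, 0.2], [0.10, 0.2], [0.08, 0.3], [0.12, 0.4], [0.06, 0.1]]
y = [0.5 + 2.0 * a + 0.1 * b for a, b in X]
coeffs = ols_coefficients(X, y)
```

Larger absolute coefficients indicate scales whose variation explains more of the observed change in mAP, which is how the important scales are identified.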
Scales visualization. In Figures 14 and 15 we compare the scales learned on this dataset using CMAES with the original SSD scales. The differences are small but there are relevant changes, which are consistent with our regression analysis results in Table 16.
References
 (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305.
 (2011) Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research 12 (Oct), pp. 2879–2904.
 (2010) On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research 11 (Jul), pp. 2079–2107.
 (2007) Evolutionary algorithms for solving multi-objective problems. Vol. 5, Springer.
 (2013) Improving deep neural networks for LVCSR using rectified linear units and dropout. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8609–8613.
 (2010) The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
 (2019) Hyperparameter optimization. In Automatic Machine Learning: Methods, Systems, Challenges (Hutter et al., Eds.), Springer, pp. 3–38.
 (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.
 (2015) Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
 (2019) CMA-ES/pycma on GitHub. Zenodo, DOI: 10.5281/zenodo.2559634.
 (2016) The CMA evolution strategy: a tutorial. arXiv preprint arXiv:1604.00772.
 (2011) Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pp. 507–523.
 F. Hutter, L. Kotthoff and J. Vanschoren (Eds.) (2018) Automated Machine Learning: Methods, Systems, Challenges. Springer. In press, available at http://automl.org/book.
 F. Hutter, L. Kotthoff and J. Vanschoren (Eds.) (2019) Automatic Machine Learning: Methods, Systems, Challenges. Springer.
 (2012) On Bayesian upper confidence bounds for bandit problems. In Artificial Intelligence and Statistics, pp. 592–600.
 (2009) Learning multiple layers of features from tiny images. Technical report.
 (1964) A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering 86 (1), pp. 97–106.
 (2018) High quality, fast, modular reference implementation of SSD in PyTorch. https://github.com/lufficc/SSD
 (2014) Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision, pp. 740–755.
 (2017) SMAC v3: algorithm configuration in Python. https://github.com/automl/SMAC3
 (2016) SSD: Single Shot MultiBox Detector. In European Conference on Computer Vision, pp. 21–37.
 (2017) Particle swarm optimization for hyper-parameter selection in deep neural networks. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 481–488.
 (2017) Hyper-parameter selection in deep neural networks using parallel particle swarm optimization. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 1864–1871.
 (2016) CMA-ES for hyperparameter optimization of deep neural networks. arXiv preprint arXiv:1604.07269.
 (2017) On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589.
 (1978) The application of Bayesian methods for seeking the extremum. Towards Global Optimization 2, pp. 117–129.
 (1975) On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pp. 400–404.
 (2001) Design and Analysis of Experiments. John Wiley & Sons, New York.
 (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271.
 (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
 (2016) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149.
 (2012) Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959.
 (2015) Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pp. 2171–2180.
 (2017) A faster PyTorch implementation of Faster R-CNN. https://github.com/jwyang/faster-rcnn.pytorch
 (2019) Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems.
 (2018) Anchor box optimization for object detection. arXiv preprint arXiv:1812.00469.