Ellipse R-CNN: Learning to Infer Elliptical Object from Clustering and Occlusion
Abstract
Images of heavily occluded objects in cluttered scenes, such as fruit clusters in trees, are hard to segment. To further retrieve the 3D size and 6D pose of each individual object in such cases, bounding boxes are not reliable from multiple views since only a small portion of the object's geometry is captured. We introduce the first CNN-based ellipse detector, called Ellipse R-CNN, to represent and infer occluded objects as ellipses. We first propose a robust and compact ellipse regression based on the Mask R-CNN architecture for elliptical object detection. Our method can infer the parameters of multiple elliptical objects even when they are occluded by other neighboring objects. For better occlusion handling, we exploit refined feature regions for the regression stage, and integrate the U-Net structure for learning different occlusion patterns to compute the final detection score. The correctness of ellipse regression is validated through experiments performed on synthetic data of clustered ellipses. We further quantitatively and qualitatively demonstrate that our approach outperforms the state-of-the-art model (i.e., Mask R-CNN followed by ellipse fitting) and its three variants on both synthetic and real datasets of occluded and clustered elliptical objects.
I Introduction
Detection of ellipse-like shapes [1] has been widely used in various image processing tasks, for instance, face detection [2] and medical imaging diagnosis [3]. The above works all investigate the mathematical model of the ellipse based on segmented edges, contours, and curvatures [4] from the image to identify different ellipses. However, such traditional methods for ellipse fitting rely heavily on preprocessing (such as segmentation and grouping), and thus often fail to detect ellipsoidal objects from the image, especially in complex environments. For fruit detection in modern orchard settings, as an example, the edge information of clustered fruits is not salient [5] and is largely interfered with by nearby obstacles and background scenes, such as leaves and branches (see Fig. 1).
Adapting convolutional neural networks (CNNs) for object detection [6] and instance segmentation (e.g., Mask R-CNN [7]) to this canonical task is a promising way to extract object information. Directly fitting an ellipse on the output mask (i.e., Mask R-CNN in Fig. 1), however, fails to infer the entire object shape. Intuitively, there are two ways to modify the Mask R-CNN model for predicting ellipses: (1) adding a regression model right after the mask branch; (2) performing regression directly on the features from RoIAlign. As we show in the ablation study (see Fig. 6 and Fig. 9), these two variants both lose the ellipse orientation information, while our Ellipse R-CNN, by injecting the learned whole-object information (see Fig. 1), achieves the best performance, especially in occluded and clustered scenarios. Moreover, the detected ellipses can be further exploited for 3D localization and size estimation of such ellipsoidal objects, while bounding boxes are not reliable as 2D object representations due to insufficient geometric constraints from multiple views [8] (see Fig. 2).
Our goal is to accurately detect and represent the elliptical objects in a 2D image as the input for 3D localization, and successfully infer the whole information of each object in an occluded and clustered scenario. Our approach is motivated by two key ideas. First, common elliptical objects, such as apples, oranges, and peaches, can be modeled as ellipsoids in 3D, and their projections onto the 2D image should be ellipses. Second, the detection mechanism, just as in the human brain, should be able to retrieve whole elliptical objects by focusing on their partially visible boundary information so as to handle different occlusion patterns effectively.
Our main contributions are twofold:

We propose a robust and compact ellipse regression model that detects each individual elliptical object and parameterizes it as an ellipse. The proposed regression method is general and flexible enough to be applicable to any state-of-the-art detection model, in our case, a Mask R-CNN detector [7]. For better accuracy, the proposed feature regions before ellipse regression are refined by bounding-box estimation and feature padding. The correctness of the regression strategy is validated across different synthetic datasets containing well-separated and clustered ellipses, respectively. We further analyze the improvement of our ellipse regression in an ablation study using the FDDB dataset [10].

To better handle occlusion, we integrate the U-Net [11] structure into the detection model to generate decoded feature maps that contain retrieved hidden information. We further propose to learn various occlusion patterns such that the detection confidence score is computed by generalizing the occlusion information between the visible part and the whole estimated ellipse. In ablation experiments, we demonstrate that our approach indeed improves the detection performance compared to the Mask R-CNN baseline and its three variants on both synthetic and real datasets of occluded and clustered objects. Moreover, in heavily occluded settings, our approach achieves the best reported performance on these datasets.
To the best of our knowledge, this paper is the first work to develop a CNN-based model that detects objects as ellipses and predicts ellipse parameters in one shot directly from the image, and it is the first attempt to handle occlusion from the perspective of ellipse representation.
II Related Work
Since we build on the Mask R-CNN model as our base object detector to predict ellipse parameters in occluded cases, we review recent work on CNN-based object detectors, 3D object localization, and occlusion handling, respectively.
CNN-Based Object Detectors. Recent success in general object detection tasks on the Pascal [12], ImageNet [13], and MS COCO [14] datasets has been achieved by both single-shot [15, 16] and R-CNN [17, 6, 7] architectures. The single-shot methods formulate object detection as a single-stage regression problem to predict objects extremely fast. The R-CNN approaches, by integrating region proposal and classification, have greatly improved accuracy and are currently among the best performing detection paradigms.
Fruit detection, as a challenging example of elliptical object detection, has recently attracted intensive interest in machine vision and agricultural robotics [18, 19, 5]. Although recent works [20, 21], by tuning the head layers of R-CNN models, have shown good performance on well-separated fruits, the detection accuracy drops significantly as fruits cluster and occlude each other. The authors in [22] further report a comparative study of CNN-based models on various apple datasets. In modern orchards, fruit occlusions happen frequently due to environmental complexity [23], especially when fruits are occluded by neighboring leaves and branches. It is even more challenging to estimate the 3D size and the 6D pose (position and orientation) of each individual fruit in such occluded and clustered cases.
Object Localization from 2D Detection. Recent research has addressed size and pose estimation from object detectors by modeling objects as quadrics in 3D [9, 8], but the whole-object information in heavily occluded cases can hardly be retrieved from a bounding box that is derived from a small portion of the visible part (see Fig. 2). While some efforts have been made toward 3D fruit localization [24, 25, 26, 27] by utilizing standard mapping techniques [28, 29], none of them can perform object size estimation because of low-resolution 3D reconstructions. Moreover, all these works represent visible object parts using bounding boxes, which are not appropriate for further estimating the object size and pose due to the spatial ambiguities in the rectangular constraints on the ellipsoid. Specifically, fitting an inscribed ellipse to each bounding box [9] is not reasonable, since there could be infinitely many ellipse parameter solutions that all satisfy the bounding-box constraints [8], even from multiple views. We thus propose the ellipse representation by developing the Mask R-CNN architecture as the baseline, and compare detection accuracy in terms of masks generated from ellipse parameters.
Occlusion Handling. One of the most common applications of occlusion handling strategies is pedestrian detection. The part-based methods [30] propose to learn a set of specific, manually designed occlusion patterns, in which either hand-crafted features [31] or predefined semantic parts [32] are employed. The drawback is that each occlusion pattern detector is learned separately, which makes the whole procedure complex and hard to train. In contrast, our approach does not require predefining occlusion patterns. Moreover, pedestrians appear vertically in the image and are represented by bounding boxes, whereas the orientation of elliptical objects described as ellipses cannot be predetermined, which increases the complexity of learning patterns. We incorporate estimated ellipse parameters to learn a continuous vector that serves as a reference to generalize different learned patterns. The recent work [33] learns implicit information to infer the 6D pose of an object by using an autoencoder structure. In contrast, we integrate the U-Net structure to retrieve occluded information in decoded feature maps.
III Overview of Bounding-Box Regression
Most single-shot and R-CNN based networks formulate object detection as a regression problem, which outputs a class-specific bounding box for each prediction [15, 16, 34, 17, 6, 7]. In our case (see Fig. 3), the input to the R-CNN models is a set of training pairs $\{(P^i, G^i)\}$, where $P^i = (P_x^i, P_y^i, P_w^i, P_h^i)$ denotes the pixel coordinates of the center of a region proposal together with its width and height in pixels. The ground-truth (GT) bounding box $G^i = (G_x^i, G_y^i, G_w^i, G_h^i)$ corresponding to $P^i$ is defined in the same way as $P^i$. For example, in Mask R-CNN, proposals are generated by a region proposal network (RPN) [6] after the input image passes through the base net (i.e., ResNet-101 [35]). The feature maps for each proposal are cropped from the top convolutional feature maps generated by the feature pyramid network (FPN) [36] according to different proposal scales. The following RoIAlign layer [7] reshapes the cropped features to produce feature maps of the same size per proposal for classification and bounding-box regression.
The regression goal is to learn a transformation that maps a proposed region $P$ to a ground-truth box $G$. Instead of directly predicting the absolute coordinates and sizes of a bounding box, the model learns to estimate the relative offset parameters $t_x$, $t_y$, $t_w$, and $t_h$ that describe how different the proposal is compared to the ground truth (we drop the superscript $i$ for simplicity):

(1)   $t_x = \frac{G_x - P_x}{P_w}, \quad t_y = \frac{G_y - P_y}{P_h}, \quad t_w = \log\frac{G_w}{P_w}, \quad t_h = \log\frac{G_h}{P_h},$
where $\mathbf{t} = (t_x, t_y, t_w, t_h)$ is the regression target, and the predicted box $\hat{G}$ is recovered from the predicted offsets $\hat{\mathbf{t}}$ by inverting Eq. (1). For the object likelihood (of $C$ classes), the model considers the background (i.e., absence of objects) as another class, and predicts the confidence scores of $C+1$ classes. Specifically, for fruit or face detection, there exists only one class of interest ($C = 1$), such that only two values are necessary for the output per proposal, with the class determined by the higher one.
There are two key benefits of predicting relative offset parameters in Eq. (1) for accurate boundingbox regression:

All four parameters of each bounding box are normalized such that objects with hugely different sizes still contribute equally to the total regression loss, which also means that the loss is unaffected by the image size.

The normalization guarantees that all predicted values are close to zero (with small magnitudes) when the proposed region is near the ground-truth box, which stabilizes the training procedure by avoiding unbounded outputs.
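The offset encoding of Eq. (1) and its inverse can be sketched as follows (a minimal Python illustration of the standard R-CNN parameterization; function names are ours):

```python
import math

def encode_box(proposal, gt):
    """Map a proposal (x, y, w, h) and a ground-truth box to offsets
    (tx, ty, tw, th) as in Eq. (1): normalized center shifts plus
    log-space size ratios."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def decode_box(proposal, t):
    """Recover the predicted box from a proposal and predicted offsets
    by inverting the encoding."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph,
            pw * math.exp(tw), ph * math.exp(th))
```

Encoding a ground-truth box against its proposal and decoding the result recovers the box exactly, which is what makes the parameterization a stable regression target.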
IV Proposed Ellipse Regressor
Our key idea for ellipse regression is to infer relative offset parameters directly from visible parts so as to maintain the two key benefits described in Sec. III. By further learning occlusion patterns, the confidence score of the visible part of an occluded object is leveraged by incorporating information from its estimated ellipse. Since Mask R-CNN [7] obtains state-of-the-art results in general object detection and instance segmentation, we exploit its base model as our front-end network (see Fig. 1).
IV-A Formulation of Ellipse Feature Regions
In geometry, a general ellipse oriented arbitrarily (see Fig. 4) can be defined by its five parameters: center coordinates $(x_c, y_c)$, semi-major and semi-minor axes $a$, $b$ ($a \ge b$), and rotation angle $\theta$ (from the positive horizontal axis to the major axis of the ellipse). The canonical form of the general ellipse [37] is obtained as follows:

(2)   $A x^2 + B x y + C y^2 + D x + E y + F = 0,$
where the ellipse orientation is $\theta = \frac{1}{2}\arctan\frac{B}{A - C}$. We aim to train a regressor for predicting all five ellipse parameters, given a set of training pairs $\{(P^i, E^i)\}$ as the input to the ellipse regressor, where $E^i$ is the ground-truth ellipse characterized by $(x_c, y_c, a, b, \theta)$ and $P^i$ is denoted in the same way as in Sec. III. This can be thought of as ellipse regression from a proposed feature region to a nearby ground-truth ellipse.
However, the strategy of bounding-box regression cannot be directly applied to ellipse regression. The major challenge is how to accurately keep the ellipse orientation information in each feature region before the regression stage. For example, the RoIAlign layer [7] in the state-of-the-art methods resizes the rectangular proposed regions of various shapes into squares of a fixed size, but this distorts the feature maps and makes the prediction of the original ellipse orientation unstable (see Fig. 4 for more details).
We therefore propose, before the resizing operation, to extend each rectangular feature proposal into a square region $S$, whose side length depends only on the axis sizes of the ellipse to be predicted. The side length of the extended square region is derived as follows. In Eq. (2), we take the derivative of $y$ with respect to $x$:

(3)   $\frac{dy}{dx} = -\frac{2Ax + By + D}{Bx + 2Cy + E}.$
To determine the axis-aligned bounding box of the ellipse, we set the numerator and denominator of Eq. (3) to zero separately, since a zero numerator and a zero denominator correspond to horizontal and vertical tangents of the ellipse, respectively. The bounding-box length along each axis is solved as:

(4)   $w = 2\sqrt{a^2\cos^2\theta + b^2\sin^2\theta}, \quad h = 2\sqrt{a^2\sin^2\theta + b^2\cos^2\theta}.$

We further create an extended square enclosing the ellipse bounding box, whose diagonal length is defined as the square side length $S_l = \sqrt{w^2 + h^2} = 2\sqrt{a^2 + b^2}$.
Thus, given a proposal $P$ closely bounding the ellipse, we extend it into the square $S$ sharing the same center, with side length $S_l$. Besides avoiding distortion of the ellipse orientation, the other advantage of the extended feature region is that its size is still proportional to the ellipse size ($a$ and $b$) but independent of the ellipse angle $\theta$. This implies that ellipses of the same size but with different orientation angles contribute equally to the regression loss (otherwise, the different sizes of their axis-aligned bounding boxes would be weighted inconsistently in the loss and make the regression model sensitive to the ellipse angle). Our extension strategy also addresses the issue in direct resizing methods (see Fig. 4): since the prediction of $\theta$ is coupled with the ellipse region shape (see $w$ and $h$ in Eq. (4)), the resizing step complicates the orientation learning process. The extended feature region thus serves as a stable reference for accurately predicting ellipse offset parameters.
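The geometry above can be checked numerically: the axis-aligned bounding box of a rotated ellipse per Eq. (4), and the extended square side, which equals the box diagonal and is independent of the angle (a minimal Python sketch; function names are ours):

```python
import math

def ellipse_bbox(a, b, theta):
    """Axis-aligned bounding-box width and height of an ellipse with
    semi-axes a >= b rotated by theta, per Eq. (4)."""
    w = 2.0 * math.sqrt(a**2 * math.cos(theta)**2 + b**2 * math.sin(theta)**2)
    h = 2.0 * math.sqrt(a**2 * math.sin(theta)**2 + b**2 * math.cos(theta)**2)
    return w, h

def extended_square_length(a, b):
    """Side of the extended square: the bounding-box diagonal
    sqrt(w^2 + h^2) = 2*sqrt(a^2 + b^2), independent of theta."""
    return 2.0 * math.sqrt(a**2 + b**2)
```

For any rotation angle, `sqrt(w**2 + h**2)` stays constant at `2*sqrt(a**2 + b**2)`, which is exactly why the extended square is a stable, orientation-free reference region.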
IV-B Ellipse Offset Prediction
Given a square feature region $S$ extended from a proposal $P$, our goal is to learn to regress the features within $S$ to a set of relative offset parameters between $S$ and a ground-truth ellipse $E$. We start by predicting elliptical objects without occlusion, and then propose stable offset parameters to handle occluded cases.
Unoccluded Ellipse Prediction
For well-separated objects, we parameterize the regression in terms of five outputs $t_x$, $t_y$, $t_a$, $t_b$, and $t_\theta$ (the superscript $i$ is dropped for simplicity):

(5)   $t_x = \frac{E_x - S_x}{S_l}, \quad t_y = \frac{E_y - S_y}{S_l}, \quad t_a = \log\frac{E_a}{S_l}, \quad t_b = \log\frac{E_b}{S_l}, \quad t_\theta = \frac{E_\theta}{\pi},$
where the range of the GT ellipse angle $E_\theta$ is $(-\pi/2, \pi/2]$, $\mathbf{t} = (t_x, t_y, t_a, t_b, t_\theta)$ is the ellipse regression target, and the predicted ellipse $\hat{E}$ is calculated from the predicted offsets $\hat{\mathbf{t}}$. $(t_x, t_y)$ specifies the scale-invariant translation from the center of $S$ to that of $E$, while $t_a$ and $t_b$ specify the log-space translations from the size of $S$ to the semi-major and semi-minor axes of $E$, respectively. $t_\theta$ is the prediction of the normalized orientation of $E$. In such an unoccluded case, the predicted offset values are all bounded when the proposed region (and thus $S$) is located close to the ground-truth ellipse (see Fig. 5).
Occluded Ellipse Prediction
For occluded object detection, training the RPN [6] to propose regions of visible parts (instead of whole object regions) greatly reduces false positives, as shown in Sec. V. We infer the whole elliptical object from its visible part through ellipse regression.
However, as the visible region becomes small and lies near the object boundary, the target values to learn in Eq. (5) can have unboundedly large magnitudes, which destabilizes the training process. We thus propose to predict one more offset parameter $t_r$ for the scale (see Fig. 5):

(6)   $t_r = \frac{S_l}{E_l}, \quad t_x = \frac{E_x - S_x}{E_l}, \quad t_y = \frac{E_y - S_y}{E_l}, \quad t_a = \log\frac{E_a}{E_l}, \quad t_b = \log\frac{E_b}{E_l}, \quad t_\theta = \frac{E_\theta}{\pi},$
where $E_l = 2\sqrt{E_a^2 + E_b^2}$, and $t_r \in (0, 1]$ characterizes the visibility ratio calculated between the size of the extended square (from the visible part) and the side length of the square enclosing the whole ellipse (i.e., the whole object region). By predicting $t_r$, we transfer the offset reference from the visible part $S$ to the whole object region, which guarantees that all predicted values (together with the target $t_r$) are bounded even in heavily occluded cases (when the proposed region is near the small visible part). Specifically, as $t_r \to 1$ and $S_l \to E_l$, Eq. (6) and Eq. (5) are equivalent, which means that Eq. (6) is a generalized formulation of ellipse offset prediction that can handle both unoccluded and occluded cases.
After learning such offset parameters, we can transform an input extended region $S$ into a predicted ellipse $\hat{E}$ by applying the inverse transformation:

(7)   $\hat{E}_l = \frac{S_l}{\hat{t}_r}, \quad \hat{E}_x = S_x + \hat{t}_x \hat{E}_l, \quad \hat{E}_y = S_y + \hat{t}_y \hat{E}_l, \quad \hat{E}_a = \hat{E}_l e^{\hat{t}_a}, \quad \hat{E}_b = \hat{E}_l e^{\hat{t}_b},$
where the predicted orientation $\hat{E}_\theta$ is rectified from $\pi \hat{t}_\theta$ such that $\hat{E}_\theta \in (-\pi/2, \pi/2]$.
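The occlusion-aware offset encoding and its inverse, per Eqs. (6) and (7), can be sketched as follows (one plausible realization in Python; the exact normalization constants may differ from the published head, and all names are ours):

```python
import math

def encode_ellipse(square, ellipse):
    """Offsets (tr, tx, ty, ta, tb, ttheta) from an extended square
    (sx, sy, sl) to a GT ellipse (ex, ey, ea, eb, etheta), per Eq. (6).
    tr is the visibility ratio in (0, 1]."""
    sx, sy, sl = square
    ex, ey, ea, eb, eth = ellipse
    el = 2.0 * math.sqrt(ea**2 + eb**2)  # square length of the whole object
    return (sl / el,
            (ex - sx) / el, (ey - sy) / el,
            math.log(ea / el), math.log(eb / el),
            eth / math.pi)               # normalized orientation

def decode_ellipse(square, t):
    """Recover the predicted ellipse from the extended square and the
    predicted offsets by inverting the encoding, per Eq. (7)."""
    sx, sy, sl = square
    tr, tx, ty, ta, tb, tth = t
    el = sl / tr                         # whole-object square length
    return (sx + tx * el, sy + ty * el,
            el * math.exp(ta), el * math.exp(tb), tth * math.pi)
```

When the visible part covers the whole object (`tr == 1`), the encoding degenerates to the unoccluded case of Eq. (5), which matches the equivalence argument above.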
Ellipse Regression Loss
For a proposed region $S$, we define the regression loss as:

(8)   $L_{reg}(\mathbf{t}, \hat{\mathbf{t}}) = p^* \Big( \sum_{j \in \{r, x, y, a, b\}} \ell(t_j - \hat{t}_j) + \ell\big(g(t_\theta - \hat{t}_\theta)\big) \Big),$
where $p^* = 1$ indicates that $S$ is positive (if the intersection-over-union (IoU) overlap with its ground-truth box is higher than a threshold ratio [6]), while $p^* = 0$ if $S$ is non-positive. $\ell$ is the robust loss function (smooth $L_1$) defined in [17], and $g$ is the transformation function defined as:

(9)   $g(\Delta t_\theta) = \min\big(|\Delta t_\theta|,\, 1 - |\Delta t_\theta|\big),$
which rectifies the orientation loss of $\hat{E}$ compared to $E$ around critical angles (for example, the angle difference between $\pi/2$ and $-\pi/2$ should be zero rather than $\pi$).
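The periodic rectification of Eq. (9) can be sketched in a few lines (a minimal Python illustration under the normalization $t_\theta = E_\theta/\pi$, so the orientation has period 1; the function name is ours):

```python
def angle_residual(t_gt, t_pred):
    """Periodic residual between normalized orientations t = theta/pi
    (period 1), per Eq. (9): the gap between pi/2 and -pi/2 is 0, not pi."""
    d = abs(t_gt - t_pred) % 1.0
    return min(d, 1.0 - d)
```

Feeding this residual into the smooth $L_1$ loss avoids penalizing predictions that are geometrically identical but numerically far apart across the angle wrap-around.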
IV-C Feature Region Refinement
Traditional R-CNN based methods generate regression and classification outputs directly from the proposed feature regions. However, relying only on roughly proposed feature maps from the RPN may be risky and error-prone, especially for predicting ellipse orientation in heavily occluded cases (see Fig. 6). Specifically, there exists a mismatch between a predicted visible region and its feature representation (see Fig. 7). Thus, our idea is to perform ellipse regression and classification based on the refined feature region output by a bounding-box regressor. This strategy alleviates the issue by allowing the model to exploit the features of the exact predicted visible region, which makes the inference output more reliable.
Based on the extended predicted region, the RoIAlign layer re-extracts a small feature map (e.g., 14×14), and accurately aligns the extracted features with the input from the FPN. Features inside the extended square but outside the predicted visible region have a negative effect on predicting accurate ellipse parameters (see Fig. 6). To reduce the interference of such unrelated features, we perform zero padding on the extended feature area. Our proposed method is simple: we use floor and ceiling operations to compute the boundaries of the smallest rectangle that encloses the bilinear-interpolated feature map from the predicted region, and pad zeros in the rest of the extended square. For example, the floor and ceiling values give the two width limits of the resized rectangle around its assumed center, computed from the width of the predicted region and the resizing factor. The refined feature region leads to large improvements, as we show in Sec. V.
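The floor/ceiling boundary computation and zero padding can be sketched as follows (a hypothetical Python helper for one axis; `rho` stands for the resizing factor and all names are ours, not from the paper):

```python
import math

def visible_limits(center, size, rho):
    """Floor/ceiling limits of the resized visible rectangle along one
    axis of the extended feature map: `center` is the rectangle center in
    feature-map coordinates, `size` the predicted visible extent, and
    `rho` the resizing factor. Features outside [lo, hi) get zero-padded."""
    lo = math.floor(center - rho * size / 2.0)
    hi = math.ceil(center + rho * size / 2.0)
    return lo, hi

def zero_pad_outside(feat, lo, hi):
    """Zero out feature columns outside [lo, hi) along one axis
    (feat is a list of rows)."""
    return [[v if lo <= j < hi else 0.0 for j, v in enumerate(row)]
            for row in feat]
```

Applying the same limits along both axes keeps only the features of the predicted visible region inside the extended square, which is the intended refinement.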
IV-D Learning Occlusion Patterns
Diverse appearances of occluded objects lead to a large variety of occlusion patterns (see Fig. 1). Traditional networks are likely to assign a low confidence to an occluded object due to its hidden parts. Our key idea for occlusion handling is to employ channel-wise attention on refined features by learning different occlusion patterns in one coherent model. Our model can leverage the prediction confidence of the visible part of an elliptical object based on the inference of the whole ellipse from the occlusion (see Fig. 8).
Occluded Ellipse Patterns
Given the refined features of a predicted visible region, we exploit a U-Net [11] structure to learn the occluded ellipse shape within the extended square (see Fig. 8). The ground truth of the occluded ellipse shape is generated as follows. For an occluded object, we identify a bounding box of the visible part together with its ellipse parameters. The generated GT whole ellipse is then cropped and resized by a predicted visible region, and centered in the extended square. The GT visible ellipse is thus obtained without being occluded by other nearby obstacles. Unlike previous work [32, 40], our method does not rely on any particular discrete set of occlusion patterns or any external classifier for guidance, and thus can be trained in an end-to-end manner.
By learning occluded ellipse patterns, the low-dimensional latent features encode both partial visibility and ellipse shape information [33]. Therefore, we perform ellipse regression directly from the latent features. The ellipse offsets are obtained via a multi-layer perceptron (MLP) [41] (see Fig. 8).
Visible Part Attention
Many recent works [44, 45] find that convolution filters of different feature channels respond to their specific high-level concepts, which are associated with different semantic parts. To leverage the detection confidence in occluded cases, our intuition is to allow the network to decide how much each channel should contribute to the refined features $F$. Specifically, the channels representing the visible parts should be weighted more, while the occluded parts should be weighted less. We thus reweight the refined features $F$ as $\tilde{F}$:

(10)   $\tilde{F}_c = w_c \cdot F_c, \quad c = 1, \dots, C,$
where $\mathbf{w} = (w_1, \dots, w_C)$ is the attention weighting vector regressed from the latent features (learned partial visibility) by an MLP, and $C$ is the total number of channels (e.g., 256). The reweighted features $\tilde{F}$ are further regressed into a feature vector for classification.
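The channel-wise reweighting of Eq. (10) amounts to scaling each channel's feature map by its learned attention weight (a minimal pure-Python sketch; in practice this would be a broadcast multiply on tensors):

```python
def reweight_channels(features, w):
    """Channel-wise attention, per Eq. (10): scale every value of
    channel c in `features` (a list of C 2D maps) by its weight w[c]."""
    assert len(features) == len(w)
    return [[[w[c] * v for v in row] for row in features[c]]
            for c in range(len(features))]
```

Channels whose weights are driven toward zero (learned to correspond to occluded parts) are effectively suppressed before classification, while visible-part channels pass through amplified.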
Various ellipse orientations may increase the learning complexity of occlusion patterns. To compensate for the orientation effect, we propose to concatenate the classification feature with the latent feature (used for ellipse regression in Fig. 8) to incorporate both partial visibility and whole-ellipse information. The concatenated feature thus learns various occlusion patterns, and passes through the classification head to output the final confidence scores.
Training Objective
The R-CNN based models have two types of losses: the RPN loss and the head loss [6] (composed of a classification loss $L_{cls}$ and a regression loss). We redefine the head regression loss as the sum of the feature region refinement loss $L_{box}$ and the ellipse regression loss $L_{reg}$. On top of that, our occlusion handling introduces one additional loss $L_{occ}$ defined as the average binary cross-entropy loss. The loss function of the whole system can be written as follows:

(11)   $L = L_{RPN} + L_{cls} + L_{box} + L_{reg} + L_{occ},$
where the classification loss $L_{cls}$ is over two classes (object vs. background), and the GT label of feature region $i$ (out of $N$ regions in total) is 1 if the region is positive (contains an object) and 0 otherwise.
Methods  R  O  A  

Mask R-CNN+  –  –  –  30.4  66.8  27.1  74.5  40.0  79.5  23.7  39.1  55.2  47.2 
Ellipse R-CNN  37.1  73.2  36.7  67.9  30.1  68.0  34.2  50.6  44.0  38.3  
Ellipse R-CNN  ✓  44.9  83.1  48.1  62.4  23.2  69.9  43.6  65.2  35.9  27.1  
DeepParts+ [32]  ✓  ✓  47.5  85.7  53.6  58.3  17.9  54.7  47.7  69.5  31.6  23.7  
SENet+ [46]  ✓  ✓  48.0  86.5  53.9  57.7  16.8  55.0  48.1  71.1  30.5  22.2  
Ellipse R-CNN*  ✓  ✓  ✓  49.2  88.1  56.0  55.2  15.5  53.0  49.9  73.7  27.3  20.9 
Ellipse R-CNN  ✓  ✓  ✓  51.5  89.9  58.9  53.8  13.3  50.9  52.3  76.8  24.7  18.5 
The ellipse IoU level starts from 0.75 to 0.95 with an interval of 0.05. The angle-error threshold and the additional error threshold decrease over evenly spaced levels. The default ellipse IoU for the error-based metrics is 0.75.
V Experiments
In this section, we first introduce the synthetic and real datasets we use for the experiments, followed by a description of the implementation details and evaluation metrics. After that, we show the experimental results of the ablation study for our Ellipse R-CNN detector, and make a comparison to the state of the art. In the end, we demonstrate how Ellipse R-CNN helps improve the accuracy of 3D object estimation in occluded cases.
V-A Datasets
We validate the proposed Ellipse R-CNN on four datasets: synthetic occluded ellipses (SOE), synthetic occluded fruits (SOF), real occluded fruits (ROF), and FDDB [10]. Each elliptical object is annotated with the five ellipse parameters of the whole object region along with a bounding box of the visible part (except for the FDDB dataset, as shown in Fig. 5).
The SOE dataset consists of 16,500 images in total: approximately 15,000 images are for training and the rest for testing. Synthetic images are generated from clusters of different ellipses occluding each other in the same distribution as in Fig. 6. The image background is randomly filled using the Pascal dataset [12], with randomly added triangles (simulating nearby obstacles) to further occlude the ellipses (the visibility ratio of each ellipse is lower-bounded). To introduce more interference, ellipse colors are randomly generated in roughly the same tone as in real cases (e.g., clustered fruits and faces).
The SOF dataset contains 3,545 images (3,040 for training and 505 for testing) of clusters of fruits occluded in a realistic tree, generated by changing the poses and sizes of each model in Unreal Engine (UE), with the background randomly filled by images taken from different real orchards [27]. The GT ellipses are obtained by projecting the 3D fruit ellipsoids onto the corresponding images [47] based on camera poses.
Methods  R  O  A  

Mask R-CNN+  –  –  –  25.7  59.7  20.3  78.3  46.6  84.1  25.6  43.0  27.6 
Ellipse R-CNN  30.3  66.2  25.1  74.3  38.4  79.5  34.9  54.9  39.7  
Ellipse R-CNN  ✓  35.4  73.8  33.9  69.5  27.2  74.3  46.1  64.3  53.7  
DeepParts+ [32]  ✓  ✓  38.3  79.6  36.2  65.9  23.4  70.4  52.2  70.9  59.2  
SENet+ [46]  ✓  ✓  38.9  78.9  36.9  65.4  22.9  71.9  53.5  71.8  61.3  
Ellipse R-CNN*  ✓  ✓  ✓  40.0  81.4  37.8  64.2  21.2  68.0  54.2  73.7  63.7 
Ellipse R-CNN  ✓  ✓  ✓  41.2  83.9  39.0  63.4  19.0  68.5  56.4  76.5  66.8 
The angle-error threshold decreases over evenly spaced levels. The default ellipse IoU for the angle-based metrics is 0.7.
The ROF dataset (1,115 images in total) is human-annotated and built upon the MinneApple [48] and ACFR [20] datasets, from which we crop sub-images of heavily occluded fruit clusters. We perform a training-and-test split similar to that in [20], composed of 900 and 215 images, respectively. The FDDB dataset [10] includes 2,845 images of 5,171 faces split into ten folds. Since most faces are well separated and only have GT ellipses (without GT visible boxes), we only demonstrate the generalization of our ellipse regressor on this dataset through 10-fold cross-validation.
V-B Implementation Details
We use TensorFlow [49] to implement and train the Ellipse R-CNN. For comparison, we directly use the source code of Mask R-CNN provided by Matterport [50]. For training, we use weights pretrained on MS COCO [14] to initialize the Ellipse R-CNN, and use a step strategy with mini-batch stochastic gradient descent (SGD) to train the networks on a GeForce GTX 1080 GPU. On the SOF, ROF, and FDDB datasets, we train with the initial learning rate for 20,000 iterations and train for another 10,000 iterations with a decreased learning rate. On the SOE dataset, we start with the same initial learning rate, and then decrease the learning rate by a factor of 5 after every 20,000 iterations. The model converges at 50,000 iterations. During training, we perform on-the-fly data augmentation with random flipping, shifting, and rotation. We resize the ellipse and fruit images to 128×128, while the face images are resized to 256×256 in order to preserve facial details at a higher resolution for training and testing.
Methods  R  O  A  

Mask R-CNN+  –  –  –  33.4  59.8  34.2  69.6  44.1  69.2  20.0  34.8  20.6 
Ellipse R-CNN  36.1  64.1  37.1  67.0  39.9  66.4  24.5  40.5  25.9  
Ellipse R-CNN  ✓  40.5  68.6  42.0  63.5  34.7  61.6  31.0  49.7  33.2  
DeepParts+ [32]  ✓  ✓  43.2  74.5  44.4  61.4  29.2  59.2  34.8  53.6  37.6  
SENet+ [46]  ✓  ✓  43.9  74.0  46.5  60.9  28.5  59.7  36.9  55.1  39.7  
Ellipse R-CNN*  ✓  ✓  ✓  44.6  76.4  49.2  59.1  26.3  56.2  38.6  57.8  42.8 
Ellipse R-CNN  ✓  ✓  ✓  45.8  78.2  48.9  58.1  24.6  56.9  40.7  60.0  44.2 
The angle-error threshold decreases over evenly spaced levels. The default ellipse IoU for the angle-based metrics is 0.7.
V-C Evaluation Metrics
Four evaluation metrics are exploited in all of our experiments: average precision (AP [14], averaged over ellipse IoU thresholds), log-average miss rate (MR) [51], and their angle-based counterparts (AP and MR over ellipse angle errors). MR is the average of the miss rates at 9 FPPI (false positives per image) rates evenly spaced in log space. By introducing the angle-based metrics, we focus more on the accuracy of the predicted ellipse angles. For example, we consider a prediction (evaluated by the angle-based metrics) as a false positive if its ellipse IoU is less than 0.75 (set as the default IoU) or its angle error exceeds the threshold. To clearly show the performance differences, we use strict criteria: for instance, the IoU level starts from 0.75 up to 0.95 with an interval of 0.05, and the angle-error threshold decreases over evenly spaced levels. We use AP and MR to measure the overall performance, as they place significant emphasis on localization and on the miss rate in occluded cases, respectively.
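The log-average miss rate can be sketched as follows (a minimal Python illustration; the FPPI range of $[10^{-2}, 10^{0}]$ is our assumption following the common convention of [51], since the exact range is not stated here):

```python
import math

def log_average_miss_rate(fppi, miss_rate, lo=1e-2, hi=1.0, n=9):
    """Average miss rate at n FPPI reference points evenly spaced in log
    space over [lo, hi] (range assumed). `fppi` must be sorted ascending;
    at each reference point we take the miss rate of the largest FPPI not
    exceeding it, then average in log space (geometric mean)."""
    refs = [lo * (hi / lo) ** (i / (n - 1)) for i in range(n)]
    vals = []
    for r in refs:
        candidates = [m for f, m in zip(fppi, miss_rate) if f <= r]
        vals.append(candidates[-1] if candidates else 1.0)
    return math.exp(sum(math.log(max(v, 1e-10)) for v in vals) / n)
```

A flat miss-rate curve returns that same value, while the geometric mean emphasizes low-FPPI operating points, which is why MR is a strict summary in occluded cases.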
V-D Performance of Ellipse R-CNN
We compare the proposed Ellipse R-CNN to the baseline Mask R-CNN model, which obtains state-of-the-art results in general object detection and instance segmentation. Since our model is the first work on ellipse regression, to make a fair comparison, we fit ellipses directly to the mask outputs of Mask R-CNN (trained on the regions of whole objects) using the method of minimum-volume enclosing ellipsoid [52] in 2D (i.e., Mask R-CNN+). We run a number of ablations to further analyze Ellipse R-CNN. For the ablation study of occlusion handling, we adapt two state-of-the-art methods into our model: DeepParts+ and SENet+. In DeepParts+, we only keep the U-Net structure to learn a set of 45 occlusion patterns, and the final score is obtained via an MLP on the part detection scores [32]. For SENet+, we learn the attention vector directly from the refined feature maps (without the U-Net), and perform classification only on the reweighted features [46].
Methods  Metrics  R  F1  F2  F3  F4  F5  F6  F7  F8  F9  F10  Avg. 

Mask R-CNN+  –  59.0  56.9  60.0  61.4  56.7  58.2  59.3  59.3  55.1  61.1  58.7  
Ellipse R-CNN  64.4  65.1  66.3  67.5  66.8  67.6  67.1  66.5  65.2  66.9  66.3  
Ellipse R-CNN  ✓  68.7  70.9  71.4  72.2  72.7  74.0  73.0  71.8  71.9  71.1  71.8  
Mask R-CNN+  –  21.2  21.0  18.5  18.6  17.6  14.8  20.0  18.2  22.2  18.8  19.1  
Ellipse R-CNN  16.8  15.9  15.2  14.5  14.3  13.7  14.9  14.5  15.7  15.2  15.1  
Ellipse R-CNN  ✓  11.2  11.9  11.1  10.2  10.6  8.4  9.6  9.8  11.5  10.6  10.5 
Accuracy of Ellipse Regression
The key component of our Ellipse R-CNN is the ellipse regressor. Some examples of detected elliptical objects are illustrated in Figs. 9–12. Tables I–II show the breakdown performance of the ellipse prediction on the SOE and SOF datasets, whose GT is perfectly generated based on the geometry of the object models. Our strategy of ellipse regression (i.e., Ellipse R-CNN) leads to significant performance improvements on all metrics compared to the baseline model. Specifically, Table I shows that both the AP and MR values of the proposed model are not sensitive to increased levels of angle error, which means that our strategy achieves a high accuracy of ellipse orientation prediction. We also observe that the Mask R-CNN+ model trained on whole object regions (instead of visible parts) suffers from more false positives due to the high similarities among the proposed feature regions (see Figs. 9–11). For the ROF dataset, Table III shows a higher sensitivity of our model to the angle-based metrics compared to the SOE and SOF datasets: the overall AP is higher than that in Table II, but the angle-based AP drops considerably. The reason is that most human-annotated fruits are close to circles, whose GT orientation information is noisy and inconsistent. Thus, it is hard to quantify the results on the angle-based metrics, but our proposed model still achieves the best performance on AP and MR.
Validity of Feature Region Refinement
Tables I–III show the detailed breakdown performance of the proposed feature region refinement (i.e., Ellipse R-CNN with R) on the SOE, SOF and ROF datasets. The performance is largely improved when the refined features are used for ellipse regression and classification. The improvements in AP and MR indicate that the refinement strategy is beneficial not only for increasing the accuracy of ellipse region prediction but also for reducing the false positives in classification, especially in occluded cases. However, Table IV shows smaller improvements when we apply the feature refinement strategy to the FDDB dataset: AP and MR are improved by only 5.5 and 4.6, respectively. As discussed in Sec. IV-C, feature region refinement is used to remove the interference of nearby occlusions. Most faces in the FDDB dataset are well-separated, and there are few clustered and occluded cases. Thus, the improvements in Table IV from using the refined features are not as significant as those in Tables I–III.
Performance of Occlusion Handling
One of our evaluation goals is occlusion handling, whose overall performance is measured by AP and MR, as shown in Tables I–III. All three variants with different mechanisms of occlusion handling show improvements over the baseline (i.e., Ellipse R-CNN with R), ranging from 2.1 to 7.2 on AP and from 4.3 to 8.6 on MR. Overall, the error rates can be sorted in the following order: DeepParts+ > SENet+ > Ellipse R-CNN*. The reason is that DeepParts+ is limited by its fixed number of occlusion patterns to learn, while SENet+ learns a continuous attention vector to adjust feature weights but lacks the whole-ellipse information needed to generalize across different occlusions. We further compare our Ellipse R-CNN to Ellipse R-CNN* (without concatenating the regressed ellipse parameters). The gap between them demonstrates that concatenating the regressed ellipse with the attention vector is a more effective way of generalizing various occlusion patterns from ellipse predictions.
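For concreteness, the channel reweighting adapted in the SENet+ variant [46] follows the standard squeeze-and-excitation pattern: global average pooling per channel, a small two-layer bottleneck, and a sigmoid gate that rescales each channel. The NumPy sketch below uses hypothetical weight matrices `w1` and `w2`; in the detector these are learned end-to-end:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_reweight(feat, w1, w2):
    """Squeeze-and-excitation channel reweighting (SENet-style sketch).

    feat: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r) are the
    bottleneck weights with reduction ratio r. Returns the feature map
    rescaled by a per-channel gate in (0, 1).
    """
    z = feat.mean(axis=(1, 2))                   # squeeze: global average pool -> (C,)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))    # excite: FC-ReLU-FC-sigmoid -> (C,)
    return feat * s[:, None, None]               # channel-wise rescale
```

Because the gate lies in (0, 1), the mechanism can only attenuate channels, which is one way to down-weight features corrupted by a nearby occluder.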
Generalization of Ellipse Regressor
To investigate the generalization ability of the proposed ellipse regressor, we also perform experiments on the FDDB dataset. Since no GT visible boxes are available and few objects are clustered, we can only evaluate our model without the occlusion handling mechanism (i.e., Ellipse R-CNN with R). Focusing on the accuracy of orientation prediction, we show the results of 10-fold cross-validation in Table IV, where our model outperforms the Mask R-CNN+ baseline by 13.1 on AP and 8.6 on MR, respectively. We also show some qualitative results in Fig. 12, where we can observe that our detector produces robust ellipse detections even in some extreme cases. Specifically, in all seven examples, several faces are heavily occluded by the image boundaries. Mask R-CNN+ produces many distorted face shapes, while our detector accurately infers the whole ellipse regions for all of them.
Discussion on 3D Object Estimation
To understand how Ellipse R-CNN improves the accuracy of 3D object estimation, we implement multi-view 3D localization using quadrics [9] from 2D detections on the SOF dataset. We compare our detector with Mask R-CNN+ and summarize the results in Table V. The evaluation metrics include the rotation error [53], position error, and relative size error in 3D, averaged over all objects. For each UE setting (24 different settings in total), we select three images taken from different view angles to serve as the same inputs for both methods. As shown in the comparison, all three estimation errors of Ellipse R-CNN are much lower than those of Mask R-CNN+, especially the rotation error (i.e., 12.6° vs. 37.2°). This is because Ellipse R-CNN better infers the whole region of each object directly from the visible part, and is thus more effective in estimating the 3D pose and shape of objects under occlusion. More qualitative results are shown in Fig. 13.
Methods         Rot. Error (°)   Pos. Error (cm)   Rel. Size Error
Mask R-CNN+     37.2             3.3               28.5%
Ellipse R-CNN   12.6             1.6               10.3%
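Each 2D ellipse detection enters the quadric-based localization [9] as a conic. Assuming the common parameterization (center, semi-axes, orientation), an ellipse can be converted to its 3x3 conic matrix as sketched below; this is an illustrative helper under that assumed parameterization, not the exact implementation:

```python
import numpy as np

def ellipse_to_conic(cx, cy, a, b, theta):
    """3x3 conic matrix C of an ellipse: boundary points x (homogeneous)
    satisfy x^T C x = 0. (cx, cy): center; a, b: semi-axes; theta: radians."""
    # Conic of the axis-aligned ellipse x^2/a^2 + y^2/b^2 = 1 at the origin.
    C0 = np.diag([1.0 / a**2, 1.0 / b**2, -1.0])
    # Homography mapping that canonical frame into the image: rotate, translate.
    c, s = np.cos(theta), np.sin(theta)
    H = np.array([[c, -s, cx],
                  [s,  c, cy],
                  [0.0, 0.0, 1.0]])
    Hinv = np.linalg.inv(H)
    return Hinv.T @ C0 @ Hinv    # conics transform contravariantly under H
```

Stacking the constraints induced by such conics from several calibrated views yields the linear system whose solution is the dual quadric of the 3D object.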
VI Conclusion
This paper shows that traditional R-CNN methods are not well-suited for ellipse fitting, since they predict only bounding boxes, which carry no orientation information, and they are typically trained on the whole object regions in occluded cases. As a result, such deep models output a large number of false positives and are unreliable as inputs for further 3D estimation of object pose and size. We thus propose Ellipse R-CNN to focus on the visible regions and infer whole elliptical objects as ellipses under heavy occlusion. A robust ellipse regression is formulated to generalize to both occluded and unoccluded cases. Our model first learns various occlusion patterns of ellipses within the refined visible regions, then generates the final classification score by integrating the visibility information from an attention vector and the whole-object information from the regressed ellipse. In this way, the model learns discriminative representations of occluded objects that are robust across differently oriented scenarios. Extensive experimental results on two synthetic datasets and two real datasets demonstrate the advantages of our model over Mask R-CNN. The current approach for 3D object estimation weights each predicted ellipse parameter from 2D detections equally. Our future work will investigate predicting the uncertainties of all ellipse parameters to further boost the accuracy of the 3D object estimation system.
Acknowledgment
We thank our colleagues Nicolai Häni and Zhihang Deng from the University of Minnesota, for providing valuable feedback and technical support throughout this research.
Wenbo Dong: Ph.D. candidate 
Pravakar Roy: Doctor 
Cheng Peng: Ph.D. candidate 
Volkan Isler: Professor 
References
 Y. Xie and Q. Ji, “A new efficient ellipse detection method,” in Object Recognition Supported by User Interaction for Service Robots, vol. 2. IEEE, 2002, pp. 957–960.
 S.-C. Zhang and Z.-Q. Liu, “A robust, real-time ellipse detector,” Pattern Recognition, vol. 38, no. 2, pp. 273–287, 2005.
 W. Lu and J. Tan, “Detection of incomplete ellipse in images with strong noise by iterative randomized Hough transform (IRHT),” Pattern Recognition, vol. 41, no. 4, pp. 1268–1279, 2008.
 D. K. Prasad, M. K. Leung, and S.-Y. Cho, “Edge curvature and convexity based ellipse detection method,” Pattern Recognition, vol. 45, no. 9, pp. 3204–3221, 2012.
 P. Roy and V. Isler, “Vision-based apple counting and yield estimation,” in International Symposium on Experimental Robotics. Springer, 2016, pp. 478–487.
 S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 6, pp. 1137–1149, 2017.
 K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 L. Nicholson, M. Milford, and N. Sünderhauf, “QuadricSLAM: Dual quadrics from object detections as landmarks in object-oriented SLAM,” IEEE Robotics and Automation Letters, vol. 4, no. 1, pp. 1–8, 2019.
 C. Rubino, M. Crocco, and A. Del Bue, “3D object localisation from multi-view image detections,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1281–1294, 2018.
 V. Jain and E. Learned-Miller, “FDDB: A benchmark for face detection in unconstrained settings,” 2010.
 O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
 M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
 A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
 J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
 W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
 R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
 Q. Wang, S. Nuske, M. Bergerman, and S. Singh, “Automated crop yield estimation for apple orchards,” in Experimental robotics. Springer, 2013, pp. 745–758.
 C. Hung, J. Underwood, J. Nieto, and S. Sukkarieh, “A feature learning based approach for automated fruit yield estimation,” in Field and service robotics. Springer, 2015, pp. 485–498.
 S. Bargoti and J. Underwood, “Deep fruit detection in orchards,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3626–3633.
 N. Häni, P. Roy, and V. Isler, “Apple counting using convolutional neural networks,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 2559–2565.
 N. Häni, P. Roy, and V. Isler, “A comparative study of fruit detection and counting methods for yield mapping in apple orchards,” Journal of Field Robotics, 2018.
 W. Dong and V. Isler, “Linear velocity from commotion motion,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 3467–3472.
 J. Das, G. Cross, C. Qu, A. Makineni, P. Tokekar, Y. Mulgaonkar, and V. Kumar, “Devices, systems, and methods for automated monitoring enabling precision agriculture,” in 2015 IEEE International Conference on Automation Science and Engineering (CASE). IEEE, 2015, pp. 462–469.
 P. Roy and V. Isler, “Surveying apple orchards with a monocular vision system,” in 2016 IEEE International Conference on Automation Science and Engineering (CASE). IEEE, 2016, pp. 916–921.
 P. Roy, W. Dong, and V. Isler, “Registering reconstructions of the two sides of fruit tree rows,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1–9.
 W. Dong, P. Roy, and V. Isler, “Semantic mapping for orchard environments by merging two-sides reconstructions of tree rows,” Journal of Field Robotics, 2018.
 C. Wu, “Towards linear-time incremental structure from motion,” in 2013 International Conference on 3D Vision (3DV). IEEE, 2013, pp. 127–134.
 R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
 M. Mathias, R. Benenson, R. Timofte, and L. Van Gool, “Handling occlusions with franken-classifiers,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1505–1512.
 M. Enzweiler, A. Eigenstetter, B. Schiele, and D. M. Gavrila, “Multi-cue pedestrian classification with partial occlusion handling,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 990–997.
 Y. Tian, P. Luo, X. Wang, and X. Tang, “Deep learning strong parts for pedestrian detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1904–1912.
 M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel, “Implicit 3D orientation learning for 6D object detection from RGB images,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 699–715.
 J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 T.Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
 R. Larson, Precalculus with limits: A graphing approach. Nelson Education, 2014.
 C. Y. Young, Precalculus. John Wiley & Sons, 2010, ch. 9.
 L. Matthey, I. Higgins, D. Hassabis, and A. Lerchner, “dSprites: Disentanglement testing sprites dataset,” https://github.com/deepmind/dsprites-dataset, 2017.
 C. Zhou and J. Yuan, “Multi-label learning of part detectors for heavily occluded pedestrian detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3486–3495.
 C. M. Bishop et al., Neural networks for pattern recognition. Oxford university press, 1995.
 S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
 V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
 D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network dissection: Quantifying interpretability of deep visual representations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6541–6549.
 A. Gonzalez-Garcia, D. Modolo, and V. Ferrari, “Do semantic parts emerge in convolutional neural networks?” International Journal of Computer Vision, vol. 126, no. 5, pp. 476–494, 2018.
 J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
 W. Qiu, F. Zhong, Y. Zhang, S. Qiao, Z. Xiao, T. S. Kim, and Y. Wang, “UnrealCV: Virtual worlds for computer vision,” in Proceedings of the 25th ACM International Conference on Multimedia. ACM, 2017, pp. 1221–1224.
 N. Häni, P. Roy, and V. Isler, “MinneApple: A benchmark dataset for apple detection and segmentation,” arXiv preprint arXiv:1909.06441, 2019.
 M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for largescale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
 W. Abdulla, “Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow,” https://github.com/matterport/Mask_RCNN, 2017.
 P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 4, pp. 743–761, 2011.
 N. Moshtagh et al., “Minimum volume enclosing ellipsoid,” Convex optimization, vol. 111, no. January, pp. 1–9, 2005.
 W. Dong and V. Isler, “A novel method for the extrinsic calibration of a 2d laser rangefinder and a camera,” IEEE Sensors Journal, vol. 18, no. 10, pp. 4200–4211, 2018.