# Ellipse R-CNN: Learning to Infer Elliptical Object from Clustering and Occlusion

## Abstract

Images of heavily occluded objects in cluttered scenes, such as fruit clusters in trees, are hard to segment. To further retrieve the 3D size and 6D pose of each individual object in such cases, bounding boxes are not reliable from multiple views since only a small portion of the object’s geometry is captured. We introduce the first CNN-based ellipse detector, called Ellipse R-CNN, to represent and infer occluded objects as ellipses. We first propose a robust and compact ellipse regression based on the Mask R-CNN architecture for elliptical object detection. Our method can infer the parameters of multiple elliptical objects even when they are occluded by other neighboring objects. For better occlusion handling, we exploit refined feature regions for the regression stage, and integrate the U-Net structure for learning different occlusion patterns to compute the final detection score. The correctness of ellipse regression is validated through experiments performed on synthetic data of clustered ellipses. We further quantitatively and qualitatively demonstrate that our approach outperforms the state-of-the-art model (i.e., Mask R-CNN followed by ellipse fitting) and its three variants on both synthetic and real datasets of occluded and clustered elliptical objects.

Ellipse regression, occlusion handling, 3D object localization, object detection, convolutional neural networks.

## I Introduction

Detection of ellipse-like shapes [1] has been widely used in various image processing tasks, for instance, face detection [2] and medical imaging diagnosis [3]. These works all build a mathematical model of the ellipse from segmented edges, contours, and curvatures [4] in the image to identify different ellipses. However, such traditional ellipse-fitting methods rely heavily on pre-processing (such as segmentation and grouping), and thus often fail to detect ellipsoid objects in the image, especially in complex environments. For fruit detection in modern orchard settings, as an example, the edge information of clustered fruits is not salient [5] and is largely disrupted by nearby obstacles and background scenes, such as leaves and branches (see Fig. 1).

Adapting convolutional neural networks (CNNs) for object detection [6] and instance segmentation (e.g., Mask R-CNN [7]) to this canonical task is a promising way to extract object information. Directly fitting an ellipse on the output mask (i.e., Mask R-CNN in Fig. 1), however, fails to infer the entire object shape. Intuitively, there are two ways to modify the Mask R-CNN model for predicting ellipses: (1) adding a regression model right after the mask branch; (2) performing regression directly on the features from RoiAlign. As we show in the ablation study (see Fig. 6 and Fig. 9), both variants lose the ellipse orientation information, while our Ellipse R-CNN, by injecting the learned whole object information (see Fig. 1), achieves the best performance, especially in occluded and clustered scenarios. Moreover, the detected ellipses can be further exploited for 3D localization and size estimation of such ellipsoid objects, whereas bounding boxes are not a reliable 2D object representation due to insufficient geometric constraints from multiple views [8] (see Fig. 2).

Our goal is to accurately detect and represent the elliptical objects in a 2D image as the input for 3D localization, and to successfully infer the whole information of each object in an occluded and clustered scenario. Our approach is motivated by two key ideas. First, common elliptical objects, such as apples, oranges and peaches, can be modeled as ellipsoids in 3D, and their projections onto the 2D image should be ellipses. Second, the detection mechanism, just like that in the human brain, should be able to retrieve the whole elliptical object by focusing on its partially visible boundary information so as to handle different occlusion patterns effectively.

Our main contributions are twofold:

• We propose a robust and compact ellipse regression model that detects each individual elliptical object and parameterizes it as an ellipse. The proposed regression method is general and flexible enough to be applicable to any state-of-the-art detection model, in our case, a Mask R-CNN detector [7]. For better accuracy, the proposed feature regions before ellipse regression are refined by bounding box estimation and feature padding. The correctness of the regression strategy is validated across different synthetic datasets containing well-separated and clustered ellipses, respectively. We further analyze the improvement of our ellipse regression in an ablation study using the FDDB dataset [10].

• To better handle occlusion, we integrate the U-Net [11] structure into the detection model to generate decoded feature maps that contain retrieved hidden information. We further propose to learn various occlusion patterns such that the detection confidence score is computed by generalizing the occlusion information between the visible part and the whole estimated ellipse. In ablation experiments, we demonstrate that our approach indeed improves the detection performance compared to the Mask R-CNN baseline and its three variants using both synthetic and real datasets of occluded and clustered objects. Moreover, in heavily occluded settings, our approach achieves the best-reported performance on the datasets.

To the best of our knowledge, this paper is the first work to develop a CNN-based model that detects objects as ellipses and predicts ellipse parameters in one shot directly from the image, and it is the first attempt to handle occlusion from the perspective of ellipse representation.

## II Related Work

Since we develop the Mask R-CNN model as our base object detector to predict ellipse parameters in occluded cases, we review recent work on CNN-based object detectors, 3D object localization, and occlusion handling, respectively.

CNN-Based Object Detectors. Recent success in general object detection tasks on the Pascal [12], ImageNet [13], and MS COCO [14] datasets has been achieved by both single-shot [15, 16] and R-CNN [17, 6, 7] architectures. The single-shot methods formulate object detection as a single-stage regression problem to predict objects extremely fast. The R-CNN approaches, by integrating region proposal and classification, have greatly improved the accuracy, and are currently among the best performing detection paradigms.

Fruit detection, as a challenging example of elliptical object detection, has recently attracted intensive interest in machine vision and agricultural robotics [18, 19, 5]. Although recent works [20, 21], by tuning the head layers of R-CNN models, have shown good performance on well-separated fruits, the detection accuracy drops significantly as fruits cluster and occlude each other. The authors in [22] further report a comparative study of CNN-based models on various apple datasets. In modern orchards, fruit occlusions happen frequently due to the environmental complexity [23], especially when fruits are occluded by neighboring leaves and branches. It is even more challenging to estimate the 3D size and the 6D pose (position and orientation) of each individual fruit in such occluded and clustered cases.

Object Localization from 2D Detection. Recent research has addressed size and pose estimation from object detectors by modeling objects as quadrics in 3D [9, 8], but the whole object information in heavily occluded cases can hardly be retrieved from a bounding box that is derived from a small portion of the visible part (see Fig. 2). While some efforts have been made for 3D fruit localization [24, 25, 26, 27] by utilizing standard mapping techniques [28, 29], none of them could perform object size estimation because of low-resolution 3D reconstructions. Moreover, all these works represent visible object parts using bounding boxes, which are not appropriate for further estimating the object size and pose due to the spatial ambiguities in rectangle constraints of the ellipsoid. Specifically, fitting an inscribed ellipse for each bounding box [9] is not reasonable, since there could be infinitely many solutions for ellipse parameters that all satisfy the bounding box constraints [8] even from multiple views. We thus propose ellipse representation by developing the Mask R-CNN architecture as the baseline, and compare the detection accuracy in terms of masks generated from ellipse parameters.

Occlusion Handling. One of the most common applications that apply occlusion handling strategy is pedestrian detection. The part-based methods [30] propose to learn a set of specific manually designed occlusion patterns, in which either hand-crafted features [31] or pre-defined semantic parts [32] are employed. The drawback is that each occlusion pattern detector is learned separately, and it makes the whole procedure complex and hard to train. In contrast, our approach does not require pre-defining occlusion patterns. Moreover, pedestrians are shown vertically in the image and represented by bounding boxes, whereas the orientation of elliptical objects when described as ellipses cannot be predetermined and thus increases the complexity of learning patterns. We incorporate estimated ellipse parameters to learn a continuous vector that serves as a reference to generalize different learned patterns. The recent work [33] learns implicit information to infer the 6D pose of the object by using an autoencoder structure. In contrast, we integrate the U-Net structure to retrieve occluded information in decoded feature maps.

## III Overview of Bounding-Box Regression

Most single-stage and R-CNN based networks formulate object detection as a regression problem, which outputs a class-specific bounding box for each prediction [15, 16, 34, 17, 6, 7]. In our case (see Fig. 3), the input to the R-CNN models is a set of training pairs $\{({}^iP, {}^iG)\}$, where ${}^iP = ({}^iP_x, {}^iP_y, {}^iP_w, {}^iP_h)$ denotes the pixel coordinates of the center of a region proposal together with its width and height in pixels. The ground-truth (GT) bounding box ${}^iG$ corresponding to ${}^iP$ is defined in the same way as ${}^iP$. For example, in Mask R-CNN, proposals are generated by a region proposal network (RPN) [6] after the input image passes through the base net (i.e., ResNet-101 [35]). The feature maps for each proposal are cropped from the top convolutional feature maps generated by the feature pyramid network (FPN) [36] according to different proposal scales. The following RoiAlign layer [7] reshapes the cropped features to produce feature maps of the same size per proposal for classification and bounding-box regression.

The regression goal is to learn a transformation that maps a proposed region ${}^iP$ to a ground-truth box ${}^iG$. Instead of directly predicting the absolute coordinates and sizes of a bounding box, the model learns to estimate the relative offset parameters $t_x$, $t_y$, $t_w$, and $t_h$ to describe how different the proposal is compared to the ground truth (we drop the superscript $i$ for simplicity):

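$$
\begin{aligned}
t_x &= (P'_x - P_x)/P_w, & t_y &= (P'_y - P_y)/P_h, \\
t_w &= \log(P'_w/P_w), & t_h &= \log(P'_h/P_h), \\
t^*_x &= (G_x - P_x)/P_w, & t^*_y &= (G_y - P_y)/P_h, \\
t^*_w &= \log(G_w/P_w), & t^*_h &= \log(G_h/P_h),
\end{aligned}
\tag{1}
$$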

where $t^* = (t^*_x, t^*_y, t^*_w, t^*_h)$ is the regression target, and $P'$ denotes the predicted box that is recovered from $t = (t_x, t_y, t_w, t_h)$. For the object likelihood, the model considers the background (i.e., absence of objects) as another class, and predicts one confidence score for each class of interest plus one for the background. Specifically, for fruit or face detection, there exists only one class of interest, such that only two values are necessary for the output per proposal, with the class determined by the highest one.

There are two key benefits of predicting relative offset parameters in Eq. (1) for accurate bounding-box regression:

• All four parameters of each bounding box are normalized such that the objects even with hugely different sizes contribute equally to the total regression loss, which also means that the loss is unaffected by the image size.

• The normalization guarantees that all predicted values are close to zero (with small magnitudes) when the proposed region is near the ground-truth box, which stabilizes the training procedure without outputting unbounded values.
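The two properties above can be checked numerically. The following is a minimal sketch of the offset encoding and decoding in Eq. (1); the function names `encode_box` and `decode_box` and the example boxes are illustrative choices, not part of the original method description.

```python
import math

def encode_box(P, G):
    """Targets t* of Eq. (1) from proposal P to ground-truth box G.
    Boxes are (center_x, center_y, width, height) in pixels."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def decode_box(P, t):
    """Recover the predicted box P' from proposal P and offsets t."""
    px, py, pw, ph = P
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph,
            pw * math.exp(tw), ph * math.exp(th))

# Hypothetical proposal and ground-truth box.
P = (50.0, 40.0, 20.0, 10.0)
G = (54.0, 38.0, 30.0, 12.0)
t_star = encode_box(P, G)
G_rec = decode_box(P, t_star)

# Scaling both boxes by a common factor leaves the targets unchanged,
# which is the size-normalization property stated in the first bullet.
t_scaled = encode_box(tuple(3 * v for v in P), tuple(3 * v for v in G))

# A proposal that coincides with the ground truth yields all-zero targets.
t_zero = encode_box(G, G)
```

Decoding the encoded targets returns the original ground-truth box, so the two functions are inverses of each other for a fixed proposal.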

## IV Proposed Ellipse Regressor

Our key idea for ellipse regression is to infer relative offset parameters directly from visible parts so as to maintain the two key benefits described in Sec. III. By further learning occluded patterns, the confidence score of the visible part of an occluded object is boosted by the incorporated information of its estimated ellipse. Since Mask R-CNN [7] obtains the state-of-the-art results in general object detection and instance segmentation, we exploit its base model as our front-end network (see Fig. 1).

### IV-A Formulation of Ellipse Feature Regions

In geometry, a general ellipse oriented arbitrarily (see Fig. 4) can be defined by its five parameters: center coordinates $(x_o, y_o)$, semi-major and semi-minor axes $a$, $b$ ($a \ge b$), and rotation angle $\Theta$ (from the positive horizontal axis to the major axis of the ellipse). The canonical form of the general ellipse [37] is obtained as follows:

$$
\frac{(x'\cos\Theta + y'\sin\Theta)^2}{a^2} + \frac{(-x'\sin\Theta + y'\cos\Theta)^2}{b^2} = 1, \qquad x' = x - x_o, \quad y' = y - y_o,
\tag{2}
$$

where $\Theta$ is the ellipse orientation. We aim to train a regressor for predicting all five ellipse parameters, given a set of training pairs $\{({}^iP, {}^iE)\}$ as the input to the ellipse regressor, where ${}^iE$ is the ground-truth ellipse characterized by $({}^iE_x, {}^iE_y, {}^iE_a, {}^iE_b, {}^iE_\Theta)$ and ${}^iP$ is denoted in the same way as in Sec. III. This can be thought of as ellipse regression from a proposed feature region to a nearby ground-truth ellipse.

However, the strategy of bounding-box regression cannot be directly applied to ellipse regression. The major challenge is how to accurately preserve the ellipse orientation information in each feature region before the regression stage. For example, the RoiAlign layer [7] in the state-of-the-art methods resizes the rectangular proposed regions of various shapes into squares of a fixed size, but this distorts the feature maps and makes the prediction of the original ellipse orientation unstable (see Fig. 4 for more details).

We therefore propose, before the resizing operation, to extend each rectangular feature proposal $P$ to a square region $Q$, whose side length $Q_l$ only depends on the axis sizes of the ellipse to be predicted. The side length of the extended square region is derived as follows. In Eq. (2), we take the derivative of $y$ with respect to $x$:

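$$
\frac{\partial y}{\partial x} = -\frac{a^2 x' \sin^2\Theta - y'(a^2 - b^2)\sin\Theta\cos\Theta + b^2 x' \cos^2\Theta}{a^2 y' \cos^2\Theta - x'(a^2 - b^2)\sin\Theta\cos\Theta + b^2 y' \sin^2\Theta}.
\tag{3}
$$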

To determine the axis-aligned bounding box for the ellipse, we equate the numerator and denominator of Eq. (3) to zero separately, since zero numerator and denominator correspond to horizontal and vertical tangents of the ellipse, respectively. The bounding-box length along each axis is solved as:

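$$
\begin{cases}
\Delta x = 2\sqrt{a^2\cos^2\Theta + b^2\sin^2\Theta} \\
\Delta y = 2\sqrt{a^2\sin^2\Theta + b^2\cos^2\Theta}
\end{cases}.
\tag{4}
$$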

We further create an extended square enclosing the ellipse bounding box, whose side length is defined as the diagonal length of that bounding box: $Q_l = \sqrt{\Delta x^2 + \Delta y^2} = 2\sqrt{a^2 + b^2}$.

Thus, given a proposal $P$ closely bounding the ellipse $E$, we extend it to the square $Q$ sharing the same center, with side length $Q_l$. Besides avoiding distortion of the ellipse orientation, the other advantage of the extended feature region is that its size is still proportional to the ellipse size ($a$ and $b$) but independent of the ellipse angle $\Theta$. It implies that ellipses of the same size ($a$ and $b$) but with different orientation angles contribute equally to the regression loss (otherwise the different sizes of their axis-aligned bounding boxes would be weighted inconsistently in the loss and make the regression model sensitive to the ellipse angle). Our extending strategy also addresses the issue in direct resizing methods (see Fig. 4): since the prediction of $\Theta$ is coupled with the ellipse region shape (see $\Delta x$ and $\Delta y$ in Eq. (4)), the resizing step complicates the orientation learning process. The extended feature region thus serves as a stable reference to accurately predict ellipse offset parameters.
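The orientation independence of the extended square can be illustrated with a short sketch. It evaluates Eq. (4) and takes the square side length as the diagonal of the axis-aligned bounding box, as described above; the function names and the sample ellipse are illustrative assumptions.

```python
import math

def ellipse_bbox(a, b, theta):
    """Axis-aligned bounding-box extents (dx, dy) of Eq. (4)."""
    c, s = math.cos(theta), math.sin(theta)
    dx = 2.0 * math.sqrt(a * a * c * c + b * b * s * s)
    dy = 2.0 * math.sqrt(a * a * s * s + b * b * c * c)
    return dx, dy

def extended_square_length(a, b, theta):
    """Side length of the extended square: the bounding-box diagonal."""
    dx, dy = ellipse_bbox(a, b, theta)
    return math.hypot(dx, dy)

# Hypothetical ellipse with semi-axes 30 and 10 pixels.
a, b = 30.0, 10.0
angles = (0.0, math.pi / 6, math.pi / 4, math.pi / 2)
lengths = []
for theta in angles:
    dx, dy = ellipse_bbox(a, b, theta)
    ql = extended_square_length(a, b, theta)
    lengths.append(ql)
    print(f"theta={theta:.3f}  dx={dx:.2f}  dy={dy:.2f}  Q_l={ql:.2f}")
```

The bounding-box extents change with the angle, while the square side length stays at $2\sqrt{a^2 + b^2}$ for every orientation, which is why the square is a stable reference for the regression.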

### IV-B Ellipse Offsets Prediction

Given a square feature region $Q$ extended from a proposal $P$, our goal is to learn to regress the features within $Q$ to a set of relative offset parameters between $Q$ and a ground-truth ellipse $E$. We start from predicting elliptical objects without occlusion, and then propose stable offset parameters to handle occluded cases.

#### Unoccluded Ellipse Prediction

For well-separated objects, we parameterize the regression in terms of five outputs $\delta_x$, $\delta_y$, $\delta_a$, $\delta_b$, and $\delta_\Theta$ (the superscript $i$ is dropped for simplicity):

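$$
\begin{aligned}
\delta_x &= (E'_x - Q_x)/Q_l, \quad \delta_y = (E'_y - Q_y)/Q_l, \quad \delta_\Theta = E'_\Theta/\pi, \\
\delta_a &= \log(2E'_a/Q_l), \quad \delta_b = \log(2E'_b/Q_l), \\
\delta^*_x &= (E_x - Q_x)/Q_l, \quad \delta^*_y = (E_y - Q_y)/Q_l, \quad \delta^*_\Theta = E_\Theta/\pi, \\
\delta^*_a &= \log(2E_a/Q_l), \quad \delta^*_b = \log(2E_b/Q_l),
\end{aligned}
\tag{5}
$$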

where the GT ellipse angle $E_\Theta$ lies within $[-\pi/2, \pi/2]$, $\delta^* = (\delta^*_x, \delta^*_y, \delta^*_a, \delta^*_b, \delta^*_\Theta)$ is the ellipse regression target, and $E'$ is the predicted ellipse calculated from $\delta = (\delta_x, \delta_y, \delta_a, \delta_b, \delta_\Theta)$. $(\delta_x, \delta_y)$ specifies the scale-invariant translation of the center of $Q$ to $E'$, while $\delta_a$ and $\delta_b$ specify the log-space translations of the size of $Q$ to the semi-major and semi-minor axes of $E'$, respectively. $\delta_\Theta$ is the prediction of the normalized orientation of $E'$. In such an unoccluded case, the predicted offset values are all bounded when the proposed region is located close to the ground-truth ellipse (see Fig. 5).

#### Occluded Ellipse Prediction

For occluded object detection, training RPN [6] to propose regions of visible parts (instead of whole object regions) highly reduces false positives as shown in Sec. V. We infer the whole elliptical object from its visible part through ellipse regression.

However, as the visible region becomes small (so that $Q_l$ in Eq. (5) approaches zero) and lies near the object boundary, the target values to learn ($\delta^*_x$, $\delta^*_y$, $\delta^*_a$, and $\delta^*_b$) in Eq. (5) grow unboundedly in magnitude, which destabilizes the training process. We thus propose to predict one more offset parameter $\delta_s$ for the scale $s$ (see Fig. 5):

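$$
\begin{aligned}
\delta_x &= s'(E'_x - Q_x)/Q_l, \quad \delta_y = s'(E'_y - Q_y)/Q_l, \\
\delta_a &= \log(2s'E'_a/Q_l), \quad \delta_b = \log(2s'E'_b/Q_l), \\
\delta_s &= \log((s' + 1)/2), \quad \delta_\Theta = E'_\Theta/\pi, \\
\delta^*_x &= s(E_x - Q_x)/Q_l, \quad \delta^*_y = s(E_y - Q_y)/Q_l, \\
\delta^*_a &= \log(2sE_a/Q_l), \quad \delta^*_b = \log(2sE_b/Q_l), \\
\delta^*_s &= \log((s + 1)/2), \quad \delta^*_\Theta = E_\Theta/\pi,
\end{aligned}
\tag{6}
$$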

where $s$ characterizes the visibility ratio, calculated between the side length of the extended square $Q$ (from the visible part) and the side length of the square enclosing the ellipse $E$ (i.e., the whole object region). By predicting $\delta_s$, we transfer the offset reference from the visible part to the whole object region, which guarantees that all predicted values (including the scale target $\delta^*_s$) are bounded even in heavily occluded cases (when the proposed region is near the small visible part). Specifically, when $s = 1$ and thus $\delta^*_s = 0$, Eq. (6) and Eq. (5) are equivalent, which means that Eq. (6) is a generalized formulation of ellipse offsets prediction that can handle both unoccluded and occluded cases.

After learning such offset parameters $\delta$, we can transform an input extended region $Q$ into a predicted ellipse $E'$ by applying the transformation:

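$$
\begin{aligned}
E'_x &= \frac{Q_l}{s'}\delta_x + Q_x, \quad E'_y = \frac{Q_l}{s'}\delta_y + Q_y, \quad s' = 2\exp(\delta_s) - 1, \\
E'_a &= \frac{Q_l}{2s'}\exp(\delta_a), \quad E'_b = \frac{Q_l}{2s'}\exp(\delta_b), \quad \theta' = \pi\delta_\Theta, \\
E'_\Theta &= \begin{cases}
\operatorname{atan2}(\sin\theta', \cos\theta') & \text{if } \cos\theta' \ge 0 \\
\operatorname{atan2}(-\sin\theta', -\cos\theta') & \text{if } \cos\theta' < 0
\end{cases},
\end{aligned}
\tag{7}
$$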

where $E'_\Theta$ is rectified from $\theta'$ such that $E'_\Theta \in [-\pi/2, \pi/2]$.
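To make the relation between Eq. (6) and Eq. (7) concrete, the sketch below encodes a ground-truth ellipse into offset targets and decodes it back. The function names, the example values, and in particular the way the visibility ratio is computed here (visible-square side divided by $2\sqrt{E_a^2 + E_b^2}$, the side of the square enclosing the whole ellipse) are assumptions made for illustration, not a statement of the original implementation.

```python
import math

def rectify_angle(theta):
    """Wrap an angle into [-pi/2, pi/2], as in the last line of Eq. (7)."""
    s, c = math.sin(theta), math.cos(theta)
    return math.atan2(s, c) if c >= 0 else math.atan2(-s, -c)

def encode_ellipse(Q, E, s):
    """Targets of Eq. (6). Q = (Qx, Qy, Ql); E = (Ex, Ey, Ea, Eb, Etheta);
    s is the visibility ratio supplied by the caller."""
    qx, qy, ql = Q
    ex, ey, ea, eb, et = E
    return {
        "x": s * (ex - qx) / ql,
        "y": s * (ey - qy) / ql,
        "a": math.log(2 * s * ea / ql),
        "b": math.log(2 * s * eb / ql),
        "s": math.log((s + 1) / 2),
        "theta": et / math.pi,
    }

def decode_ellipse(Q, d):
    """Inverse transformation of Eq. (7)."""
    qx, qy, ql = Q
    s = 2 * math.exp(d["s"]) - 1
    return (ql / s * d["x"] + qx,
            ql / s * d["y"] + qy,
            ql / (2 * s) * math.exp(d["a"]),
            ql / (2 * s) * math.exp(d["b"]),
            rectify_angle(math.pi * d["theta"]))

# Hypothetical whole ellipse and a small visible square near its boundary.
E = (100.0, 80.0, 30.0, 10.0, 0.4)
Q = (120.0, 85.0, 15.0)
whole_length = 2.0 * math.sqrt(E[2] ** 2 + E[3] ** 2)  # assumed definition
s = Q[2] / whole_length
targets = encode_ellipse(Q, E, s)
E_rec = decode_ellipse(Q, targets)

# With s = 1 the scale target vanishes and Eq. (6) reduces to Eq. (5).
unoccluded = encode_ellipse(Q, E, 1.0)
```

Under this assumed definition of $s$, the axis targets reduce to $\log(E_a / \sqrt{E_a^2 + E_b^2})$ and its counterpart for $E_b$, which do not depend on how small the visible square is.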

#### Ellipse Regression Loss

For a proposed region ${}^iP$, we define the regression loss as:

 (8)

where ${}^ip^* = 1$ indicates that ${}^iP$ is positive (if the intersection-over-union (IoU) overlap with its ground-truth box is higher than a ratio [6]), while ${}^ip^* = 0$ if ${}^iP$ is non-positive. The robust loss function (smooth $L_1$) is defined in [17], and $\rho$ is the transformation function defined as:

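$$
\begin{aligned}
\rho({}^i\delta_\Theta, {}^i\delta^*_\Theta) &= \begin{cases}
\operatorname{atan2}(\sin{}^i\varphi, \cos{}^i\varphi) & \text{if } \cos{}^i\varphi \ge 0 \\
\operatorname{atan2}(-\sin{}^i\varphi, -\cos{}^i\varphi) & \text{if } \cos{}^i\varphi < 0
\end{cases}, \\
{}^i\varphi &= {}^i\delta_\Theta - {}^i\delta^*_\Theta,
\end{aligned}
\tag{9}
$$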

which rectifies the ellipse orientation loss of ${}^i\delta_\Theta$ compared to ${}^i\delta^*_\Theta$ around critical angles (for example, the angle difference between $-\pi/2$ and $\pi/2$ should be zero rather than $\pi$).
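The effect of the rectification can be seen in a small numerical sketch. It applies the case distinction of Eq. (9) to an angle difference expressed in radians, which is an assumption of this example (the normalized offsets would first be scaled by $\pi$), and then feeds the result to the smooth $L_1$ function of [17]. The function names are illustrative.

```python
import math

def rho(phi):
    """Rectified angle difference following Eq. (9); phi is in radians."""
    s, c = math.sin(phi), math.cos(phi)
    return math.atan2(s, c) if c >= 0 else math.atan2(-s, -c)

def smooth_l1(x):
    """Smooth L1 loss: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

# Two nearly identical ellipses whose angles sit on opposite sides
# of the critical orientation (hypothetical values).
pred = math.pi / 2 - 0.01
target = -math.pi / 2 + 0.01
naive = pred - target        # close to pi, although the shapes almost coincide
rectified = rho(naive)       # small signed difference

loss_naive = smooth_l1(naive)
loss_rectified = smooth_l1(rectified)
```

Without the rectification, the two almost identical ellipses would be penalized as if they were nearly perpendicular to each other.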

### IV-C Feature Region Refinement

Traditional R-CNN based methods generate regression and classification outputs directly from proposed feature regions. However, relying only on roughly proposed feature maps from the RPN may be risky and error-prone, especially when predicting ellipse orientation in heavily occluded cases (see Fig. 6). Specifically, there exists a mismatch between a predicted visible region and its feature representation (see Fig. 7). Thus, our idea is to perform ellipse regression and classification based on the refined feature region output by a bounding-box regressor. This strategy alleviates the issue by allowing the model to exploit the features of the exact predicted visible region, which makes the inference output more reliable.

Based on the extended predicted region, the RoiAlign layer re-extracts a small feature map (e.g., 14×14), and accurately aligns the extracted features with the input from FPN. Features in the extended square but located outside the predicted visible region have a negative effect on predicting accurate ellipse parameters (see Fig. 6). To reduce the interference of such unrelated features, we perform zero padding on the extended feature area. Our proposed method is simple: we use floor and ceiling operations to compute the boundaries of the smallest rectangle that encloses the bilinear-interpolated feature map from the predicted region, and pad zeros in the remaining area of the extended square. For example, and are two width limits of the resized rectangle whose center is assumed at , where is the width of the predicted region and is the resizing factor. The refined feature region leads to large improvements as we show in Sec. V.

### IV-D Learning Occlusion Patterns

Diverse appearances of occluded objects lead to a large variety of occlusion patterns (see Fig. 1). Traditional networks are likely to assign a low confidence to an occluded object due to its hidden parts. Our key idea for occlusion handling is to employ channel-wise attention in refined features by learning different occlusion patterns in one coherent model. Our model can leverage the prediction confidence of the visible part of an elliptical object based on the inference of the whole ellipse from the occlusion (see Fig. 8).

#### Occluded Ellipse Patterns

Given the refined features of a predicted visible region, we exploit a U-Net [11] structure to learn the occluded ellipse shape within the extended square (see Fig. 8). The ground truth of the occluded ellipse shape is generated as follows. For an occluded object, we identify a bounding box of the visible part with its ellipse parameters. The generated GT whole ellipse is then cropped and resized by a predicted visible region, and placed at the center of the extended square. The GT visible ellipse is thus obtained without being occluded by other nearby obstacles. Unlike previous work [32, 40], our method does not rely on any particular discrete set of occlusion patterns or any external classifier for guidance, and thus can be trained in an end-to-end manner.

By learning occluded ellipse patterns, the low-dimensional latent features encode both partial visibility and ellipse shape information [33]. Therefore, we perform ellipse regression directly from the latent features. The ellipse offsets are obtained via a multilayer perceptron (MLP) [41] (see Fig. 8).

#### Visible Part Attention

Many recent works [44, 45] find that convolution filters of different feature channels respond to their specific high-level concepts, which are associated with different semantic parts. To boost the detection confidence in occluded cases, our intuition is to allow the network to decide how much each channel should contribute in the refined features $f_r$. Specifically, the channels representing the visible parts should be weighted more, while those representing the occluded parts should be weighted less. We thus re-weight the refined features $f_r$ as $f_o$:

$$
f_o = w_r^\top f_r, \qquad f_r = [f_1, f_2, \ldots, f_H]^\top,
\tag{10}
$$

where $w_r$ is the attention weighting vector regressed from the latent features (learned partial visibility) by an MLP, and $H$ is the total number of channels (e.g., 256). The re-weighted features $f_o$ are further regressed into a feature for classification.

Various ellipse orientations may increase the learning complexity of occlusion patterns. To compensate for the orientation effect, we propose to concatenate the feature with a latent feature (used for ellipse regression in Fig. 8) to incorporate both partial visibility and whole ellipse information. The concatenated feature thus learns various occlusion patterns, and passes through the classification head to output the final confidence scores.

#### Training Objective

The R-CNN based models have two types of losses: RPN loss $L_{RPN}$ and head loss [6] (composed of classification loss $L_{cls}$ and regression loss $L_{reg}$). We redefine $L_{reg}$ as the sum of the loss of the feature region refinement and the ellipse regression loss of Eq. (8). On top of that, our occlusion handling introduces one additional loss $L_{occl}$ defined as the average binary cross-entropy loss. The loss function of the whole system can be written as follows:

$$
L = L_{RPN} + \frac{1}{N}\sum_i \Big( L_{cls}(i) + {}^ip^* \big( L_{reg}(i) + L_{occl}(i) \big) \Big),
\tag{11}
$$

where the loss $L_{cls}$ is over two classes (object vs. background), and the GT label ${}^ip^*$ is 1 if feature region $i$ (of $N$ regions in total) is positive (i.e., an object) and 0 otherwise.
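The way the GT label gates the regression and occlusion terms in Eq. (11) can be sketched as follows. The per-region loss values are made-up numbers, the dictionary layout and function names are hypothetical, and the occlusion loss is assumed here to be averaged over the pixels of the predicted occluded-ellipse shape.

```python
import math

def binary_cross_entropy(pred, target, eps=1e-7):
    """Average binary cross-entropy over the entries of one region."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(pred)

def total_loss(l_rpn, regions):
    """Eq. (11). Each region is a dict with per-region losses
    'cls', 'reg', 'occl' and the GT label 'positive' (0 or 1)."""
    n = len(regions)
    head = sum(r["cls"] + r["positive"] * (r["reg"] + r["occl"])
               for r in regions)
    return l_rpn + head / n

# One positive and one background region (illustrative values only).
occl = binary_cross_entropy([0.9, 0.1], [1, 0])
regions = [
    {"cls": 0.2, "reg": 0.5, "occl": occl, "positive": 1},
    {"cls": 0.7, "reg": 9.9, "occl": 9.9, "positive": 0},
]
loss = total_loss(0.3, regions)
```

The large regression and occlusion values of the background region do not affect the result, since its label is zero and only its classification loss is counted.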

## V Experiments

In this section, we first introduce synthetic and real datasets we use for the experiments, followed by a description of the implementation details and evaluation metrics. After that, we show experimental results of the ablation study for our Ellipse R-CNN detector, and make a comparison to the state of the art. In the end, we demonstrate how Ellipse R-CNN helps improve the accuracy of 3D object estimation in occluded cases.

### V-A Datasets

We validate the proposed Ellipse R-CNN on four datasets: synthetic occluded ellipses (SOE), synthetic occluded fruits (SOF), real occluded fruits (ROF) and FDDB [10] datasets. Each elliptical object is annotated by its five ellipse parameters of the whole object region along with a bounding box of the visible part (except for the FDDB dataset as shown in Fig. 5).

The SOE dataset consists of 16,500 images in total, approximately 15,000 images are for training and the rest for testing. Synthetic images are generated from a cluster of different ellipses occluded from each other in the same distribution as in Fig. 6. The image background is randomly filled by the Pascal dataset [12] with randomly added triangles (simulating nearby obstacles) to further occlude ellipses (the visibility ratio of each ). To introduce more interference, ellipse colors are randomly generated in a roughly same tone as in real cases (e.g., clustered fruits and faces).

The SOF dataset contains 3,545 images (3,040 for training and 505 for testing) of a cluster of fruits occluded in a realistic tree (), which are generated by changing different poses and sizes of each model in Unreal Engine (UE) with the background randomly filled by images taken from different real orchards [27]. The GT ellipses are obtained by projecting the 3D fruit ellipsoids onto the corresponding images [47] based on camera poses.

The ROF dataset (1,115 images in total) is human-annotated and is built upon the MinneApple [48] and ACFR [20] datasets, from which we crop out the sub-images of heavily occluded fruit clusters. We perform a training-and-test split similar to that in [20], with 900 images for training and 215 images for testing. The FDDB dataset [10] includes 2,845 images of 5,171 faces that are split into ten folds. Since most faces are well-separated and only have GT ellipses (without GT visible boxes), we only demonstrate the generalization of our ellipse regressor on this dataset through 10-fold cross-validation.

### V-B Implementation Details

We use TensorFlow [49] to implement and train the Ellipse R-CNN. For comparison, we directly use the source code of Mask R-CNN provided by Matterport [50]. For the training, we use the pre-trained weights for MS COCO [14] to initialize the Ellipse R-CNN, and use a step strategy with mini-batch stochastic gradient descent (SGD) to train the networks on a GeForce GTX 1080 GPU. On SOF, ROF, and FDDB datasets, we train with an initial learning rate of for 20,000 iterations and train for another 10,000 iterations with a decreased learning rate of . On the SOE dataset, we start with the same learning rate of , and then decrease the learning rate by 5 after every 20,000 iterations. The model converges at 50,000 iterations. During the training, we perform on-the-fly data augmentation with flipping, shifting, and rotation at random. We resize the ellipse and fruit images to 128×128, while the face images are resized to 256×256 in order to have face details in a higher resolution for training and testing.

### V-C Evaluation Metrics

Four evaluation metrics are exploited in all of our experiments: average precision (AP [14] over ellipse IoU thresholds), log-average miss rate (MR) [51], and their angle-based counterparts (AP and MR over ellipse angle errors). MR is the average value of miss rates for 9 FPPI (false positives per image) rates evenly spaced in the log-space ranging from $10^{-2}$ to $10^{0}$. By introducing the two angle-based metrics, we focus more on the accuracy of predicted ellipse angles. For example, we consider a prediction (evaluated by either angle-based metric) as a false positive if its ellipse IoU is less than 0.75 (set as the default IoU) or its angle error is greater than . To clearly show the performance difference, we use strict criteria: for instance, the IoU level starts from 0.75 up to 0.95 with an interval of 0.05 (e.g., written as ), and the angle error decreases from to with an interval (e.g., written as ). We use AP and MR to measure the overall performance as they place a significantly large emphasis on localization and miss rate in occluded cases, respectively.
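As an illustration of the ellipse IoU that the thresholds above refer to, the sketch below rasterizes two ellipses into pixel masks using Eq. (2) and computes their overlap. Measuring the IoU on masks generated from ellipse parameters follows the description in Sec. II; the pixel-center sampling, the 128×128 grid, and the function names are choices made for this example only.

```python
import math

def ellipse_mask(E, height, width):
    """Set of pixels (row, col) whose centers lie inside ellipse
    E = (xo, yo, a, b, theta), tested with Eq. (2)."""
    xo, yo, a, b, theta = E
    c, s = math.cos(theta), math.sin(theta)
    inside = set()
    for r in range(height):
        for col in range(width):
            xp, yp = col + 0.5 - xo, r + 0.5 - yo
            u = (xp * c + yp * s) / a
            v = (-xp * s + yp * c) / b
            if u * u + v * v <= 1.0:
                inside.add((r, col))
    return inside

def ellipse_iou(E1, E2, height=128, width=128):
    """Intersection over union of two rasterized ellipses."""
    m1 = ellipse_mask(E1, height, width)
    m2 = ellipse_mask(E2, height, width)
    union = len(m1 | m2)
    return len(m1 & m2) / union if union else 0.0

# Same center and axes, but the prediction is rotated by 90 degrees.
gt = (64.0, 64.0, 30.0, 10.0, 0.3)
pred = (64.0, 64.0, 30.0, 10.0, 0.3 + math.pi / 2)
iou_same = ellipse_iou(gt, gt)
iou_rotated = ellipse_iou(gt, pred)
area_pixels = len(ellipse_mask(gt, 128, 128))
```

A prediction with the correct center and axes but a wrong orientation already falls far below the 0.75 IoU level for an elongated ellipse, which shows how strongly the IoU-based metrics depend on the predicted angle.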

### V-D Performance of Ellipse R-CNN

We compare the proposed Ellipse R-CNN to the baseline model Mask R-CNN, which obtains the state-of-the-art results in general object detection and instance segmentation. Since our model is the first work of ellipse regression, to make a fair comparison, we fit ellipses directly from the mask outputs of Mask R-CNN (trained on the regions of whole objects) using the method of minimum volume enclosing ellipsoid [52] in 2D (i.e., Mask R-CNN+). We run a number of ablations to further analyze Ellipse R-CNN. For the ablation study of occlusion handling, we adapt two state-of-the-art methods in our model: DeepParts+ and SENet+. In DeepParts+, we only keep the U-Net structure to learn a set of 45 occlusion patterns, and the final score is obtained via an MLP on the part detection scores [32]. For SENet+, we learn the attention vector directly from the refined feature maps (without U-Net), and perform the classification only on the re-weighted features [46].

#### Accuracy of Ellipse Regression

The key component of our Ellipse R-CNN is the ellipse regressor. Some examples of detected elliptical objects are illustrated in Figs. 9-12. Tables I-II show the breakdown performance of the ellipse prediction on the SOE and SOF datasets, whose GT is perfectly generated based on the geometry of object models. Our strategy of ellipse regression (e.g., Ellipse R-CNN-) leads to significant performance improvements on all metrics compared to the baseline model. Specifically, Table I shows that both and values of the proposed model are not sensitive to the increased levels of angle errors, which means that our strategy achieves a high accuracy of ellipse orientation prediction. We also observe that the Mask R-CNN+ model trained on whole object regions (instead of visible parts) outputs more false positives due to the high similarities among the proposed feature regions (see Figs. 9-11). For the ROF dataset, Table III shows a higher sensitivity of our model on and compared to those on SOE and SOF datasets: is higher than that in Table II but drops a lot. The reason is that most human-annotated fruits are close to circles whose GT orientation information is noisy and inconsistent. Thus, it is hard to quantify the results on and but our proposed model still achieves the best performance on AP and MR.

#### Validity of Feature Region Refinement

Tables I-III show the detailed performance breakdown of the proposed feature region refinement (i.e., Ellipse R-CNN- with R) on the SOE, SOF and ROF datasets. The performance improves substantially when the refined features are used for ellipse regression and classification. The improvements in and indicate that the refinement strategy not only increases the accuracy of ellipse region prediction but also reduces false positives in classification, especially in occluded cases. However, Table IV shows smaller improvements when we apply the feature refinement strategy to the FDDB dataset: and are only improved by 5.5 and 4.6, respectively. As discussed in Sec. IV-C, feature region refinement is used to remove the interference of nearby occlusions. Most faces in the FDDB dataset are well separated, and there are few clustered or occluded cases. Thus, the improvements from the refined features in Table IV are not as significant as those in Tables I-III.

#### Performance of Occlusion Handling

One of our evaluation goals is occlusion handling, whose overall performance is measured by MR and as shown in Tables I-III. All three variants with different occlusion handling mechanisms show some improvement over the baseline (i.e., Ellipse R-CNN- with R), ranging from 2.1 to 7.2 on and from 4.3 to 8.6 on . Overall, the error rates can be sorted in the following order: DeepParts+ > SENet+ > Ellipse R-CNN*. The reason is that DeepParts+ is limited by the fixed number of occlusion patterns it learns, while SENet+ learns a continuous attention vector to adjust feature weights but lacks the whole-ellipse information needed to generalize across different occlusions. We further compare our Ellipse R-CNN to Ellipse R-CNN* (without concatenating ). The gap between them demonstrates that our concatenation of with is a more effective way of generalizing various occlusion patterns from ellipse predictions.
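
The attention-based re-weighting that SENet+ relies on can be sketched as a generic squeeze-and-excitation step in the spirit of [46]. This is a simplified NumPy illustration under stated assumptions, not the SENet+ variant evaluated here: the function name `se_reweight`, the two-layer bottleneck, and the externally supplied weights `w1`, `b1`, `w2`, `b2` are all hypothetical.

```python
import numpy as np


def se_reweight(features, w1, b1, w2, b2):
    """Re-weight the channels of a feature map with a learned attention vector.

    features: array of shape (H, W, C)
    w1, b1:   bottleneck layer, shapes (C, C_r) and (C_r,)
    w2, b2:   expansion layer, shapes (C_r, C) and (C,)
    Returns (re-weighted features, attention vector of shape (C,)).
    """
    squeezed = features.mean(axis=(0, 1))                    # global average pool -> (C,)
    hidden = np.maximum(0.0, squeezed @ w1 + b1)             # ReLU bottleneck -> (C_r,)
    attention = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))    # sigmoid gate in (0, 1)
    return features * attention, attention                   # broadcast over H and W
```

The attention vector depends only on pooled feature statistics of the region, which is consistent with the observation above that such a mechanism has no access to the regressed whole-ellipse information.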

#### Generalization of Ellipse Regressor

To investigate the generalization ability of the proposed ellipse regressor, we also perform experiments on the FDDB dataset. Since no GT visible boxes are available and few objects are clustered, we can only evaluate our model without the occlusion handling mechanism (i.e., Ellipse R-CNN- with R). Focusing on the accuracy of orientation prediction, we show the results of 10-fold cross-validation in Table IV, where our model outperforms the Mask R-CNN+ baseline by 13.1 on and 8.6 on , respectively. We also show some qualitative results in Fig. 12, where our detector produces robust ellipse detections even in some extreme cases. Specifically, in all seven examples, several faces are heavily occluded by the image boundaries. Mask R-CNN+ produces many distorted face shapes, while our detector accurately infers the whole ellipse regions for all of them.

#### Discussion on 3D Object Estimation

To understand how Ellipse R-CNN improves the accuracy of 3D object estimation, we implement multi-view 3D localization using quadrics [9] from 2D detections on the SOF dataset. We compare our detector with the Mask R-CNN and summarize the results in Table V. The evaluation metrics are the rotation error [53], the position error, and the relative size error in 3D, each averaged over all objects. For each UE setting (24 different settings in total), we select three images taken from different view angles and use them as identical inputs for both methods. As the comparison shows, all three estimation errors of Ellipse R-CNN are much lower than those of Mask R-CNN+, especially the rotation error (i.e., vs. ). This is because Ellipse R-CNN better infers the whole region of each object directly from the visible part, and is thus more effective in estimating the 3D pose and shape of occluded objects. More qualitative results are shown in Fig. 13.
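
The link between 2D ellipse detections and a 3D ellipsoid in quadric-based localization is the standard projective relation C* = P Q* Pᵀ, where Q* is the 4×4 dual quadric of the ellipsoid, P is the 3×4 camera matrix, and C* is the 3×3 dual conic of the image ellipse. The sketch below shows only this forward projection, as an aid to reading the discussion above; it is not the multi-view estimation pipeline of [9] or the authors' code, and the function names are assumptions.

```python
import numpy as np


def ellipsoid_dual_quadric(center, rotation, semi_axes):
    """4x4 dual quadric of an ellipsoid with the given pose and semi-axes."""
    Z = np.eye(4)
    Z[:3, :3] = rotation                       # 3x3 rotation of the ellipsoid
    Z[:3, 3] = center                          # 3D position of its center
    a, b, c = semi_axes
    return Z @ np.diag([a ** 2, b ** 2, c ** 2, -1.0]) @ Z.T


def project_to_ellipse(Q_dual, P):
    """Project a dual quadric with camera matrix P and read off the ellipse.

    Returns (center, semi_major, semi_minor, theta) in image coordinates.
    """
    C = P @ Q_dual @ P.T                       # dual conic, defined up to scale
    C = C / -C[2, 2]                           # fix the scale so that C[2, 2] = -1
    center = -C[:2, 2]
    S = C[:2, :2] + np.outer(center, center)   # equals R diag(a^2, b^2) R^T
    evals, evecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    semi_minor, semi_major = np.sqrt(evals)
    vx, vy = evecs[:, 1]                       # direction of the major axis
    theta = np.arctan2(vy, vx) % np.pi
    return center, semi_major, semi_minor, theta
```

Each detected ellipse constrains Q* through this relation, which is why an ellipse that covers only the visible part of an occluded object, or whose orientation is wrong, propagates directly into the estimated 3D rotation, position and size.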

## VI Conclusion

This paper shows that traditional R-CNN methods are not well suited for ellipse fitting, since they only predict bounding boxes that carry no orientation information about the objects, and they are typically trained on the whole object regions in occluded cases. As a result, those deep models output a large number of false positives and are unreliable as inputs for further 3D estimation of object pose and size. We thus propose Ellipse R-CNN, which focuses on the visible regions and infers whole elliptical objects as ellipses under heavy occlusion. A robust ellipse regression is formulated to handle both occluded and unoccluded cases. Our model first learns various occlusion patterns of ellipses within the refined visible regions, then generates the final classification score by integrating the visibility information from an attention vector and the whole-object information from the regressed ellipse. In this way, the model learns discriminative representations of occluded objects that are robust across different orientations. Extensive experimental results on two synthetic datasets and two real datasets demonstrate the advantages of our model over Mask R-CNN. The current approach for 3D object estimation weights each predicted ellipse parameter from the 2D detections equally. Our future work will investigate predicting the uncertainties of all ellipse parameters to further boost the accuracy of the 3D object estimation system.

## Acknowledgment

We thank our colleagues Nicolai Häni and Zhihang Deng from the University of Minnesota, for providing valuable feedback and technical support throughout this research.

### References

1. Y. Xie and Q. Ji, “A new efficient ellipse detection method,” in Object recognition supported by user interaction for service robots, vol. 2.   IEEE, 2002, pp. 957–960.
2. S.-C. Zhang and Z.-Q. Liu, “A robust, real-time ellipse detector,” Pattern Recognition, vol. 38, no. 2, pp. 273–287, 2005.
3. W. Lu and J. Tan, “Detection of incomplete ellipse in images with strong noise by iterative randomized hough transform (irht),” Pattern Recognition, vol. 41, no. 4, pp. 1268–1279, 2008.
4. D. K. Prasad, M. K. Leung, and S.-Y. Cho, “Edge curvature and convexity based ellipse detection method,” Pattern Recognition, vol. 45, no. 9, pp. 3204–3221, 2012.
5. P. Roy and V. Isler, “Vision-based apple counting and yield estimation,” in International Symposium on Experimental Robotics.   Springer, 2016, pp. 478–487.
6. S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 6, pp. 1137–1149, 2017.
7. K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
8. L. Nicholson, M. Milford, and N. Sünderhauf, “Quadricslam: Dual quadrics from object detections as landmarks in object-oriented slam,” IEEE Robotics and Automation Letters, vol. 4, no. 1, pp. 1–8, 2019.
9. C. Rubino, M. Crocco, and A. Del Bue, “3d object localisation from multi-view image detections,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 6, pp. 1281–1294, 2018.
10. V. Jain and E. Learned-Miller, “Fddb: A benchmark for face detection in unconstrained settings,” 2010.
11. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
12. M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
13. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
14. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
15. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
16. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision.   Springer, 2016, pp. 21–37.
17. R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
18. Q. Wang, S. Nuske, M. Bergerman, and S. Singh, “Automated crop yield estimation for apple orchards,” in Experimental robotics.   Springer, 2013, pp. 745–758.
19. C. Hung, J. Underwood, J. Nieto, and S. Sukkarieh, “A feature learning based approach for automated fruit yield estimation,” in Field and service robotics.   Springer, 2015, pp. 485–498.
20. S. Bargoti and J. Underwood, “Deep fruit detection in orchards,” in 2017 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2017, pp. 3626–3633.
21. N. Häni, P. Roy, and V. Isler, “Apple counting using convolutional neural networks,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2018, pp. 2559–2565.
22. N. Häni, P. Roy, and V. Isler, “A comparative study of fruit detection and counting methods for yield mapping in apple orchards,” Journal of Field Robotics, 2018.
23. W. Dong and V. Isler, “Linear velocity from commotion motion,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 3467–3472.
24. J. Das, G. Cross, C. Qu, A. Makineni, P. Tokekar, Y. Mulgaonkar, and V. Kumar, “Devices, systems, and methods for automated monitoring enabling precision agriculture,” in 2015 IEEE International Conference on Automation Science and Engineering (CASE).   IEEE, 2015, pp. 462–469.
25. P. Roy and V. Isler, “Surveying apple orchards with a monocular vision system,” in 2016 IEEE International Conference on Automation Science and Engineering (CASE).   IEEE, 2016, pp. 916–921.
26. P. Roy, W. Dong, and V. Isler, “Registering reconstructions of the two sides of fruit tree rows,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2018, pp. 1–9.
27. W. Dong, P. Roy, and V. Isler, “Semantic mapping for orchard environments by merging two-sides reconstructions of tree rows,” Journal of Field Robotics, 2018.
28. C. Wu, “Towards linear-time incremental structure from motion,” in 2013 International Conference on 3D Vision-3DV 2013.   IEEE, 2013, pp. 127–134.
29. R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
30. M. Mathias, R. Benenson, R. Timofte, and L. Van Gool, “Handling occlusions with franken-classifiers,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1505–1512.
31. M. Enzweiler, A. Eigenstetter, B. Schiele, and D. M. Gavrila, “Multi-cue pedestrian classification with partial occlusion handling,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.   IEEE, 2010, pp. 990–997.
32. Y. Tian, P. Luo, X. Wang, and X. Tang, “Deep learning strong parts for pedestrian detection,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1904–1912.
33. M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel, “Implicit 3d orientation learning for 6d object detection from rgb images,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 699–715.
34. J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
35. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
36. T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
37. R. Larson, Precalculus with limits: A graphing approach.   Nelson Education, 2014.
38. C. Y. Young, Precalculus.   John Wiley & Sons, 2010, ch. 9.
39. L. Matthey, I. Higgins, D. Hassabis, and A. Lerchner, “dsprites: disentanglement testing sprites dataset (2017),” URL https://github.com/deepmind/dsprites-dataset, 2017.
40. C. Zhou and J. Yuan, “Multi-label learning of part detectors for heavily occluded pedestrian detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3486–3495.
41. C. M. Bishop et al., Neural networks for pattern recognition.   Oxford university press, 1995.
42. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
43. V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
44. D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network dissection: Quantifying interpretability of deep visual representations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6541–6549.
45. A. Gonzalez-Garcia, D. Modolo, and V. Ferrari, “Do semantic parts emerge in convolutional neural networks?” International Journal of Computer Vision, vol. 126, no. 5, pp. 476–494, 2018.
46. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
47. W. Qiu, F. Zhong, Y. Zhang, S. Qiao, Z. Xiao, T. S. Kim, and Y. Wang, “Unrealcv: Virtual worlds for computer vision,” in Proceedings of the 25th ACM international conference on Multimedia.   ACM, 2017, pp. 1221–1224.
48. N. Häni, P. Roy, and V. Isler, “Minneapple: A benchmark dataset for apple detection and segmentation,” arXiv preprint arXiv:1909.06441, 2019.
49. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
50. W. Abdulla, “Mask r-cnn for object detection and instance segmentation on keras and tensorflow,” https://github.com/matterport/Mask_RCNN, 2017.
51. P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 4, pp. 743–761, 2011.
52. N. Moshtagh et al., “Minimum volume enclosing ellipsoid,” Convex optimization, vol. 111, no. January, pp. 1–9, 2005.
53. W. Dong and V. Isler, “A novel method for the extrinsic calibration of a 2d laser rangefinder and a camera,” IEEE Sensors Journal, vol. 18, no. 10, pp. 4200–4211, 2018.