Weakly-Supervised Amodal Instance Segmentation with Compositional Priors
Amodal segmentation in biological vision refers to the perception of the entire object when only a fraction is visible. This ability to see through occluders and reason about occlusion is innate to biological vision but not adequately modeled in current machine vision approaches. A key challenge is that ground-truth supervision for amodal object segmentation is inherently difficult to obtain. In this paper, we present a neural network architecture that is capable of amodal perception when weakly supervised with standard (modal) bounding box annotations. Our model extends compositional convolutional neural networks (CompositionalNets), which have been shown to be robust to partial occlusion by explicitly representing objects as compositions of parts. In particular, we extend CompositionalNets by: 1) expanding the innate part-voting mechanism in the CompositionalNets to perform instance segmentation; and 2) exploiting the internal representations of CompositionalNets to enable amodal completion for both bounding boxes and segmentation masks. Our extensive experiments show that our proposed model can segment amodal masks robustly, with much improved mask prediction quality compared to state-of-the-art amodal segmentation approaches.
In our everyday life, we often observe partially occluded objects. Although occluders have highly variable forms and appearances, the human vision system can localize and segment the visible parts of an object and use them as cues to approximately perceive the complete structure of the object. This perception of the object's complete structure under occlusion is referred to as amodal perception (Nanay, 2018). Likewise, the perception of visible parts is known as modal perception.
In computer vision, amodal instance segmentation is important to study, both for its theoretical value and for its real-world applications. Its theoretical similarity to human vision allows for additional insights into the structure of the visual pathway. Its real-world importance lies in the benefit of seeing through occluders, for example perceiving partially occluded vehicles in their completeness during autonomous driving. In order to perform amodal segmentation, a vision model must be robust to partial occlusion. Recent works have shown that current deep learning approaches are far less robust than humans at classifying partially occluded objects (Zhu et al., 2019; Kortylewski et al., 2019). In contrast to deep convolutional neural networks (DCNNs), compositional models are much more robust to partial occlusion, as they gain their robustness by mimicking the compositionality of human cognition and share characteristics with biological vision systems, such as bottom-up object part encoding and top-down attention modulation in the ventral stream (Sasikumar et al., 2018; Roe et al., 2012; Carlson et al., 2011).
Recently, Compositional Convolutional Neural Networks (CompositionalNets) have been proposed as compositional models built upon neural feature activations, which can robustly classify and detect objects under partial occlusion (Kortylewski et al., 2020a). More specifically, Wang et al. (2020) proposed Context-Aware CompositionalNets, which decompose the image into a mixture of object and context. Although Context-Aware CompositionalNets are shown to be robust at detecting objects under partial occlusion, they are not sufficient to perform weakly-supervised amodal segmentation, for three reasons. 1) Context-Aware CompositionalNets lack internal priors of the object shape, and therefore cannot perform amodal segmentation. 2) Context-Aware CompositionalNets gather votes from the object parts to vote for an object-level classification. This is not sufficient, since amodal segmentation requires pixel-level classification. 3) Context-Aware CompositionalNets are high-precision models that require the object center to be aligned to the center of the image. In practice, however, it is difficult to locate the object center, because only partial bounding box proposals are available for partially occluded objects.
In this work, we propose to build on and significantly extend Context-Aware CompositionalNets in order to enable them to perform amodal instance segmentation robustly with modal bounding box supervision. In particular, we introduce a two-stage model. First, we classify a proposed region and estimate its amodal bounding box by localizing the proposed region on the complete structural representation of the predicted object. Then, we perform per-pixel classification in the estimated amodal region, identifying both visible and invisible regions of the object in order to compute the amodal segmentation mask. Our extensive experiments show that our proposed model can segment amodal masks robustly, with much improved mask prediction quality compared to current methods under various levels of supervision. In summary, we make several important contributions in this work:
We introduce spatial priors that explicitly encode prior knowledge of the object's pose and shape in the compositional representation, thus enabling weakly-supervised segmentation.
We implement Partial Classification, which maintains the model's accuracy with incomplete object proposals by sampling over all possible spatial placements of the proposal within the internal representation.
We implement Amodal Completion from partial bounding boxes by enforcing symmetry upon the maximum deviation from the object center caused by the spatial placement.
We implement Amodal Segmentation by explicitly classifying the visible and invisible regions within the estimated amodal proposal.
2 Related Work
Robustness to Occlusion. In image classification, typical DCNN approaches are significantly less robust to partial occlusion than human vision (Zhu et al., 2019; Kortylewski et al., 2019). Although some efforts in data augmentation with partial occlusion or top-down cues are shown to be effective in reinforcing robustness (DeVries and Taylor, 2017; Xiao et al., 2019), Wang et al. (2020) demonstrate that these efforts are still limited. In object detection, a number of deep learning approaches have been proposed by Zhang et al. (2018) and Narasimhan (2019) for detecting occluded objects; however, these require detailed part-level annotations for occlusion reconstruction. In contrast, CompositionalNets, which integrate compositional models with a DCNN architecture, are significantly more robust in image classification under partial occlusion. Additionally, Context-Aware CompositionalNets, which disentangle their foreground and context representations, are shown to be more robust in object detection under occlusion.
Weakly-supervised Instance Segmentation. As observed in biological vision, pixel-level annotations are not necessary to accomplish object segmentation, since distinguishing between foreground and context in a given region is mainly automatic. Similarly, the feasibility of weakly-supervised instance segmentation in computer vision has been explored. Hsu et al. (2019) achieve figure/ground separation by exploiting the bounding box tightness prior to generate positive and negative bags based on the sweeping lines of each bounding box. Additionally, Zhou et al. (2018) propose to use image-level annotations to supervise instance segmentation by exploiting class peak responses to enable a classification network to extract instance masks.
Amodal Perception. One of the first works in amodal instance segmentation was proposed by Li and Malik (2016), with an artificially generated occlusion dataset. Recently, with the release of datasets that contain pixel-level amodal mask annotations, such as KINS and Amodal COCO, further progress has been made (Qi et al., 2019; Zhu et al., 2017). For instance, Zhan et al. (2020) propose a self-supervised network that performs scene de-occlusion, which recovers hidden scene structures without ordering or amodal annotations as supervision. However, their approach assumes mutual occlusions and is thus unfit to perform amodal segmentation when the occluding object is not annotated in the dataset.
3 Weakly Supervised Amodal Segmentation
In Section 3.1, we discuss prior work on CompositionalNets and Context-Aware CompositionalNets. We discuss our extensions to the probabilistic model of Context-Aware CompositionalNets and how they enable weakly-supervised amodal instance segmentation in Section 3.2. Lastly, we discuss the end-to-end training of our model for weakly supervised amodal segmentation in Section 3.3.
Notation. The output of layer $l$ in the DCNN is referred to as the feature map $F^l = \psi(I, \Omega) \in \mathbb{R}^{H \times W \times D}$, where $I$ and $\Omega$ are the input image and the parameters of the feature extractor, respectively. Feature vectors are vectors in the feature map, $f_i^l \in \mathbb{R}^D$ at position $i$, where $i$ is defined on the 2D lattice of $F^l$, with $D$ being the number of channels in layer $l$. We omit the index $l$ in the following for clarity, since the layer $l$ is fixed a priori in the experiments.
3.1 Prior Work: Context-Aware CompositionalNets
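CompositionalNets. CompositionalNets, as proposed by Kortylewski et al. (2020b), are DCNN classifiers that are inherently robust to partial occlusion. Their architecture resembles that of a regular DCNN, but the fully connected head is replaced with a differentiable compositional model built upon the feature activations $F$. They define a probabilistic generative model $p(F \mid y)$, with $y$ being the category of the object. Specifically, the compositional model is defined as a mixture of von-Mises-Fisher (vMF) distributions:
$$p(F \mid \Theta_y) = \sum_{m=1}^{M} \nu_m \, p(F \mid \theta_y^m), \qquad \nu_m \in \{0, 1\}, \quad \sum_{m=1}^{M} \nu_m = 1 \tag{1}$$
$$p(F \mid \theta_y^m) = \prod_i p(f_i \mid \mathcal{A}_{i,y}^m, \Lambda), \qquad p(f_i \mid \mathcal{A}_{i,y}^m, \Lambda) = \sum_{k=1}^{K} \alpha_{i,k,y}^m \, p(f_i \mid \lambda_k) \tag{2}$$
Here $M$ is the number of mixtures of compositional models per object category and $\nu_m$ is a binary assignment variable that indicates which mixture component is active. $\Theta_y = \{\theta_y^m = \{\mathcal{A}_y^m, \Lambda\} \mid m = 1, \dots, M\}$ are the overall compositional model parameters for the category $y$, and $\mathcal{A}_y^m = \{\mathcal{A}_{i,y}^m \mid i \in [H, W]\}$ are the parameters of the mixture components at every position $i$ on the 2D lattice of the feature map $F$. In particular, $\mathcal{A}_{i,y}^m = \{\alpha_{i,k,y}^m \mid k = 1, \dots, K\}$ are the vMF mixture coefficients and $\Lambda = \{\lambda_k = \{\sigma_k, \mu_k\} \mid k = 1, \dots, K\}$ are the parameters of the vMF mixture distributions. Note that $K$ is the number of components in the vMF mixture distributions and that the vMF mixture coefficients sum to one, $\sum_{k=1}^{K} \alpha_{i,k,y}^m = 1$. Each vMF distribution is defined as
$$p(f_i \mid \lambda_k) = \frac{e^{\sigma_k \mu_k^T f_i}}{Z(\sigma_k)}, \qquad \|f_i\| = 1, \quad \|\mu_k\| = 1 \tag{3}$$
where $Z(\sigma_k)$ is the normalization constant. The model parameters can be trained end-to-end as described in Kortylewski et al. (2020b).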
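As an illustration of the per-position vMF mixture likelihood described above, the following is a minimal Python sketch rather than the authors' implementation. It assumes unit-normalized features and kernels, a single shared concentration `sigma` (so the normalization constant is the same for every kernel and is dropped), and independence across positions. All function and argument names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (the vMF model assumes unit vectors)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def vmf_activation(f, mu, sigma):
    """Unnormalized vMF density exp(sigma * mu^T f) for unit vectors f and mu."""
    return math.exp(sigma * sum(a * b for a, b in zip(f, mu)))


def position_likelihood(f, alphas, mus, sigma):
    """Mixture of vMF kernels at one position: sum_k alpha_k * vMF(f | mu_k)."""
    return sum(a * vmf_activation(f, mu, sigma) for a, mu in zip(alphas, mus))


def log_likelihood(features, alpha_per_position, mus, sigma):
    """Log-likelihood of a flattened feature map under one mixture component.

    Assumption: positions are independent, so per-position logs are summed.
    `features` and `alpha_per_position` are lists with one entry per position.
    """
    return sum(
        math.log(position_likelihood(normalize(f), alphas, mus, sigma))
        for f, alphas in zip(features, alpha_per_position)
    )
```

Working in the log domain keeps the product over positions numerically stable, which matters once the feature map has more than a handful of positions.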
Context awareness. As introduced by Wang et al. (2020), Context-Aware CompositionalNets expand on the standard CompositionalNets and explicitly separate the representation of the context from that of the object by representing the feature map as a mixture of the two:
$$p(f_i \mid \mathcal{A}_{i,y}^m, \chi_{i,y}^m, \Lambda) = \omega \, p(f_i \mid \chi_{i,y}^m, \Lambda) + (1 - \omega) \, p(f_i \mid \mathcal{A}_{i,y}^m, \Lambda) \tag{4}$$
Here, the object representation is disentangled into the foreground representation $\mathcal{A}_{i,y}^m$ and the context representation $\chi_{i,y}^m$. The scalar $\omega$ is a prior that controls the trade-off between context and object, and it is fixed a priori at test time. It is shown that although context is helpful in detecting objects under partial occlusion, relying too strongly on context can be misleading when objects are strongly occluded, leading to a relatively high object confidence in background regions.
In order to achieve foreground/context disentanglement, training images are segmented into either object or context based on the contextual feature centers, $e_q \in \mathbb{R}^D$, learned from the available bounding box annotations. Here, the assumption is that any feature with a receptive field outside of the bounding boxes is a contextual feature. Thus, a dictionary of context feature centers can be learned by clustering a population of randomly extracted contextual features using K-means++ (Arthur and Vassilvitskii, 2007). Finally, each feature vector $f_i$ is classified as either foreground or context: it is assigned to the context if it is sufficiently similar to one of the context feature centers $e_q$, and to the foreground otherwise.
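The foreground/context decision above can be sketched as a nearest-center test. This is an illustrative Python sketch under stated assumptions, not the procedure of Wang et al. (2020): it assumes cosine similarity as the similarity measure, and the default `threshold` value as well as all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_feature(f, context_centers, threshold=0.5):
    """Label a feature vector as 'context' or 'foreground'.

    A feature is treated as context when its best cosine similarity to any
    context center reaches `threshold` (an assumed, illustrative value).
    """
    best = max(cosine(f, e) for e in context_centers)
    return "context" if best >= threshold else "foreground"
```

In practice the centers would come from clustering contextual features (for example with K-means++), and the threshold would be tuned on held-out data.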
3.2 Weakly-supervised Amodal Instance Segmentation
Segmentation with Spatial Compositional Priors. The Context-Aware CompositionalNets proposed by Wang et al. (2020) generate object-level predictions, i.e. class labels, by gathering votes from local part detectors. Our objective, on the other hand, is to generate pixel-level predictions to perform instance segmentation. A simple strategy would be to use the ratio between the context and the foreground likelihood from Equation 4. While this can give reasonable results, as shown by Kortylewski et al. (2020c), a major limitation of this approach is that the prior $\omega$ is independent of the position $i$ and the object pose $m$. However, the likelihood of a feature being part of the context clearly depends on the shape of the object and hence on these variables.
Therefore, we propose a spatial prior that explicitly encodes prior knowledge of the object pose and shape in the representation model. As seen in Figure 1, the compositional prior is defined over all positions $i$ and for every mixture $m$. Note how the prior clearly resembles the object shape and 3D pose. Formally, we learn the prior by computing the average foreground segmentation of the training images that are used to train the mixture model $m$ of class $y$. We extend the probabilistic compositional model to incorporate the learned spatial priors as a mixture model:
To segment $F$ into foreground and context, we use the ratio between the two components:
The spatial prior also allows us to estimate the context separation during training more accurately in an EM-type manner. In particular, we perform an initial segmentation following the approach proposed by Wang et al. (2020). Subsequently, we learn the spatial prior and update the initial segmentation using Equation 7. As illustrated in Figure 1b, the spatial prior improves in both tightness and confidence through the iterative updates, since explicit prior knowledge of the object shape outperforms the contextual features at instance segmentation.
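The alternation between segmenting with the prior and re-estimating the prior can be sketched as follows. This is a minimal Python sketch under assumptions: the decision rule compares the prior-weighted foreground likelihood against the complementary-weighted context likelihood, the prior is re-estimated as the mean of binary masks, and all names are hypothetical.

```python
def segment(prior, fg_lik, ctx_lik):
    """Per-position foreground mask from a spatial prior and two likelihood maps.

    A position is foreground when prior * p(foreground) exceeds
    (1 - prior) * p(context). All inputs are 2D lists of equal shape.
    """
    return [
        [p * a > (1.0 - p) * c for p, a, c in zip(p_row, a_row, c_row)]
        for p_row, a_row, c_row in zip(prior, fg_lik, ctx_lik)
    ]


def update_prior(masks):
    """Re-estimate the spatial prior as the average of binary masks.

    `masks` holds one 2D boolean mask per training image assigned to the
    same mixture component; all masks must share one shape.
    """
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [sum(m[r][c] for m in masks) / n for c in range(width)]
        for r in range(height)
    ]
```

One EM-type round would call `segment` on every training image with the current prior and then feed the resulting masks to `update_prior`.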
Maximum Likelihood Alignment of Partial Feature Maps. As pointed out by Wang et al. (2020), CompositionalNets are high-precision models because they assume that the object is aligned to the center of the compositional model. However, this assumption is only valid if the amodal bounding box is available, and hence it does not hold when a bounding box proposal contains only a part of the object. This poses a substantial barrier to applying CompositionalNets to amodal perception, since the targeted objects are occluded and amodal bounding boxes may not be available during training or inference. Therefore, we propose to obtain the maximum likelihood alignment of feature maps by searching over the spatial placement of the partial feature map $F$ on the compositional representation. This removes the alignment constraint and, consequently, allows us to leverage partial proposals for amodal perception.
Here, the padded feature map denotes $F$ with a particular zero padding that aligns the top-left corner of $F$ to a position defined on the 2D lattice of the internal compositional representation; the set of admissible positions is determined by the spatial dimensions of $F$ and of the representation.
As shown in Figure 1c, by maximizing the likelihood of $F$ on the representation, we are able to localize $F$ correctly on the compositional representation. As we show in the next paragraph, this localization is combined with the compositional priors to estimate the amodal region.
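The placement search can be sketched as an exhaustive scan over top-left offsets. This is a simplified Python sketch, not the authors' procedure: it scores only the overlapping positions instead of a zero-padded full map, and it takes an arbitrary per-position `score` function (for example a log-likelihood) as a hypothetical argument.

```python
def align(feat, rep, score):
    """Find the placement of a partial map `feat` inside a larger map `rep`.

    `feat` (h x w) and `rep` (H x W, with H >= h and W >= w) are 2D lists;
    `score(feature_entry, representation_entry)` returns a per-position
    score where higher is better. Returns (best_total, (row, col)), with
    (row, col) the top-left offset of `feat` on the lattice of `rep`.
    """
    h, w = len(feat), len(feat[0])
    H, W = len(rep), len(rep[0])
    best = None
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            total = sum(
                score(feat[i][j], rep[r + i][c + j])
                for i in range(h)
                for j in range(w)
            )
            # Keep the first placement that attains the maximum score.
            if best is None or total > best[0]:
                best = (total, (r, c))
    return best
```

The scan costs one pass over the partial map per candidate offset, which is affordable at feature-map resolution.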
Amodal Bounding Box Completion. After obtaining the maximum likelihood placement and the corresponding representation model, we proceed to estimate the complete structure of the object and perform amodal completion at the bounding box level. The estimate of the amodal bounding box depends both on the compositional prior and on the localization of $F$ on the representation. For the rest of this paragraph, we shift the global axes from the image to the representation. The object center, in this case, is trivially defined as the center of the representation. We assume that any bounding box is defined by its top-left and bottom-right corners. We propose to estimate the amodal bounding box from the modal bounding box:
Here, the estimate uses the maximum displacement vector observed at the localized placement. By applying this displacement symmetrically about the object center, an amodal estimate of the object region is generated.
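One reading of the symmetric completion step is sketched below in Python. It assumes the maximum displacement is taken per axis as the largest distance from the object center to either edge of the modal box, in representation coordinates. The function name and box layout `(x1, y1, x2, y2)` are hypothetical.

```python
def amodal_box(modal_box, center):
    """Estimate an amodal box by mirroring the modal box about the center.

    `modal_box` is (x1, y1, x2, y2) with top-left and bottom-right corners,
    `center` is (cx, cy); both are in the coordinates of the representation.
    """
    x1, y1, x2, y2 = modal_box
    cx, cy = center
    # Largest horizontal and vertical displacement from the object center.
    dx = max(abs(x1 - cx), abs(x2 - cx))
    dy = max(abs(y1 - cy), abs(y2 - cy))
    # Apply the displacement symmetrically on both sides of the center.
    return (cx - dx, cy - dy, cx + dx, cy + dy)
```

For example, a modal box `(2, 3, 9, 6)` around a center `(5, 5)` reaches 4 units to the right and 2 units upward, so the estimate extends 4 units to the left and 2 units downward as well.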
Amodal Instance Segmentation with CompositionalNet. As discussed above, segmentation with CompositionalNets is treated as per-pixel binary classification between foreground and context on the feature layer. In order to perform amodal instance segmentation, both the visible and the invisible mask of the object must be explicitly obtained. Therefore, we propose a third category for the per-pixel classification, denoting the occluded pixels of the object.
Reasonably, these occluded pixels of the object have a high compositional prior and a low likelihood. Since we view occluded regions as unexplainable by our compositional representation, rather than modeling explicit occluders, we propose an outlier model whose representation is broadly defined over the entire dataset, in an attempt to model any feature vector that the compositional representation cannot explain. Here, the outlier model has the same dimensions as the compositional representation at a single position, so its likelihood is calculated in the same way as the foreground likelihood. This way, occlusion can be properly modeled by a high activation of the outlier model compared to the compositional and context models. By combining the high compositional prior and the low likelihood, we formulate the probability that any feature vector is classified as an occluded part of the object as below:
Since the amodal mask is defined as the union of the visible and invisible masks, a pixel belongs to the amodal segmentation if it is classified as either visible foreground or occluded object.
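The three-way decision and the resulting amodal mask can be sketched as below. This Python sketch encodes one plausible scoring rule, which is an assumption rather than the formulation in the text: the spatial prior weights both the visible-foreground and the outlier likelihood, and its complement weights the context likelihood. All names are hypothetical.

```python
def classify_position(prior, p_fg, p_ctx, p_out):
    """Label one position as 'visible', 'occluded' or 'context'.

    `prior` is the spatial prior of the object at this position; `p_fg`,
    `p_ctx` and `p_out` are the likelihoods under the foreground, context
    and outlier models. An occluded object pixel has a high prior but is
    better explained by the outlier model than by the foreground model.
    """
    scores = {
        "visible": prior * p_fg,
        "occluded": prior * p_out,
        "context": (1.0 - prior) * p_ctx,
    }
    return max(scores, key=scores.get)


def amodal_mask(labels):
    """Union of the visible and occluded labels, as a 2D boolean mask."""
    return [[lab in ("visible", "occluded") for lab in row] for row in labels]
```

The modal mask is obtained from the same labels by keeping only the `"visible"` entries.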
3.3 End-to-End Training
Overall, the trainable parameters of our model are optimized with the ground-truth modal bounding box and class label as supervision. The loss function has three main objectives: 1) improve classification accuracy under occlusion (classification loss); 2) promote maximum likelihood for the compositional and context representations (representation loss); 3) improve amodal segmentation quality (segmentation loss).
Training Classification with Regularization. We optimize the parameters jointly using SGD, where the classification loss is the cross-entropy loss between the model output and the true class label $y$.
Training the Generative Model with Maximum Likelihood. Here, we use the representation loss to enforce a maximum likelihood for both the compositional and the context representations over the dataset. Note that the mixture assignment is inferred in the forward process, and that the outlier model is learned a priori and then fixed.
Training Segmentation with Regularization. The segmentation loss is based on the bounding box tightness prior proposed by Hsu et al. (2019). Without it, the representations would be motivated to focus on specific regions of the object instead of the complete object; the segmentation loss is therefore significant, as it motivates the representations to have consistent explainability over the entire object.
Here, the loss is computed from the predicted mask in image space and the bounding box used as supervision. The positive set contains the sweep rows and columns within the bounding box, while the negative set contains the sweep rows and columns directly outside the bounding box. Additionally, the pairwise set contains all neighboring pixel pairs, and a scalar weight controls the trade-off between the two loss terms. Intuitively, the segmentation loss is composed of two parts. The first part is referred to as the unary term, as it enforces that every row or column of pixels within the bounding box contains at least one pixel that is recognized as part of the predicted mask, while discouraging mask predictions outside of the bounding box. The second part is referred to as the pairwise term, as it enforces pairwise smoothness within the predicted mask.
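A tightness-prior loss with a unary and a pairwise term can be sketched as follows. This Python sketch is written in the spirit of Hsu et al. (2019) and is not their implementation. As simplifying assumptions, positive bags are the box-restricted rows and columns, negative bags are all full rows and columns outside the box (not only those directly adjacent), the bag score is the maximum pixel probability, and the pairwise term is a squared difference over 4-connected neighbors. The names `tightness_loss`, `lam` and `eps` are hypothetical.

```python
import math


def tightness_loss(mask, box, lam=1.0, eps=1e-7):
    """Bounding box tightness loss on a 2D list of mask probabilities.

    `box` is (r1, c1, r2, c2) with inclusive row and column bounds, and
    `lam` weights the pairwise smoothness term against the unary term.
    """
    r1, c1, r2, c2 = box
    H, W = len(mask), len(mask[0])
    # Positive bags: every row and column segment crossing the box.
    pos = [[mask[r][c] for c in range(c1, c2 + 1)] for r in range(r1, r2 + 1)]
    pos += [[mask[r][c] for r in range(r1, r2 + 1)] for c in range(c1, c2 + 1)]
    # Negative bags: full rows and columns lying outside the box.
    neg = [mask[r] for r in range(H) if r < r1 or r > r2]
    neg += [[mask[r][c] for r in range(H)] for c in range(W) if c < c1 or c > c2]
    # Unary term: each positive bag needs one confident pixel, no negative
    # bag should contain any.
    unary = -sum(math.log(max(bag) + eps) for bag in pos)
    unary -= sum(math.log(1.0 - max(bag) + eps) for bag in neg)
    # Pairwise term: squared differences between horizontal and vertical
    # neighbors.
    pairwise = sum(
        (mask[r][c] - mask[r][c + 1]) ** 2 for r in range(H) for c in range(W - 1)
    )
    pairwise += sum(
        (mask[r][c] - mask[r + 1][c]) ** 2 for r in range(H - 1) for c in range(W)
    )
    return unary + lam * pairwise
```

A mask that fills the box exactly pays only the smoothness cost along its boundary, whereas an empty mask or a mask covering the whole image is penalized heavily by the unary term.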
End-to-end training. We train all parameters of our model end-to-end with an overall loss function that combines the three loss terms above, with two scalar weights controlling the trade-off between them.
We perform experiments on weakly-supervised amodal instance segmentation under both artificially generated and real-world occlusion.
Datasets. While it is important to evaluate the approach on real images of partially occluded objects, simulating occlusion enables us to quantify the effects of partial occlusion more accurately. For the artificial dataset, we evaluate our approach on the OccludedVehiclesDetection dataset proposed by Wang et al. (2020). We remove the train category from evaluation due to inaccurate mask annotations that pertain to only one segment of the train. Both the object and its context are occluded by objects such as humans, animals and plants cropped from the MS-COCO dataset. The loss of contextual information increases the difficulty of amodal segmentation, as the overall amodal structure of the object is removed. The OccludedVehiclesDetection dataset contains 9 occlusion levels along two dimensions, which include three levels of object occlusion: FG-L1: 20-40%, FG-L2: 40-60% and FG-L3: 60-80% of the object area occluded, and three levels of context occlusion around the object: BG-L1: 0-20%, BG-L2: 20-40% and BG-L3: 40-60% of the contextual area occluded.
For the realistic dataset, we evaluate our approach on the KINS dataset proposed by Qi et al. (2019). Similar to the OccludedVehiclesDetection dataset, we split the objects into 3 object occlusion levels: FG-L1: 1-30%, FG-L2: 30-60% and FG-L3: 60-90%. We restrict the scope of the evaluation to vehicles that have a minimum amodal height of 50 pixels, as segmentation quality becomes less meaningful when the object resolution is too low.
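Assigning a KINS object to an occlusion level can be sketched from its visible and amodal mask areas. This Python sketch is illustrative only: the half-open interval boundaries, the label `FG-L0` for non-occluded objects, and the handling of objects above 90% occlusion are assumptions, and all names are hypothetical.

```python
# (label, lower bound inclusive, upper bound exclusive) of the occluded fraction.
KINS_LEVELS = [("FG-L1", 0.01, 0.30), ("FG-L2", 0.30, 0.60), ("FG-L3", 0.60, 0.90)]


def occlusion_ratio(visible_area, amodal_area):
    """Fraction of the amodal mask that is not visible."""
    return 1.0 - visible_area / amodal_area


def occlusion_level(visible_area, amodal_area, levels=KINS_LEVELS):
    """Return the occlusion level label, or None if outside all levels."""
    ratio = occlusion_ratio(visible_area, amodal_area)
    if ratio < levels[0][1]:
        return "FG-L0"  # treated as non-occluded
    for name, low, high in levels:
        if low <= ratio < high:
            return name
    return None  # more than 90% occluded: outside the evaluated range
```

Grouping objects this way before averaging a mask metric removes the bias toward the much more frequent non-occluded objects.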
CompositionalNets. We implement the end-to-end training of our proposed model with the following parameter settings: training minimizes the loss described in Section 3.3, with , and . We apply the Adam optimizer proposed by Kingma and Ba (2014) with learning rate . Our proposed model is trained for a total of 1 epoch of iterations. Training takes a total of 2 hours on a machine with 1 NVIDIA TITAN Xp GPU.
BBTP, proposed by Hsu et al. (2019), exploits the bounding box tightness prior as its mechanism to generate segmentation masks with weak supervision. BBTP is trained for iterations, with a learning rate/decay, , . It is trained on non-occluded objects with amodal bounding boxes. Due to its weakly supervised nature, it is not possible to introduce occluder information into training, so augmented training is not feasible.
PCNet-M, proposed by Zhan et al. (2020), learns amodal completion in a self-supervised manner by artificially placing other objects in the dataset as occluders on the objects, given modal segmentation masks. It is trained for iterations, with a learning rate, , . Mask R-CNN, proposed by He et al. (2017), serves as the modal segmentation network for PCNet-M. It is trained for iterations, with a learning rate/decay, , . Similarly, it is also trained on non-occluded objects. Due to its self-supervised amodal completion, augmented training is implied within the model's construction. Therefore, PCNet-M is viewed as the fully supervised approach, as opposed to our weakly supervised model.
Evaluation. In the KINS dataset, the occlusion levels of objects are severely imbalanced: over of the objects are non-occluded and less than of objects are in the highest occlusion level. Therefore, in order to examine the mask prediction quality as a function of occlusion level, we evaluate with region proposals as supervision, which removes the bias toward non-occluded objects and lets us separate objects into subsets based on their occlusion level during evaluation. Since BBTP is only trained on complete amodal bounding boxes, it is unreasonable to evaluate it with modal bounding boxes. Therefore, it is evaluated with amodal bounding boxes. On the other hand, since PCNet-M focuses on self-supervision without occlusion annotations during training, PCNet-M is evaluated with modal bounding boxes. Finally, we evaluate our approach separately in the same setting as each of the two models.
4.1 Amodal Segmentation under Simulated Occlusion
PCNet-M. First, it is essential to note that PCNet-M requires the ground-truth occluder segmentation mask. Furthermore, PCNet-M cannot reason about partial occlusion and amodal completion if the occluder category is unknown during training. In the case of the OccludedVehiclesDetection dataset, the occluder class labels are not given, so it becomes necessary to give additional information to PCNet-M. In contrast, our approach does not require any additional information regarding the occluder during inference. From the results in Table 1, we can observe that, although PCNet-M is trained with mask supervision, our approach is able to outperform PCNet-M in amodal segmentation at higher levels of object occlusion.
BBTP. The proposed model is able to outperform BBTP in amodal segmentation across all occlusion settings, including non-occluded objects. Hence, our model achieves state-of-the-art performance at weakly supervised amodal segmentation.
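Table 1: Amodal segmentation results on the OccludedVehiclesDetection dataset (header only).
|FG Occ. Level|-|0|1|1|1|2|2|2|3|3|3|Mean|
|BG Occ. Level|-|-|1|2|3|1|2|3|1|2|3|-|
Table 2: Amodal segmentation results on the KINS dataset (header only).
|FG Occ. Level|-|0|1|2|3|Mean|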
4.2 Amodal Segmentation under Realistic Occlusion
Table 2 shows the results of the tested models on the KINS dataset and Figure 2b shows the qualitative results. Notably, the trend observed on the OccludedVehiclesDetection dataset is also found on the KINS dataset with realistic occlusion.
PCNet-M. As seen in Table 2, PCNet-M outperforms CompositionalNets at lower levels of occlusion, but fails to perform amodal completion over large occluded regions in cases of high occlusion.
BBTP. As observed above, CompositionalNets outperform BBTP in segmentation across all occlusion levels.
In this work, we studied the problem of weakly-supervised amodal instance segmentation with partial bounding box annotations only. We made the following contributions to advance the state of the art in weakly-supervised amodal instance segmentation: 1) We extend the Context-Aware CompositionalNets with innate spatial priors of the object shape to enable weakly-supervised amodal instance segmentation. 2) We enable CompositionalNets to predict the amodal bounding box of an object based on a modal (partial) bounding box, via maximum likelihood alignment of the partial feature representation with the internal object representation. 3) We show that deep networks are capable of amodal perception when they are augmented with compositional and spatial priors. Furthermore, we demonstrate that deep networks can learn the necessary knowledge in a weakly supervised manner from bounding box annotations only.
- Arthur and Vassilvitskii (2007). K-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms.
- Carlson et al. (2011). A sparse object coding scheme in area V4. Current Biology.
- DeVries and Taylor (2017). Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
- He et al. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
- Hsu et al. (2019). Weakly supervised instance segmentation using the bounding box tightness prior. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox and R. Garnett (Eds.), pp. 6586–6597.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kortylewski et al. (2020a). Compositional convolutional neural networks: a deep architecture with innate robustness to partial occlusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8940–8949.
- Kortylewski et al. (2020b). Compositional convolutional neural networks: a deep architecture with innate robustness to partial occlusion. IEEE Conference on Computer Vision and Pattern Recognition.
- Kortylewski et al. (2020c). Compositional convolutional neural networks: a robust and interpretable model for object recognition under occlusion. arXiv preprint arXiv:2006.15538.
- Kortylewski et al. (2019). Combining compositional models and deep networks for robust object classification under occlusion. arXiv preprint arXiv:1905.11826.
- Li and Malik (2016). Amodal instance segmentation. In European Conference on Computer Vision, pp. 677–693.
- Nanay (2018). The importance of amodal completion in everyday perception. i-Perception 9 (4), pp. 2041669518788887.
- Narasimhan (2019). Occlusion-Net: 2D/3D occluded keypoint localization using graph networks. IEEE Conference on Computer Vision and Pattern Recognition.
- Qi et al. (2019). Amodal instance segmentation with KINS dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Roe et al. (2012). Toward a unified theory of visual area V4. Neuron.
- Sasikumar et al. (2018). First-pass processing of value cues in the ventral visual pathway. Current Biology.
- Wang et al. (2020). Robust object detection under occlusion with context-aware CompositionalNets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12645–12654.
- Xiao et al. (2019). TDAPNet: prototype network with recurrent top-down attention for robust object classification under partial occlusion. arXiv preprint arXiv:1909.03879.
- Zhan et al. (2020). Self-supervised scene de-occlusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Zhang et al. (2018). Occlusion-aware R-CNN: detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 637–653.
- Zhou et al. (2018). Weakly supervised instance segmentation using class peak response. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3791–3800.
- Zhu et al. (2019). Robustness of object recognition under extreme occlusion in humans and computational models. CogSci Conference.
- Zhu et al. (2017). Semantic amodal segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).