Combining Compositional Models and Deep Networks For
Robust Object Classification under Occlusion
Deep convolutional neural networks (DCNNs) are powerful models that yield impressive results at object classification. However, recent work has shown that they do not generalize well to partially occluded objects and to mask attacks. In contrast to DCNNs, compositional models are robust to partial occlusion; however, they are not as discriminative as deep models. In this work, we combine DCNNs and compositional object models to retain the best of both approaches: a discriminative model that is robust to partial occlusion and mask attacks. Our model is learned in two steps. First, a standard DCNN is trained for image classification. Subsequently, we cluster the DCNN features into dictionaries. We show that the dictionary components resemble object part detectors and learn the spatial distribution of parts for each object class. We propose mixtures of compositional models to account for large changes in the spatial activation patterns (e.g. due to changes in the 3D pose of an object). At runtime, an image is first classified by the DCNN in a feedforward manner. The prediction uncertainty is used to detect partially occluded objects, which in turn are classified by the compositional model. Our experimental results demonstrate that combining compositional models and DCNNs resolves a fundamental problem of current deep learning approaches to computer vision: the combined model recognizes occluded objects, even when it has not been exposed to occluded objects during training, while at the same time maintaining high discriminative performance for non-occluded objects.
In natural images, objects are surrounded and partially occluded by other objects. Humans seem more robust to partial occlusion than current deep models (see our studies in Section 4). One possible explanation is that it is unreasonable to assume that all possible occlusion patterns can be observed during training, because of their sheer number and variability. Hence, a major difference between computer vision and other machine learning tasks is that in computer vision we cannot assume that the training and test data are sampled from the same underlying distribution. Thus, when deployed in the real world, a vision system must generalize well beyond the training data. For example, it should be able to recognize objects robustly in previously unseen illumination conditions (daylight vs dawn), poses (walking vs yoga) or partial occlusions. Prominent examples of vision systems failing to achieve this kind of generalization include fatal accidents caused by driver-assistance systems classifying a truck in an unusual pose as sky or failing to recognize a human that was partially occluded by a bicycle. In this work, we address the task of classifying objects under partial occlusion. We propose a compositional model that can reason about partial occlusion, and hence is able to recognize partially occluded objects even when it has not been exposed to partial occlusion during training. Furthermore, we combine compositional models with a deep neural network into a model that is highly discriminative while also being robust to partial occlusion.
Deep convolutional neural networks (DCNNs) are powerful discriminative models that yield impressive results at object classification [12, 15, 7]. However, recent work has shown that DCNNs do not generalize well when objects are partially occluded [18, 25] and when they are exposed to mask attacks - adversarial examples where parts of the image are masked out (see also our experiments in Section 4). In contrast to deep models, compositional models have been shown to be robust to partial occlusion [6, 11], even if they have not seen partially occluded objects during training [18, 23]. Compositional models explicitly represent an object in terms of parts and their spatial composition into a whole. The key benefit of such a compositional representation is two-fold: 1) It makes it possible to introduce an occlusion model that deactivates parts of the model if they do not fit the data (i.e. if they are occluded by another object). 2) The model can potentially explain its classification result in terms of where it has detected an object's individual parts, as well as where the object is occluded. However, the major limitation of compositional models is that they lack the discriminative ability of deep learning approaches, because they are optimized for modeling the whole data distribution and not for discriminating between individual samples. In this work, we propose to combine deep networks with compositional models in order to get the best of both worlds: a highly discriminative model that is robust to partial occlusion and mask attacks. We make the following contributions in this paper:
Learning compositional models from DCNN features. In contrast to previous work, which learns compositional models from the image pixels directly, we propose to learn them from DCNN features that are robust to nuisances such as illumination, background clutter and non-rigid deformations of parts. This enables us to represent complex objects in natural scenes, which is difficult to achieve with related approaches.
Generalization of compositional models to 3D objects. We propose to model 3D objects with mixtures of compositional models, where each mixture component represents a particular viewpoint or 3D structure of an object. Our experiments show that mixture models are superior in terms of classification performance compared to single compositional models.
Combining compositional models and deep networks. We propose to combine deep networks with compositional models into a model that retains high discriminative performance for non-occluded objects, while also being able to generalize well beyond what it has seen at training time in terms of partial occlusion. In our experiments, the proposed model outperforms a standard DCNN in absolute classification performance at classifying partially occluded objects on the PASCAL3D+ dataset and on MNIST digits.
2 Related Work
Classification under partial occlusion. In the context of deep learning, Fawzi and Frossard have shown that DCNNs are not robust to partial occlusion generated by masking out patches of the input image. In contrast to DCNNs, compositional models have been shown to be robust to partial occlusion. In particular, they have been successfully applied for detecting partially occluded object parts [18, 23] and for recognizing simple 2D shapes under partial occlusion [6, 9, 11]. In this work, we propose a compositional model that can robustly classify 3D objects in natural scenes under strong partial occlusion.
Compositional object models. Related works on compositional models for object classification [8, 26, 5, 2, 10] have proposed to learn the model parameters directly from image pixels. The major challenge for these approaches is that their models need to explicitly account for nuisances such as illumination and object deformation in order to be robust to them. In this work, we propose to learn compositional models from the features of a DCNN. DCNN features at higher layers of the network have been shown to be robust w.r.t. variation in the illumination, shape and appearance of an object [24, 19, 18]. Hence, learning compositional models in terms of DCNN features instead of image pixels enables us to represent complex objects in natural scenes, without needing to model the underlying physical processes of the nuisances.
Combining compositional models and DCNNs. Liao et al. propose to integrate the principles of compositionality into DCNNs by using a regularizer that encourages the feature representations of DCNNs to cluster during learning. They show that the resulting feature clusters resemble part detectors. Zhang et al. show that part detectors can be encouraged in DCNNs by restricting the activations in feature maps to have a localized distribution. While these approaches have increased the explainability of DCNN predictions, they have not been shown to enhance robustness to partial occlusion. Related approaches propose to regularize the convolution filters to be sparse, or to enforce the activations in the feature maps to be disentangled for different objects. The key limitation of these approaches is that the compositional model is not explicit, but rather implicitly encoded within the parameters of the neural network. Thus, the resulting models remain black-box CNNs that are not robust to partial occlusion. In our proposed model the compositional model is explicit. Hence, it can be augmented with an occlusion model and become robust to partial occlusion, while also being able to provide explanations of its predictions in terms of where it perceives an object's parts and where it thinks the object is occluded.
3 A Robust Model Combining Deep Networks and Compositional Models
In this section, we discuss how to combine compositional models and deep networks. We present a dictionary-based compositional model including details of how the parameters of the model can be learned from data in Section 3.1. In Section 3.2, we discuss how the compositional model can be made robust to partial occlusion. Finally, we discuss how a compositional model can be combined with a DCNN in Section 3.3.
3.1 A Dictionary-Based Compositional Model of DCNN Features
Our long-term goal is to learn a generative model of the DCNN features for an object class , but we make simplifications (see next paragraph). We define a feature map to be the output of a layer in a CNN. A feature vector is the vector of features in at position , where is defined on the 2D lattice of the feature map and is the number of channels in the layer. Note that the spatial information from the image is preserved in the feature maps, thus a position on corresponds to a patch in the image. We omit the subscript in the remainder of this section because the layer from which the features are extracted is fixed in our model (e.g. for the layer ).
Learning dictionaries of DCNN features. Modeling is difficult because the feature maps are high dimensional and real valued. We propose to encode the feature maps with a dictionary that is learned by clustering the vectors from the feature maps of all training images. We follow related work on learning dictionaries of DCNN features and use k-means for clustering [19, 18, 14]. In Figure 3, we illustrate some components of the learned dictionary by showing image patches that strongly activate these components. As previously observed in [19, 18], the dictionary components activate image patches that are similar in appearance and often even share semantic meanings. Note that the patches resemble image patterns that frequently re-occur for a particular class of images (e.g. Figure 2(a) & 2(b) for the class airplane). Therefore, we refer to the components as parts.
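As an illustrative sketch of this step, the dictionary can be learned with a plain k-means over L2-normalised feature vectors (the function name, the use of dot products as cosine similarity, and all parameter values are our assumptions; the paper does not specify an implementation):

```python
import numpy as np

def learn_part_dictionary(feature_maps, num_parts, iters=20, seed=0):
    """Cluster DCNN feature vectors from all training images into a part
    dictionary with plain k-means. Vectors are L2-normalised so that the
    dot product equals cosine similarity."""
    rng = np.random.default_rng(seed)
    # Flatten each (H, W, C) feature map into rows of one (N*H*W, C) matrix.
    vecs = np.concatenate([f.reshape(-1, f.shape[-1]) for f in feature_maps])
    vecs = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8)
    # Initialise centroids from randomly chosen feature vectors.
    centroids = vecs[rng.choice(len(vecs), num_parts, replace=False)]
    for _ in range(iters):
        # Assign each vector to the centroid with maximal cosine similarity.
        assign = (vecs @ centroids.T).argmax(axis=1)
        for k in range(num_parts):
            members = vecs[assign == k]
            if len(members):
                c = members.mean(axis=0)
                centroids[k] = c / (np.linalg.norm(c) + 1e-8)
    return centroids
```

Each returned row is one dictionary component ("part"); visualising the image patches whose feature vectors fall into a cluster reproduces the qualitative inspection described above.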
Learning the spatial activation patterns of parts. We encode the real valued feature vectors with a sparse binary vector by detecting the nearest neighbors of in the learned part dictionary using the cosine distance . Hence, the element if . Intuitively, encodes which parts of the dictionary are detected at position in the feature map . Therefore, we refer to the resulting binary matrix as a part detection map. We found that a threshold of makes the encoding sparse, while at least one component remains active at every position in . We define a generative model of the part detection map as a Bernoulli distribution:
where is the probability that the part is active at position for the object class , and thus . Note that parts are assumed to be independently distributed, which makes our model similar in spirit to bag-of-words models. However, the important difference is that the spatial positions of the part detections are preserved in our model, hence capturing the spatial structure of the object.
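The binary encoding and the Bernoulli likelihood described above can be sketched as follows (the threshold value, the rule that keeps the best-matching part active at every position, and all names are illustrative assumptions):

```python
import numpy as np

def part_detection_map(feature_map, dictionary, threshold=0.45):
    """Encode an (H, W, C) feature map as a binary (H, W, K) part detection
    map: entry (i, j, k) is True when the cosine distance between the
    feature vector at (i, j) and dictionary component k is below threshold."""
    H, W, C = feature_map.shape
    vecs = feature_map.reshape(-1, C)
    vecs = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8)
    sims = vecs @ dictionary.T                 # cosine similarity (unit-norm dictionary)
    detections = (1.0 - sims) < threshold      # cosine distance below threshold
    # Ensure at least one component is active at every position, as in the text.
    detections[np.arange(len(vecs)), sims.argmax(axis=1)] = True
    return detections.reshape(H, W, -1)

def bernoulli_log_likelihood(detection_map, lam, eps=1e-6):
    """log p(P | y) under independent Bernoulli parts with class-specific
    activation probabilities lam of the same (H, W, K) shape."""
    lam = np.clip(lam, eps, 1.0 - eps)
    return float(np.sum(detection_map * np.log(lam)
                        + (1 - detection_map) * np.log(1.0 - lam)))
```

The class-specific probabilities can then be estimated by averaging the detection maps of all training images of a class, which is the maximum likelihood estimate for independent Bernoulli variables.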
Mixture of compositional models. Using the compositional model in Equation 1, we can represent 2D objects (e.g. MNIST) as a spatial composition of part detections. However, we are not able to represent 3D objects well (see results in Section 4.2). The reason is that, due to the independence assumption between parts in Equation 1, the model assumes that the spatial distribution of parts is approximately the same across images. This assumption does not hold for 3D objects because, e.g., changing the 3D pose of an object strongly changes the relative spatial distribution of parts (e.g. the locations of the tires of a car in the image change between a side view and a frontal view). In order to resolve this problem, we introduce mixtures of compositional models:
The intuition is that each mixture component will represent images of an object that have approximately the same spatial part distribution (i.e. similar viewpoint and 3D structure). We learn the parameters of the Bernoulli distributions as well as the mixture assignment variables using maximum likelihood estimation, while alternating between estimating and . This approach essentially assumes that the variability of part detection maps within each mixture component is smaller than between the mixture components. To initialize the mixture assignments, we use spectral clustering with the Hamming distance of the part detection maps of all training images. The intuition is that objects with a similar viewpoint and 3D structure will have similar part activation patterns, and thus should be assigned to the same mixture component. Figure 4 illustrates the resulting cluster assignment after ten iterations with clusters for different objects. Note that objects with different viewpoints and spatial structure (e.g. tandems) are approximately separated into different clusters.
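The alternating estimation can be sketched as a hard-EM procedure (for brevity, a random initialisation stands in for the spectral-clustering initialisation described above; all names and parameter values are our assumptions):

```python
import numpy as np

def learn_mixture(detect_maps, num_mix, iters=10, eps=1e-3, seed=0):
    """Hard-EM sketch for a mixture of Bernoulli part-distribution models.
    detect_maps: (N, H, W, K) binary part detection maps of one class.
    Alternates between estimating Bernoulli parameters per mixture (M-step)
    and re-assigning each image to its best-scoring mixture (E-step)."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, num_mix, len(detect_maps))
    for _ in range(iters):
        # M-step: per-mixture Bernoulli parameters = mean detection frequency.
        lams = np.stack([
            detect_maps[assign == m].mean(axis=0) if np.any(assign == m)
            else detect_maps.mean(axis=0)      # fall back if a mixture empties
            for m in range(num_mix)
        ]).clip(eps, 1.0 - eps)
        # E-step: hard-assign each map to the mixture with highest log-likelihood.
        ll = np.stack([
            (detect_maps * np.log(l)
             + (1 - detect_maps) * np.log(1.0 - l)).sum(axis=(1, 2, 3))
            for l in lams
        ])
        assign = ll.argmax(axis=0)
    return lams, assign
```

As in the text, the procedure works because detection maps with a similar spatial part distribution score highly under the same mixture component and therefore end up clustered together.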
3.2 Augmenting the Compositional Model with an Occlusion Model
In natural images, objects are surrounded and partially occluded by other objects. Partial occlusion of an object will change the part activation patterns such that parts may be missing and other parts might be active at previously unseen locations. The compositional model as described in Equation 1 does not take this into account and thus will be distorted by partial occlusion (see experiments in Section 4.2). However, modeling all of these “other objects” explicitly is computationally infeasible, because of their sheer number and variability. Hence, a common approach is to use an occlusion model, where occluders are collectively modeled as locally independent clutter. The intuition behind an occlusion model is that at each position in the image either the object model or a background model is active:
where . The binary variable indicates if the object is visible at position . The occlusion prior could be learned or alternatively be set manually (see Section 4). The background model is defined as: . Here we assume that the background model is independent of the position in the image and thus has no spatial structure. We estimate the background model by sampling part detection vectors from a set of background images that do not contain any of the objects of interest. The maximum likelihood estimate of the occlusion variables can be computed efficiently due to the independence assumption in the occlusion model (Equation 3). Figure 6 illustrates the positive values of the log-likelihood ratio between the foreground and background model. Note that the model can localize the occluder well.
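A minimal sketch of this occlusion model, assuming a position-independent occlusion prior and the Bernoulli foreground/background models defined above (the prior value and all names are illustrative):

```python
import numpy as np

def occlusion_aware_log_likelihood(detection_map, lam, beta, prior=0.5, eps=1e-6):
    """Per-position choice between an object model (Bernoulli params lam,
    shape (H, W, K)) and a position-independent background model beta
    (shape (K,)). Returns the total log-likelihood and the ML estimate of
    the binary occlusion variables (True = explained by background)."""
    lam = np.clip(lam, eps, 1.0 - eps)
    beta = np.clip(beta, eps, 1.0 - eps)
    # Per-position Bernoulli log-likelihoods under each model.
    fg = (detection_map * np.log(lam)
          + (1 - detection_map) * np.log(1.0 - lam)).sum(-1)
    bg = (detection_map * np.log(beta)
          + (1 - detection_map) * np.log(1.0 - beta)).sum(-1)
    fg = fg + np.log(1.0 - prior)   # prior on the position being visible
    bg = bg + np.log(prior)         # prior on the position being occluded
    occluded = bg > fg              # ML estimate of the occlusion variables
    return float(np.where(occluded, bg, fg).sum()), occluded
```

Because the occlusion variables are independent across positions, the maximisation decomposes into a per-position comparison, which is why the estimate is cheap to compute.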
3.3 Combining Compositional Models and DCNNs
We combine the compositional model with the DCNN by first classifying an input image with both branches:
Our experiments show that the branches have complementary strengths and limitations. While the DCNN is highly discriminative for non-occluded objects, it performs poorly at classifying partially occluded objects, and vice-versa for the compositional model. Therefore, we combine both predictions into a final classification that retains the strengths of both branches, by using the DCNN prediction when its confidence exceeds a threshold and the prediction of the compositional model otherwise. The intuition is that if the DCNN is uncertain about its prediction (i.e. its confidence is low), then the input image is likely to be misclassified (e.g. due to occlusion) and hence should rather be classified by the compositional model. Our experiments demonstrate that this approach successfully combines the complementary strengths of both branches.
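The combination rule can be sketched as follows (the threshold value is an illustrative assumption; the actual value used is given in Section 4):

```python
import numpy as np

def combined_prediction(dcnn_probs, comp_scores, threshold=0.7):
    """Route a sample to the compositional branch when the DCNN's softmax
    confidence falls below a threshold.
    dcnn_probs:  softmax output of the DCNN, shape (num_classes,)
    comp_scores: per-class log-likelihoods of the compositional model."""
    if dcnn_probs.max() >= threshold:
        return int(np.argmax(dcnn_probs))   # DCNN is confident: trust it
    return int(np.argmax(comp_scores))      # uncertain: fall back to comp. model
```

Note that the rule never averages the two branches; it only decides which one to trust, so the DCNN's accuracy on non-occluded objects is left untouched.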
4 Experiments
We evaluate our model at the task of object classification on partially occluded MNIST digits and vehicles from the PASCAL3D+ dataset. We simulate partial occlusion (Figure 5) by masking out patches in the images and filling them with random noise, textures, or constant white color. For the PASCAL3D+ vehicles we additionally use the images provided in the VehicleSemanticPart dataset, where partial occlusion was simulated by superimposing segmented objects over the target object (Figure 4(b)). Note that the objects used to simulate partial occlusion are different from the objects that the model has to discriminate. We define different occlusion levels that correspond to increasing amounts of occlusion, based on the object segmentation masks provided in the PASCAL3D+ dataset as well as threshold segmentations of the MNIST digits. We quantify how recognizable the occluded objects are by reporting the average performance of five subjects that were asked to perform every type of experiment in Table 2 (total of human classifications).
Training details and parameter settings. We train and evaluate our models on the standard train/test splits as defined in the respective datasets. For the PASCAL3D+ data we follow the setup as proposed in . Thus, the task is to discriminate between 12 objects during training, while at test time the six vehicle categories are tested. Unless stated otherwise, the models are trained on non-occluded objects, while at test time they are exposed to objects with different levels of partial occlusion. The DCNN has a VGG-16 architecture and was pre-trained for object classification on the ImageNet dataset. For training the compositional model, all images are resized such that their short edge has a size of pixels. We extract the features from the layer of the DCNN. The mixture models have components. We learn dictionary components for each object class, thus the dictionary has components for the MNIST dataset and components for the PASCAL3D+ dataset. We learn a background model for each of the four types of occluders and use a threshold of for the combination of the two branches. For experiments including an occlusion model, we use a prior of that is the same for all positions .
|PASCAL3D+ Classification under Occlusion|
|Occ. Area|0%|Level-1: 20-40%|Level-2: 40-60%|Level-3: 60-80%|Mean|
|MNIST Classification under Occlusion|
|Occ. Area|0%|Level-1: 20-40%|Level-2: 40-60%|Level-3: 60-80%|Mean|
|Training with Occlusion Bias on MNIST 20-40%|
4.1 DCNNs Do Not Generalize Well Under Partial Occlusion
The classification results in Table 2 show that the VGG network does not generalize well under partial occlusion when it was not exposed to partially occluded objects during training. For the PASCAL3D+ data, the DCNN achieves good performance for non-occluded objects and level-1 mask attacks, while for stronger levels of occlusion the performance drops by more than . Note that for natural occluders the performance decrease is much larger at level-1 and level-2 compared to mask attacks.
In large-scale datasets, we can expect that some amount of partial occlusion will be present in the data. However, it is well known that the variability in large datasets is often biased. Thus, the location of the partial occlusions might also be affected by dataset bias. We simulate this by training the DCNN on a combination of non-occluded MNIST images and images where the occluders occur only in the right half of the image (VGG_R), while at test time they can occur all over the image. The classification results in Table 2 show that the DCNN can classify partially occluded objects well when the partial occlusion occurs at locations it has observed during training (Right-Half). However, it cannot generalize well when the object is occluded at previously unseen spatial positions (Left-Half). We simulate an even more severe bias by restricting the occluders to also have a biased appearance (white masks only) in addition to a biased location (VGG_R_W). We observe that the performance drops for previously unseen appearances (noise and textures) at all locations in the image, while it increases for occluders with the same appearance at previously unseen positions (white masks in the left half). Hence, we observe a complex relation between biases in the training data and the classification performance that demands further study.
Overall, our experiments show that DCNNs do not generalize well to previously unseen partial occlusion. However, it is important for computer vision systems to generalize beyond the training data in terms of partial occlusion, because in real-world applications computer vision systems are almost always exposed to dataset bias in terms of partial occlusions.
4.2 The Proposed Model Classifies Partially Occluded Objects Robustly
PASCAL3D+. The results in Table 2 show that our proposed combination of compositional models and DCNNs outperforms the VGG network at classifying partially occluded objects for all levels and all types of occlusion, while retaining comparable performance for non-occluded objects. For level-1 mask attacks the performance of VGG and our combined model (CompOccMix+VGG) is comparable, while the difference becomes more prominent for level-2 and level-3 attacks, with a mean absolute performance gain of and respectively. The absolute performance gain is even more prominent if the occluders are real objects (level-1: ; level-2: ; level-3: ). Note that while our proposed model has not been exposed to partial occlusion at training time, it is still able to classify partially occluded objects with exceptional accuracy.
MNIST. For the MNIST data we can observe similar generalization patterns as we have observed for PASCAL3D+. Our model is able to classify the partially occluded digits better than the VGG network, with a mean absolute performance gain of for level-1, for level-2 and for level-3 occlusions. Additionally, when the occlusions during training have a bias in the spatial positions and/or the appearance, our model generalizes much better to previously unseen partial occlusions than the VGG network (Table 2). Interestingly, the mixture of compositional models (CompOccMix) also provides a performance increase for the two dimensional MNIST digits compared to a single compositional model (CompOcc). In Figure 4, we show that each mixture focuses on a particular writing style of a digit, suggesting that it can better approximate the distribution of handwritten digits and hence is able to better discriminate between them.
In summary, we observe that a combination of compositional models and DCNNs generalizes much better to previously unseen data in terms of partial occlusion compared to using a standard DCNN only, while having comparable performance on data that is similarly distributed as the one observed during training.
Ablation study. Table 2 contains a series of ablation experiments on the PASCAL3D+ data. On average, single compositional models (Comp) as well as mixtures of compositional models (CompMix) perform as well as a DCNN. While they perform worse for images without occlusion and for level-1 occlusions, they are better for level-2 and level-3 occlusions compared to the DCNN. Hence, we can clearly observe the complementary strengths and weaknesses of both types of models. When augmented with an occlusion model (CompOcc and CompMixOcc), the compositional models clearly outperform VGG in absolute performance by and respectively. Note that the mixture of compositional models is superior to a single compositional model when both are augmented with an occlusion model. The combination of the VGG branch and the occlusion-aware mixture (CompOccMix+VGG) improves the performance for all experiments on partially occluded objects, while retaining comparable performance to the VGG model for non-occluded objects. Note the mutual benefit of integrating the two branches, which improves the performance compared to each individual branch.
Explainability. An inherent property of compositional models is that they can explain their predictions in terms of where they perceive which object parts and where they think the object is occluded. We illustrate this property in Figure 6. For several test images, we show the five parts that the compositional model has detected with the highest likelihood (left) and some examples of image patches from the training images which activate the part model most (center). Using these visualizations, the compositional model can provide an intuitive explanation of why it perceives a certain object in the input image.
5 Conclusion
Our extensive experimental results demonstrate that DCNNs cannot recognize partially occluded objects well if they have not been exposed to partial occlusion during training. Even if they have been exposed to severe occlusion during training, they do not generalize well when the spatial distribution or the appearance of the occluders is biased. In order to resolve these fundamental limitations, we have proposed to combine compositional models and DCNNs. In this context, we made the following contributions:
Learning of compositional models from DCNN features. Previous work focused on learning compositional models from plain image pixels, which requires modeling of complex physical processes such as local deformation or illumination. DCNN features are robust to such nuisances. Hence, learning compositional models from DCNN features enables us to represent complex objects in natural scenes, which is difficult to achieve with related approaches.
Generalizing compositional models to 3D objects. We propose to use mixtures of compositional models for representing 3D objects. Our experimental results show that mixtures outperform single compositional models at object classification.
Combining compositional models and deep networks. We combine compositional models and DCNNs and demonstrate that the combined model outperforms a standard deep network in absolute classification performance at object classification under partial occlusion on MNIST digits and on objects from the PASCAL3D+ dataset.
References
-  (2017) Tesla Crash Preliminary Evaluation Report. Technical report, U.S. Department of Transportation, National Highway Traffic Safety Administration. Cited by: §1.
-  (2014) Unsupervised learning of dictionaries of hierarchical compositional models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2505–2512. Cited by: §2.
-  (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.
-  (2016) Measuring the effect of nuisance variables on classifiers. Technical report Cited by: §1, §2.
-  (2014) Learning a hierarchical compositional shape vocabulary for multi-class object representation. arXiv preprint arXiv:1408.5516. Cited by: §2.
-  (2017) A generative vision model that trains with high data efficiency and breaks text-based captchas. Science 358 (6368), pp. eaag2612. Cited by: §1, §2.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
-  (2006) Context and hierarchy in a probabilistic image model. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 2145–2152. Cited by: §2.
-  (2016) Probabilistic compositional active basis models for robust pattern recognition.. In BMVC, Cited by: §2.
-  (2017) Greedy structure learning of hierarchical compositional models. arXiv preprint arXiv:1701.06171. Cited by: §2.
-  (2017) Model-based image analysis for forensic shoe print recognition. Ph.D. Thesis, University of Basel. Cited by: §1, §2, §3.2.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
-  The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §4.
-  (2016) Learning deep parsimonious representations. In Advances in Neural Information Processing Systems, pp. 5076–5084. Cited by: §2, §3.1.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §4.
-  (2017) Teaching compositionality to cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5058–5067. Cited by: §2.
-  (2016) Towards deep compositional networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 3470–3475. Cited by: §2.
-  (2017) Detecting semantic parts on partially occluded objects. British Machine Vision Conference. Cited by: §1, §2, §2, §3.1.
-  (2015) Unsupervised learning of object semantic parts from internal states of cnns by population encoding. arXiv preprint arXiv:1511.06855. Cited by: §2, §3.1, §4, §4.
-  (2018) Why Uber’s self-driving car killed a pedestrian. The Economist. Note: https://www.economist.com/the-economist-explains/2018/05/29/why-ubers-self-driving-car-killed-a-pedestrian Accessed: 2019-05-21. Cited by: §1.
-  (2014) Beyond pascal: a benchmark for 3d object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision, pp. 75–82. Cited by: §4.
-  (2018) Interpretable convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8827–8836. Cited by: §2.
-  (2018) DeepVoting: a robust and explainable deep network for semantic part detection under partial occlusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1372–1380. Cited by: §1, §2.
-  (2014) Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856. Cited by: §2.
-  (2019) Robustness of object recognition under extreme occlusion in humans and computational models. Cited by: §1, §1.
-  (2008) Unsupervised structure learning: hierarchical recursive composition, suspicious coincidence and competitive exclusion. In Computer vision–eccv 2008, pp. 759–773. Cited by: §2.