Multilayer Dense Connections for Hierarchical Concept Classification


Classification is a pivotal function for many computer vision tasks such as object classification, detection, and scene segmentation. Multinomial logistic regression with a single final layer of dense connections has become the ubiquitous technique for CNN-based classification. While these classifiers learn a mapping between the input and a set of output category classes, they do not typically learn comprehensive knowledge about the category. In particular, when a CNN-based image classifier correctly identifies the image of a Chimpanzee, it does not know that it is a member of the Primate, Mammal and Chordate families and a living thing. We propose a multilayer dense connectivity for a CNN to simultaneously predict the category and its conceptual superclasses in hierarchical order. We experimentally demonstrate that our proposed dense connections, in conjunction with popular convolutional feature layers, can learn to predict the conceptual classes with minimal increase in network size while maintaining the categorical classification accuracy.

1 Introduction

Classification is a core concept for numerous computer vision tasks. Early works on Convolutional Neural Networks (CNN) [26, 34] revolutionized image classification techniques. In general, modern CNN architectures have roughly two components in their designs: 1) the feature computation layers that compute the convolutional maps at successive scales, and 2) the classification layers that categorize the target into multiple classes. Given the convolutional features, different architectures classify either the image itself [34, 20, 35], the region/bounding boxes for object detection [31, 28, 19, 29], or, at the granular level, pixels for scene segmentation [6, 7, 46]. Although early image recognition works employed multilayer classification layers [26, 34], more recent models have all used single-layer dense connections [20, 21, 36, 35] or convolutions [31, 29].

The vision community has invented a multitude of techniques to enhance the capacity of feature computation layers [20, 21, 42, 36, 35, 25, 24, 9, 8, 44, 22, 33, 37]. But the classification layer has mostly retained the form of a multinomial/softmax logistic regression performing a mapping from a set of inputs (images) to a set of categories/labels. As such, the networks themselves do not acquire comprehensive knowledge about the input entity. In particular, when an existing CNN correctly identifies an image of an English Setter, the network itself does not learn that it is an instance of a dog, or more precisely, a hunting dog, which is also a domestic animal and, above all, a living thing. Acquiring such knowledge is essential for cognitive empowerment of CNNs. From a broader perspective, it can be argued that encoding exhaustive information about object categories, and building upon it, is crucial for automation of intelligent systems.

Figure 1: The goal of the proposed algorithm. In contrast to the existing methods, our proposed CNN architecture predicts the chain of superclass concepts in addition to the final class category for each input.

Extensive information about most categories is freely available in repositories such as WordNet [16]. WordNet provides the hierarchical organization of category classes (e.g., English Setter) and their conceptual superclasses (e.g., Hunting dog, Domestic animal, Living thing). However, a surprisingly limited number of CNNs utilize the concept hierarchy. The primary goal of all existing studies is to improve the category-wise classification performance by exploiting the conceptual relations, often via a separate tool [12, 18, 23].

Deng et al. [12] apply a CRF to capture the interdependence among concept labels to improve category classification accuracy. Guo et al. [18] exploit a CNN-RNN architecture to model the hierarchical order of concepts, but it is not clear how the hierarchy was generated and how the method scales with the size of the label set (e.g., the entire ImageNet12 dataset). We have not found an existing work that attempts to predict an elaborate chain of ancestor concepts for an input image by a single network and reports performance on both concept and category classification.

In this paper, we introduce a CNN to classify the category and the concept superclasses simultaneously. As illustrated in Figure 1, in order to classify any category class (e.g., English Setter), our model is constrained to also predict the ancestor superclasses (e.g., Hunting dog, Domestic animal, Living thing) in the same order as defined in a given ontology. We propose a configuration of multilayer dense connections to predict the category & concept superclasses as well as model their interrelations based on the ontology. We also propose a simple method to prune and rearrange the label hierarchy for efficient connectivity.

Capturing the hierarchical relationship within the CNN architecture itself enables us to train the model end-to-end (as opposed to attaching a separate tool) by applying existing optimization strategies for training deep networks. We experimentally demonstrate that one can train the proposed architecture using standard optimization protocols to predict the concept classes with two popular CNN backbones, ResNet [20, 21] and InceptionV4 [35] while maintaining their category-wise accuracy. Perhaps more importantly, our model paves the way for learning the multilayer dense configurations capturing the label relationships from data using recent techniques for architecture learning [47, 30, 32].

As an image classifier, our proposed model can offer significantly more information to any subsequent downstream computation such as scene understanding [5, 40] at the cost of a small increase in network size. In Section 6, we elaborate how our proposed approach is more elegant and far more advantageous than a look-up strategy from a hard-coded ontology (i.e., an oracle for concepts given the category). We further discuss how our architecture can be extended to object detectors, both two-stage [19, 28] and SSD [31, 29], to compute the concept classes. In addition, we allude to a potential application of our model to capture label structures different from a concept graph, e.g., spatial or compositional dependence.

2 Relevant works

Researchers have long believed in the importance of information embedded in the hierarchical label structure. Use of hierarchical classifiers can be traced back to the early works of [38, 41] and later to [17], which shared features for improved classification. Some have claimed such a hierarchical organization of categories resembles how the human cognitive system stores knowledge [45], while others experimentally showed a correspondence between the structure of the semantic hierarchy in WordNet [16] and visual confusion between categories [11]. Later studies attempted to learn a label tree for efficient inference with low theoretical complexity [3, 13]. The works of [3, 13] also suggest that a hierarchical representation of categories and their ancestor concepts might be beneficial for very large datasets with tens of thousands of categories.

For CNN-based classification, Deng et al. [12] modeled relationships such as subsumption, overlap and exclusion among the categories via a CRF. Although the CRF parameters can be trained via gradient descent, the inference required a separate message passing computation to obtain the probability of the concept classes. The work of [14] extended this model by utilizing probabilistic label relationships. A more recent paper by Brust et al. [4] also applies probabilistic modeling of the relationship among categories.

In their paper, Guo et al. [18] attempted to classify the coarse labels or the conceptual superclasses of categories by augmenting an RNN to CNN output. Given the category-wise classification from a CNN, the recurrent layers identify the coarser labels. In addition to increased complexity imposed by the RNN, it is not clear how the hierarchy among labels was generated and how the hierarchy would scale up with increasing number of categories. The HDCNN framework [43] divides the image categories into coarse and fine labels. An example coarse category Aquatic animals is the superset of finer categories White shark, Hammerhead, Stingray etc. The framework comprises two modules for identifying coarse and fine categories where the coarse prediction modulates the layers for finer classification.

Hu et al. [23] discuss a hierarchical representation of coarse to fine scene context. In their structured inference model, the concepts representing scene attributes are predicted as indicator vectors of different lengths, and a bidirectional message passing, inspired by bidirectional recurrent networks, establishes the relations among different levels of concepts. The model leads to a large number of inter- and intra-layer label interactions, some of which needed to be manually hard-coded to 0. In a similar line of thought, Liang et al. [27] employed graph-based reinforcement learning to learn features of a set of modules, each of which corresponds to a concept. Once the network search is completed, the activated module outputs are passed through prediction layers to generate the final output. The algorithm demonstrated promising performance for classifying scene contexts for semantic segmentation.

Unlike us, the primary objective of all the aforementioned papers is to improve the category prediction performance by utilizing the concept classes, along with their hierarchical dependence, as an auxiliary source of information or as intermediate result. Although some of the methods [12, 18] might be extended to classify the concept classes, none attempted to demonstrate their effectiveness for this task. Furthermore, in contrast to ours, most of these studies use a separate technique/tool for modeling the conceptual relations that need to be trained or applied separately with different mechanisms.

Commercial vision solutions such as Amazon Rekognition [2] or Google Vision [1] seem to detect objects with tags resembling concept classes. As their website [2] suggests, the Amazon Rekognition deep learning model most probably does not explicitly understand the relationship between labels, e.g., dogs and animals. As the algorithms of [1, 2] are not public, we are unable to confirm whether or not they indeed predict concept classes and their relationships.

3 Proposed Method

Given an input image, the goal of our proposed method is to determine its category and a list of its concept superclasses. As can be inferred from the text, we denote the ancestors of a category at different levels of the label hierarchy as concepts. As an example, with an image of a Chimpanzee, the proposed algorithm produces predictions for 1) the category Chimpanzee, and 2) an ordered list of ancestor concepts: Living thing → Chordate → Mammal → Primate → Chimpanzee.

Our CNN architecture is designed to encompass the chain of relationships among the category and the predecessor concepts in the dense layers. We utilize an existing label hierarchy/ontology to guide the design of the dense layers, but do not use the hierarchy in prediction. In order to maximize the information within an ontology and to reduce the number of variables in the dense layers, we condense the original label hierarchy. Section 3.1 elaborates the rationale behind, as well as the technique used for, such compression.

In our design of multilayer dense connections, each concept is associated with a set of hidden nodes. These hidden nodes are connected to the concept and category output prediction nodes. Furthermore, the hidden nodes are also connected to those associated with the children concepts at the next lower level of the condensed ontology. For example, if Mammal, Bird and Reptile are the descendant concepts of Chordate, there will be all-to-all connections from the hidden nodes representing Chordate to those accounting for Mammal, Bird and Reptile. Finally, the proposed model imposes another type of connection to enforce the hierarchical dependence among the concepts and the category nodes. The detailed design of our proposed multilayer hierarchical dense connections is described in Section 3.2. We illustrate (and experiment with) the proposed model for image classification in this paper.

Figure 2: Partial view of the original (left) and condensed (right) label hierarchies. Concepts are enclosed in rectangular boxes, with the number of all descendants in parentheses.

3.1 Condensed Concept Hierarchy

In general, a concept class decomposes to multiple sub-concepts in an ontology, e.g., the ImageNet12 [10] subset of WordNet [16]. However, the lineage of a parent to a single child (e.g., Entity Physical Entity Abstraction) is redundant and does not provide much information. Similarly, parent concepts with a highly imbalanced distribution of descendants are not informative either. Modelling the redundant and uninformative concepts will increase the network size with no information gain.

We reorganize the given ontology to reduce such redundancy. We assume the hierarchy to be a directed acyclic graph (DAG) and perform a depth first search (DFS) traversal on it. During the traversal, we first prune the label hierarchy based on the distribution of descendants of a concept node. For each concept, we count the number of all of its descendants. We dissolve any child concept whose descendant count exceeds a fixed percentage of its parent's descendant count, i.e., when more than that percentage of the descendants of the parent are also descendants of the child. This process is applied recursively to yield a balanced distribution of descendants of any concept in the resulting hierarchy.

In addition, we remove any concept in the structure whose descendant count falls below a minimum threshold and append its children to those of its parent. Conceptually, it is not worth modeling a concept node with only a few descendants.
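The two pruning rules above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the hierarchy is a tree stored as a dictionary of child lists, counts every node below a concept as a descendant, and uses the original (pre-pruning) counts throughout. The names `condense`, `max_ratio` and `min_count`, as well as the toy ontology and threshold values, are hypothetical.

```python
def count_descendants(children, node):
    """Number of all nodes (concepts and categories) below `node`."""
    return sum(1 + count_descendants(children, c) for c in children.get(node, []))


def condense(children, root, max_ratio, min_count):
    """Return a pruned copy of the hierarchy rooted at `root`.

    A child concept is dissolved (its children are attached to its parent) if
    it holds more than `max_ratio` of the parent's descendants, or if it has
    fewer than `min_count` descendants. Both thresholds are assumptions.
    """
    condensed = {}

    def visit(node):
        total = count_descendants(children, node)
        pending = list(children.get(node, []))
        kept = []
        while pending:
            child = pending.pop(0)
            grandchildren = children.get(child, [])
            if not grandchildren:
                kept.append(child)  # category (leaf): always kept
                continue
            n = count_descendants(children, child)
            if n > max_ratio * total or n < min_count:
                pending.extend(grandchildren)  # dissolve the child concept
            else:
                kept.append(child)
        condensed[node] = kept
        for child in kept:
            if children.get(child):
                visit(child)  # depth-first descent into surviving concepts

    visit(root)
    return condensed


# Toy ontology (hypothetical labels).
toy = {
    "entity": ["physical"],
    "physical": ["animal", "artifact"],
    "animal": ["dog", "cat", "bird"],
    "artifact": ["tool", "toy_group"],
    "toy_group": ["ball"],
}
result = condense(toy, "entity", max_ratio=0.8, min_count=2)
```

In this toy run, the single-child lineage through "physical" is dissolved because it covers nearly all descendants of the root, and "toy_group" is removed because it has too few descendants, which creates a direct concept-category link from "artifact" to "ball".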

We depict the differences between the original and modified ontologies in Figure 2. The intermediate concept classes are plotted in this figure in boxes along with the total number of descendants, whereas the category classes are shown in black circles. The distributions of the descendants for children concepts are more balanced in the compressed version (right) than in the original version (left). As the network connections are dependent on the concept hierarchy, this reduction of nodes and relationships in the ontology is crucial for our method. It is worth noting that the proposed modification added direct concept-category relations in the middle layers of the hierarchy.

Executing a DFS on a DAG ontology may lead to equivocal or ambiguous grouping of a few concepts and categories, as Deng et al. [12] pointed out. We adopted DFS for simplicity to automate the compression; the main contribution of this paper is the dense architecture that performs the concept classification. We can replace the abridged graph with an unambiguous one whenever it is available. One can also employ the more efficient and elegant methods proposed in [13, 3] to generate the optimal ontology graph.

3.2 Network Architecture

Our proposed algorithm aims to model the abridged label hierarchy with dense connections. As Figure 2 suggests, there are multiple kinds of dense connectivities in our proposed classification layer. Each concept in the hierarchy corresponds to one set of hidden nodes, which are connected to the hidden nodes representing its children, if any. The hidden nodes of a concept are also connected to the output prediction node for the concept itself and to those for each of its child category nodes. An additional type of connectivity constrains the concept and category predictions to follow the hierarchical organization of the ontology. Sections 3.2.1 and 3.2.2 explain the first two connection types and the constraint connections, respectively; Section 3.2.3 describes the loss we use to learn the network.

3.2.1 Modeling Concepts and Categories

Each concept is associated with an output prediction variable and a set of hidden nodes. The terms node and variable are used interchangeably in the description of our model. Consider a concept and a category that are both children of a parent concept in the hierarchy, each with its own hidden and output prediction variables. From the hidden nodes of the parent concept, the proposed model computes the output prediction of the parent concept, as well as initial values for the hidden nodes of the child concept and for the prediction of the child category, using dense connections (Equation 1).


In these equations, and are the weights/biases of the dense connectivity and corresponds to the -th value of . The activation functions utilized for these different quantities are .

Figure 3: Schematic view of proposed dense connections. The solid square and circle nodes correspond to the concept and category prediction nodes respectively, whereas the empty circles depict the hidden nodes. We assume the concept has concepts and categories as children.

In our design, the number of nodes representing a concept is directly proportional to the total number of descendants of . We have used for this study with . The flattened output of the final feature layer of an existing network architecture (e.g., ResNet-50 or InceptionV4 etc.) is utilized to populate and its size depends on the particular architecture used. We do not predict the root concept of the hierarchy (e.g., Entity for ImageNet12) since all categories descend from it, i.e., we do not predict .

3.2.2 Concept Category Label Constraints

The values for category prediction and hidden nodes for child concept are calculated by multiplying initial values of these quantities with the concept prediction .


Note that the Sigmoid activation constrains the value of the concept prediction to lie between 0 and 1. In effect, the concept prediction node plays an excitatory or inhibitory role based on its predicted value. This constraint enforces that the nodes representing any child of the concept, whether it is a category or another downstream sub-concept, be activated only if the concept itself is correctly predicted.
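The gating described above can be sketched for a single concept in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function and argument names are hypothetical, the ReLU on the child hidden nodes and the linear (pre-softmax) category scores are assumed choices, and only the Sigmoid on the concept prediction and the multiplicative gating come from the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(weights, biases, inputs):
    """Plain fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def concept_forward(h_c, w_concept, b_concept, W_child, b_child, W_cat, b_cat):
    """Forward pass for one concept with hidden nodes `h_c`.

    Returns the concept prediction, the gated hidden nodes of a child
    concept, and the gated scores of the child categories.
    """
    # Concept prediction in (0, 1) via Sigmoid.
    p_c = sigmoid(dense([w_concept], [b_concept], h_c)[0])
    # Initial child quantities (ReLU and linear activations are assumptions).
    h_child_init = [max(0.0, v) for v in dense(W_child, b_child, h_c)]
    cat_init = dense(W_cat, b_cat, h_c)
    # Constraint: multiply the initial values by the concept prediction.
    h_child = [p_c * v for v in h_child_init]
    cat_scores = [p_c * v for v in cat_init]
    return p_c, h_child, cat_scores


h = [1.0, 2.0]
p_on, h_on, s_on = concept_forward(
    h, [0.0, 0.0], 0.0, [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0], [[1.0, 1.0]], [0.0])
p_off, h_off, s_off = concept_forward(
    h, [0.0, 0.0], -50.0, [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0], [[1.0, 1.0]], [0.0])
```

With a strongly negative concept bias (the second call), the concept prediction is close to 0, so the child hidden nodes and category scores are suppressed regardless of their initial values.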

The category predictions for an input image are computed by applying Softmax activation over all N category nodes, where N is the total number of categories. The predictions for concepts are given by the collection of the concept prediction variables. The hierarchical relationship among the variables is enforced by construction. It is important to understand that, while an image can be classified to only one (e.g., Chimpanzee) of the categories, multiple concepts (e.g., Primate, Mammal, Chordate) at different levels of the hierarchy can be set to 1.

Figure 3 clarifies the proposed dense arrangement between the hidden nodes and its children concept nodes as well as the prediction outputs. The hidden nodes (shown in empty circles) are connected to those of its children concepts and to the category output variables (solid circles) to compute the initial values of the child hidden nodes and the category predictions, respectively. The concept prediction (solid square) is computed by another dense connection and modulates the final values of the child hidden nodes and the category variables via multiplication.

3.2.3 Loss Functions

The proposed method minimizes two different losses for the two types of output nodes. For the category predictions, we minimize a cross-entropy loss computed over the category labels and the network outputs. However, as more than one concept may be detected for any input, a cross-entropy loss is not suitable for the concept variables.

We introduce a binary indicator for each concept and input image that equals 1 if the concept is an ancestor of the image's category and 0 otherwise. With this indicator, the concept classification loss is defined as the MSE between the indicator values and the concept predictions.


The proposed method minimizes the combined loss. The balancing weight in the joint loss function has been fixed to in all our experiments. Note that, while the error for any category is backpropagated through its predecessor concepts due to the dependence imposed by construction, one needs to ensure that other concepts, which are not related to the category, remain inactive (i.e., are predicted as 0). This is exactly the constraint enforced by Equation 3.
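A minimal sketch of such a joint objective is given below. It is an illustration under assumptions, not the authors' formulation: the function names are hypothetical, the MSE is averaged over concepts, and the balancing weight is applied to the concept term with an arbitrary value, since neither detail is recoverable from the text.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def category_loss(cat_scores, label):
    """Cross-entropy between the softmax output and the true category index."""
    return -math.log(softmax(cat_scores)[label])


def concept_loss(concept_preds, indicators):
    """MSE between concept predictions and the 0/1 ancestor indicators."""
    return sum((p - t) ** 2 for p, t in zip(concept_preds, indicators)) / len(indicators)


def combined_loss(cat_scores, label, concept_preds, indicators, weight):
    # `weight` balances the two terms; its value here is an assumption.
    return category_loss(cat_scores, label) + weight * concept_loss(concept_preds, indicators)


perfect = combined_loss([0.0, 0.0], 0, [1.0, 0.0], [1, 0], weight=2.0)
unsure = combined_loss([0.0, 0.0], 0, [0.5, 0.5], [1, 0], weight=2.0)
```

The indicator entries equal to 0 are what penalize concepts unrelated to the category: any nonzero prediction for them adds to the MSE term.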

3.2.4 Number of Variables

This section quantifies the increase in the number of weights in the dense layers induced by our multilayer computations. CNN classifiers typically consist of connections in the dense layer, where and are the size of the last feature layer and number of category classes respectively. In the proposed multilayer dense connections with a balanced -way decomposition of the concepts, the total number of weights is where is the max layer of the hierarchy and is the fixed multiplier used to set the number of hidden nodes for concept (see Section 3.2.1).

In a balanced decomposition, the number of hidden nodes for concepts reduces by a constant factor . With the size of smallest set of hidden nodes as , the max layer of the dense connections is . For any concept prediction , we need weights at layer i.e., max number of weights for concept class prediction is . The number of concept-concept connections can be calculated as .

In order to predict a category variable , weights are necessary. Since any at any level in a balanced decomposition, the total number of weights for category class prediction must be .
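The counting argument can be illustrated on a small hierarchy. The sketch below is an assumption-laden toy, not the analytical estimate of this section: it ignores biases, sets the hidden size of a concept to a hypothetical `multiplier` times its number of category descendants, and uses the feature size for the root. All names and numbers are hypothetical.

```python
def leaf_count(children, node):
    """Number of category (leaf) descendants of `node`."""
    kids = children.get(node, [])
    if not kids:
        return 1
    return sum(leaf_count(children, k) for k in kids)


def hidden_size(children, node, root, feature_size, multiplier):
    # The root's hidden nodes are the flattened features of the backbone.
    if node == root:
        return feature_size
    return multiplier * leaf_count(children, node)


def count_weights(children, root, feature_size, multiplier):
    """Weights of the multilayer dense classifier (biases ignored)."""
    total = 0
    for concept, kids in children.items():
        if not kids:
            continue
        size = hidden_size(children, concept, root, feature_size, multiplier)
        if concept != root:
            total += size  # concept prediction node (the root is not predicted)
        for kid in kids:
            if children.get(kid):
                # all-to-all connections to the child concept's hidden nodes
                total += size * hidden_size(children, kid, root, feature_size, multiplier)
            else:
                total += size  # one category output node
    return total


toy = {"entity": ["animal", "artifact"],
       "animal": ["dog", "cat"],
       "artifact": ["car", "boat"]}
multilayer = count_weights(toy, "entity", feature_size=8, multiplier=2)
single_layer = 8 * leaf_count(toy, "entity")  # feature size times number of categories
```

In this toy case the concept-concept connections dominate the count; the ratio to the single-layer classifier depends entirely on the feature size, the multiplier and the shape of the hierarchy, so it should not be read as representative of the real networks.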

4 Experimental Setup

We utilize the ontology provided by the ImageNet12 dataset [10] to design our dense layers. All the labels of ImageNet12 between correspond to the category classes and labels are assigned to the concept superclasses. In all our experiments, we fixed the two quantities for compressing the concept hierarchy (Section 3.1) to be , . After compressing the label hierarchy using the methods described in Section 3.1, there are concept labels left in the hierarchy which has a height of .

The proposed architecture has been trained on the ImageNet12 training dataset and tested on both the ImageNet12 validation and the PASCAL VOC12 trainval datasets. As feature extraction layers, we used two popular CNN architectures, namely, ResNet-50 [20, 21] and Inception V4 [36, 35]. We downloaded the implementations for both these models from the TensorFlow website and modified the dense layers according to the proposed method. Using the same sizes of inputs as the original models, i.e., for ResNet-50 and for InceptionV4, led to and initial hidden nodes respectively.

With these configurations, the proposed model increased the total number of variables of the ResNet-50 model by a factor of (M). For the InceptionV4 model, the increase is (M). This suggests that the increase stemming from our proposed model is far lower than the analytical estimate (provided in Section 3.2.4) in practice and is tolerable with respect to the overall network size. During training, we were able to fit the proposed model with the same batch size as the original model on the same GPU memory.

We evaluated the strength of our method in two settings: 1) learn only the proposed dense layers while keeping the feature layers fixed, and 2) learn the whole network from scratch. The optimization strategies differ between these two settings and are explained in Section 5. The TensorFlow implementation of our method employs distributed training over multiple GPUs. The combined batch size for any of the above machines was kept at and we use VGG-style data augmentation [34] (i.e., no bounding box information used).

During inference with the proposed network, we select the category with the largest softmax probability for category classification as usual. One strategy to compute the chain of concept classes is to trace back the dense connections of the selected category. For generality, and to account for false positives generated by the algorithm, we do not backtrack through the dense layers and use the concept prediction values instead. In concept class predictions, we set any concept prediction to 0 if the prediction for its parent is lower than a confidence threshold. If more than one child of any concept is detected, we select the one with the highest confidence among them to compute the concept chain.
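The concept-chain decoding described above can be sketched as a top-down walk over the hierarchy. This is a minimal illustration, assuming a tree-shaped hierarchy and a dictionary of concept confidences; the function name `predict_chain`, the threshold value of 0.5 and the toy labels are hypothetical.

```python
def predict_chain(children, root, concept_prob, threshold=0.5):
    """Walk down from the root, keeping the most confident detected child.

    A concept is only considered if its parent was detected, which has the
    same effect as zeroing every concept whose parent is below the threshold.
    """
    chain = []
    node = root
    while True:
        detected = [c for c in children.get(node, [])
                    if concept_prob.get(c, 0.0) >= threshold]
        if not detected:
            return chain
        # Several detected children: keep the one with the highest confidence.
        node = max(detected, key=lambda c: concept_prob[c])
        chain.append(node)


toy = {"entity": ["living", "artifact"],
       "living": ["chordate", "plant"],
       "chordate": ["mammal"],
       "artifact": ["device"]}
probs = {"living": 0.9, "artifact": 0.6, "chordate": 0.8,
         "plant": 0.7, "mammal": 0.3, "device": 0.95}
chain = predict_chain(toy, "entity", probs)
```

Here "device" is ignored despite its high confidence because its parent "artifact" loses to "living", and the chain stops at "chordate" because "mammal" falls below the threshold.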

We report the single crop top-1 accuracy results for categories when they are available. When the proposed method correctly identifies a category class, it also classifies all its predecessor concepts by design. But there is a possibility that concepts not related to that category were misclassified as 1. Therefore, we report the percentage of images with all predicted concepts strictly equal to the concept ground truth. In addition, we also compute the combined accuracy where the method was able to correctly classify both the category and the concept chain. We omit the top-5 performance since it is not obvious how to compute it for concept recognition (or for extensions to object detection) due to the overlap in the superclass lineage.
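The three reported measures can be computed as in the sketch below. It is an illustration under assumptions: concept predictions are binarized with a hypothetical threshold of 0.5, and each sample is a hypothetical tuple of predicted category, true category, concept confidences and ground-truth indicators.

```python
def accuracies(samples, threshold=0.5):
    """Return (category, concept, combined) accuracy over `samples`.

    Each sample is (pred_category, true_category, concept_probs, indicators).
    A concept prediction counts as correct only if the binarized vector
    matches the ground-truth indicator vector exactly.
    """
    cat_hits = concept_hits = both_hits = 0
    for pred_cat, true_cat, probs, indicators in samples:
        cat_ok = pred_cat == true_cat
        binarized = [1 if p >= threshold else 0 for p in probs]
        concept_ok = binarized == list(indicators)
        cat_hits += cat_ok
        concept_hits += concept_ok
        both_hits += cat_ok and concept_ok
    n = len(samples)
    return cat_hits / n, concept_hits / n, both_hits / n


samples = [
    ("chimp", "chimp", [0.9, 0.8, 0.1], [1, 1, 0]),   # both correct
    ("gorilla", "chimp", [0.9, 0.7, 0.2], [1, 1, 0]),  # concepts only
    ("chimp", "chimp", [0.9, 0.8, 0.6], [1, 1, 0]),   # category only
    ("car", "boat", [0.1, 0.1, 0.1], [0, 0, 1]),      # neither
]
acc_cat, acc_concept, acc_combined = accuracies(samples)
```

The second sample shows the case discussed in the text: the fine category is wrong, yet the whole concept chain is still recovered.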

Since no previous study directly classifies the concept hierarchy similar to our architecture, we devise a new baseline model to compare against. For each of the architectures tested, we build a baseline CNN by predicting classes with a single dense layer. The first outputs for category classes are computed by Softmax activation and trained with the cross-entropy loss. The following outputs computed with Sigmoid activation represent the concept predictions and are trained with the same concept loss formulated in Equation 3.
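A minimal sketch of such a baseline head is shown below, assuming a single dense layer whose outputs are split into a softmax group for categories and independent sigmoid outputs for concepts. The function name, argument names and toy sizes are hypothetical.

```python
import math


def baseline_head(features, weights, biases, num_categories):
    """Single dense layer: first `num_categories` outputs are categories
    (softmax), the remaining outputs are concepts (sigmoid)."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    cat_logits = logits[:num_categories]
    m = max(cat_logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in cat_logits]
    z = sum(exps)
    cat_probs = [e / z for e in exps]
    concept_probs = [1.0 / (1.0 + math.exp(-v)) for v in logits[num_categories:]]
    return cat_probs, concept_probs


# Two categories and one concept on a one-dimensional feature (toy sizes).
cat_probs, concept_probs = baseline_head(
    [1.0], [[0.0], [0.0], [2.0]], [0.0, 0.0, 0.0], num_categories=2)
```

Unlike the multilayer design, nothing in this head ties a concept output to the categories or to other concepts; any consistency has to be learned from the losses alone.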

5 Results

5.1 Learning only Dense Layers

The first experiment demonstrates the effect on classification accuracy if one wants to replace the classification layer of an existing CNN with the proposed multilayer dense classifier. In this section, we report performances of the CNNs after training only the weights in the multilayer dense connections (as noted in Equations 1) of the proposed network and the weights of the baseline structure. All other weights were fixed at the values of the pretrained models of ResNet50 and InceptionV4.

Training: We used the RMSProp optimizer with momentum for this experiment similar to [35, 8]. The initial learning rate for this experiment was and was multiplied by 0.94 every 2 epochs. The epoch size is the same as the training set size, momentum value , and weight decay . For proposed network learning, the weights corresponding to the concept outputs and the concept interconnections were learned for the first 2 epochs before optimizing those for the category variables. We did not use label smoothing for training the Inception v4 model. The training was continued until the CNN achieved the same or close accuracy for category classification as reported in the original paper/repository.
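The staircase decay described above (a factor of 0.94 every 2 epochs) can be written as a one-line schedule. The sketch is illustrative only: the function name is hypothetical and the initial learning rate of 0.1 is a placeholder, not the value used in the experiments.

```python
def staircase_lr(initial_lr, epoch, decay=0.94, epochs_per_step=2):
    """Learning rate after `epoch` completed epochs (0-based)."""
    return initial_lr * decay ** (epoch // epochs_per_step)


# Placeholder initial rate; the decay factor and step length come from the text.
rates = [staircase_lr(0.1, e) for e in range(6)]
```

The rate stays constant within each 2-epoch window and drops by 6% at the start of the next one.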

Evaluation: Table 1 reports the accuracies of category, concept and combined classification. As the table shows, both ResNet50 and InceptionV4 were able to match the classification performance for category classes with the baseline architecture. Provided with fixed feature layers, it seems the single-layer dense connections of the baseline model were not sufficient to capture the dependence among concepts and categories. By enforcing a constraint to predict the chain of concept classes in order, the proposed multilayer structure improved the combined accuracy by roughly 14 to 16 points for both backbones (Table 1).

The results imply that the Inception V4 feature architecture is more informative than that of ResNet 50 for concept learning. More importantly, the proposed method is considerably better at detecting the order of conceptual superclasses than at identifying category classes. That is, our model can correctly identify the concept chain for many images whose finer category class was misclassified. In contrast, the concept accuracies of the baseline methods were substantially lower than their category accuracies.

Model                  Category acc.   Concept acc.   Combined acc.
ResNet50-Baseline      76.26           62.66          53.01
ResNet50-Proposed      75.9            79.84          68.83
InceptionV4-Baseline   80.11           72.68          63.56
InceptionV4-Proposed   79.91           87.95          77.71
Table 1: Accuracy comparison for learning dense layers only. All accuracies are computed for the single crop top-1 setting. The proposed method achieves significantly higher concept and combined accuracies than the baseline for both architectures.

Analysis: Figure 4(a) shows some example categories where the proposed model with the InceptionV4 backbone generated different concept orders than those in the condensed ontology. Monitor and Space bar are children of the Equipment and Implement concepts, respectively, in the condensed ontology. But they contain many images of Desktop computers and Computer keyboards, both of which are children of the Device concept. As a result, 68% and 88% of Monitor and Space bar images, respectively, were classified as Artifact → Instrumentality → Device. Similarly, although it is grouped under the Matter → Solid concepts, 75% of the Mushroom images were predicted as Living thing, which includes the Earthstar and Stinkhorn categories that are visually very similar to Mushroom images. Motor scooter was placed in Wheeled vehicle in our compressed hierarchy, but at test time 94% of its images were predicted as Self propelled vehicle.

In general, predictions from our proposed model concur more with the concept chains for descendants of Living thing than they do for those of the Artifact concept. This perhaps suggests that concept lineages of Living things have fewer ambiguities than those of Artifacts in the condensed hierarchy.

Since different category classes are predicted at different layers of the proposed dense structure, it is reasonable to verify whether or not the category classification capability is impaired by the depth of the layer. To test this, we have plotted in Figure 4(b) (top) the category prediction accuracies for each of the classes of the baseline against those of the proposed CNN with the ResNet50 backbone. The plot implies no clear effect of the prediction depths on the classification performance for different categories, as they remain the same as or very close to those of the original architecture.

Figure 4: (a) Concept misclassification examples: categories where the proposed method predicted a concept order different from the condensed hierarchy. (b), top: Category-wise classification accuracy of the proposed method vs the baseline architecture (w/ ResNet50). (b), bottom: Progression of validation accuracy of the proposed CNN (blue) and the baseline (red).

Due to the increased number of dense layers and additional sigmoid activations, it is perhaps natural to expect the proposed architecture to require more iterations to converge. As Figure 4(b) (bottom) demonstrates, our model indeed takes more epochs (x-axis) to attain a category classification performance (y-axis) similar to that of the baseline built upon ResNet50.

5.2 Learning Full Network from Scratch

In this section, we experiment with training all convolutional and dense layer weights of the proposed network from scratch. One could anticipate the network to learn more informative features for capturing the class hierarchy information than those learned by the existing image classifiers to recognize only fine categories. We use the ResNet 50 architecture for this experiment.

Training: For training, we applied stochastic gradient descent (SGD) optimizer with momentum value 0.9 and initial learning rate 0.2. The learning rate was decreased every 30 epochs by a factor of 10 (similar to [34, 20] training). All other values of the hyperparameters remain the same as the last experiment; also the network was trained with only concept loss for first 2 epochs before optimization began on all variables in the CNN.
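The schedule for this experiment combines a step decay of the learning rate with a short concept-only warm-up. The sketch below restates it with hypothetical function names; the initial rate, step length, decay factor and warm-up length are taken from the paragraph above.

```python
def step_lr(epoch, initial_lr=0.2, step=30, factor=0.1):
    """Learning rate: divided by 10 every 30 epochs."""
    return initial_lr * factor ** (epoch // step)


def active_losses(epoch, warmup_epochs=2):
    """Loss terms optimized at a given epoch (0-based).

    Only the concept loss is used during the warm-up; afterwards the
    category loss is optimized as well.
    """
    if epoch < warmup_epochs:
        return ("concept",)
    return ("concept", "category")


plan = [(e, step_lr(e), active_losses(e)) for e in (0, 1, 2, 29, 30, 60)]
```

Each entry of `plan` lists the epoch, the learning rate in effect and the losses being minimized, which makes the two phase changes (end of warm-up, learning rate drops) easy to inspect.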

Evaluation: The accuracy values in Table 2 suggest that learning the overall network improves the combined accuracy of the baseline network from 53.01 (Table 1) to 59.7. However, the combined accuracy of the proposed method is still better than that of the baseline method.

Model                Category acc.   Concept acc.   Combined acc.
ResNet50-Baseline    75.24           69.87          59.7
ResNet50-Proposed    73.5            78.14          65.74
Table 2: Accuracy comparison for learning full network from scratch. All accuracies are computed for the single crop top-1 setting. The proposed method achieves significantly higher concept and combined accuracies than the baseline.

5.3 PASCAL VOC 2012 Dataset

In this section, we test the generalization of the knowledge learned by the proposed architecture. Ideally, the proposed network should be able to extrapolate its understanding of the concept superclasses learned from one category set to previously unseen categories. That is, after learning that a zebra is a mammal, it should be able to identify a horse (of any color) as a mammal too.

We use the PASCAL VOC 2012 dataset [15] to assess the ability of the proposed networks to generalize the knowledge learned about the concept lineage from ImageNet12. PASCAL VOC 2012 is a standard, well-curated dataset with some loose connections, but no one-to-one correspondence, between its categories and those of ImageNet12. We applied the aforementioned classifiers learned from the ImageNet12 dataset to the images of the VOC 2012 trainval split.

Each of the VOC 2012 categories was assigned to one of the concepts in the condensed label hierarchy of the ImageNet12 dataset that we have been using so far. We excluded the classes {Person, Pottedplant} since these categories were underrepresented in the ImageNet12 dataset. For example, Person is represented by three rare subclasses: Scuba diver, Ballplayer, and Groom. On the other hand, the {Dog, Bird} categories were overrepresented in ImageNet12 and their subcategories were assigned to multiple concepts in our hierarchy. These classes could lead to inconsistency and were therefore also excluded from the evaluation. After removing the images that overlap with multiple categories, we are left with 6941 images from 16 categories. For these images, we report the concept accuracy in Table 3 (the category and combined accuracies cannot be computed as there is no one-to-one correspondence between the categories of the two datasets).
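The selection procedure above can be illustrated with a short sketch. It is only an approximation under stated assumptions: the annotation format (an image id mapped to a set of category names), the helper names, and the toy data are hypothetical, and "overlap with multiple categories" is read here as "the image is labeled with more than one category". The 20 category names are the standard VOC ones, and the four exclusions come from the text.

```python
# The 20 PASCAL VOC 2012 object categories (standard VOC names).
VOC_CATEGORIES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

# Categories excluded in the text (under- or over-represented in ImageNet12).
EXCLUDED = {"person", "pottedplant", "dog", "bird"}


def retained_categories():
    """Categories that remain after the exclusions."""
    return [c for c in VOC_CATEGORIES if c not in EXCLUDED]


def select_images(annotations):
    """Keep images labeled with exactly one category that is also retained.

    `annotations` maps an image id to the set of categories present
    (a hypothetical format chosen for this sketch).
    """
    keep = set(retained_categories())
    selected = {}
    for image_id, cats in annotations.items():
        if len(cats) == 1:
            (cat,) = tuple(cats)
            if cat in keep:
                selected[image_id] = cat
    return selected


# Toy annotations, invented for illustration only.
toy = {
    "img_a": {"horse"},
    "img_b": {"horse", "person"},  # several categories: dropped
    "img_c": {"dog"},              # excluded category: dropped
    "img_d": {"train"},
}
print(len(retained_categories()))
print(select_images(toy))
```

Removing the four excluded classes from the 20 VOC categories leaves 16, which agrees with the number of category columns in Table 3.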

Method cat cow horse sheep plane boat bike mbike bus train car botl chair dtable sofa montr avg
BaseRN 61.2 46.39 34.6 50.3 64.6 57.2 70.6 24.5 41.6 1.1 1.7 23.7 22.7 1 19.5 32.3 37.12
OurRN 74.2 76.2 51.2 73.3 83.2 62.7 79 7.1 47.9 3.2 3.3 24.5 33.7 8.5 37.9 37.8 46.21
BaseRN-S 59.1 50.8 41.9 56.0 77 53.1 72.5 40.5 63.1 1.5 2.2 21 20.5 3.1 16.1 35.9 40.29
OurRN-S 77.4 78.6 62.9 74.3 87.0 68.0 77.8 9.1 41.6 3.8 2.5 36.2 23 8.5 27.4 53.7 48.27
Base-IN 64.9 63.2 47.2 60.5 80.3 45.9 45 2.9 71 1.9 2.1 25.3 19.3 3.1 24 32.3 40.3
Our-IN 73.6 80.7 54.3 78.5 84.4 66.1 73.7 6.1 66.1 4.8 1.6 39.7 35.3 12.7 35.7 47.8 48.69
Table 3: Concept accuracy comparison when testing on the PASCAL VOC 2012 trainval subset. RN: ResNet50 dense only, RN-S: ResNet50 from scratch, IN: InceptionV4. All accuracies are computed for the single crop top-1 setting. The proposed method achieves a significantly higher average accuracy than the baseline for both architectures.

The accuracy values for concept classification on VOC 2012 indicate that the proposed architecture generalizes the knowledge it learned from the ImageNet12 hierarchy of ancestor superclasses better than the baseline. All the CNN models performed weakly on the categories {Train, Car, Diningtable, Motorbike}. This can be attributed to equivocal ancestry within the condensed ontology. For the category Train, the proposed model with the InceptionV4 architecture predicted Entity → Artifact → Instrumentality → Container → Wheeled vehicle → Self propelled vehicle for 56% of the images, whereas the compressed ontology assigns it the concept order Entity → Artifact → Instrumentality → Conveyance. It is important to note, though, that both the baseline and the proposed models performed poorly on the ambiguous categories, i.e., the proposed method does not draw any unfair advantage from the label ambiguity.
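The Train example can be made concrete by comparing the two concept chains quoted above. The sketch below assumes that a concept prediction counts as correct only when the whole chain matches the reference; that scoring rule and the helper names are assumptions for illustration, while the two chains are copied from the text.

```python
def shared_prefix_length(predicted, reference):
    """Number of leading concepts on which the two chains agree."""
    n = 0
    for p, r in zip(predicted, reference):
        if p != r:
            break
        n += 1
    return n


def chains_match(predicted, reference):
    """Exact-match scoring (assumed here): every concept and the length must agree."""
    return list(predicted) == list(reference)


# Chain predicted for Train by the proposed InceptionV4 model (from the text).
predicted = ["Entity", "Artifact", "Instrumentality", "Container",
             "Wheeled vehicle", "Self propelled vehicle"]
# Chain assigned to Train by the compressed ontology (from the text).
reference = ["Entity", "Artifact", "Instrumentality", "Conveyance"]

print(chains_match(predicted, reference))
print(shared_prefix_length(predicted, reference))
```

Under this scoring the prediction is counted as an error even though the two chains agree on their first three concepts, which is one way to see how an ambiguous ontology depresses the reported accuracy.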

6 Discussion

In this paper, we introduce multilayer dense connections for classifying both the category and the ordered chain of ancestor concept classes of an input image. We demonstrate the advantages of the proposed model experimentally with two popular architectures in different settings. The experimental results imply that, when augmented with the proposed dense layers, existing CNN architectures can learn the lineage of concept superclasses without sacrificing category-wise accuracy. The fact that one can train only the dense layers to extract the conceptual relationships (Section 5.1) strongly suggests that any existing CNN classifier can conveniently gain this capacity by modifying only the final layers, using the same optimization strategy and without training the feature layers. Our analysis also indicates that the concept misclassifications can largely be attributed to ambiguities in the label hierarchy itself.

One may argue that the chain of concepts can be read off from a lookup table storing the label hierarchy. While this may sound like an easy solution computationally, this strategy has multiple limitations. First, a lookup table can only predict the concept accurately when the category classification is correct. In contrast, as Tables 1, 2, and 3 indicate, our method can correctly identify the concept lineage even if the category is not correct. This ability is useful when, e.g., a rattlesnake is misclassified at the finer category level but is still detected as a snake rather than an artifact. Second, our method can classify the concept lineage of an object whose category it has not been trained on, e.g., Horse in the PASCAL VOC 2012 dataset. Third, and perhaps the most appealing advantage of the proposed method, is that it enables the possibility of learning the concept organization using network search methods [47, 32]. This capability is essential for an intelligent machine to attain cognition.
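The first limitation can be illustrated with a toy comparison. Everything in the sketch below is hypothetical: the abbreviated hierarchy, the category names, and the stand-in value for the directly predicted lineage are invented for illustration and do not reproduce the condensed ImageNet12 ontology or any model output.

```python
# Hypothetical, heavily abbreviated label hierarchy (illustrative labels only).
LOOKUP = {
    "rattlesnake": ["Entity", "Living thing", "Animal", "Snake"],
    "chain": ["Entity", "Artifact"],
}


def lineage_from_lookup(predicted_category):
    """Concept chain read off the table; it inherits any category error."""
    return LOOKUP[predicted_category]


true_lineage = LOOKUP["rattlesnake"]

# Assumed failure case: the category output confuses the snake with an artifact.
predicted_category = "chain"
# Stand-in for a lineage predicted directly by concept layers (not a real output).
predicted_lineage = ["Entity", "Living thing", "Animal", "Snake"]

lookup_correct = lineage_from_lookup(predicted_category) == true_lineage
direct_correct = predicted_lineage == true_lineage
print(lookup_correct, direct_correct)
```

The lookup table can only return the lineage of whatever category was predicted, so its concept output is tied to the category error, whereas a separately predicted lineage may still be right.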

Our dense connectivity can also be extended to both two-stage and single-shot object detection algorithms. Modifying the two-stage methods [19, 28] is straightforward: the final detection head can be replaced by our connections. For SSD-type architectures [31, 29], the dense computations can be carried out by convolutions (or extended to ) and added to the classification branches at each scale. The convolutional form should, in principle, enable semantic segmentation techniques [6, 7, 46] to adopt our model as well.
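The kernel sizes in the sentence about SSD-type architectures are missing from this copy of the text. The sketch below therefore assumes a 1×1 kernel, which is a standard way to apply a dense layer at every spatial location; whether this is the exact form the authors intend is an assumption. It uses plain Python with small integer values so that the comparison is exact, and all names and numbers are illustrative.

```python
def dense(x, weight, bias):
    """Dense layer on one feature vector: weight is K x C, bias has K entries."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def conv2d(fmap, kernel, bias):
    """Valid 2-D convolution (cross-correlation form).

    fmap: H x W x C nested lists; kernel: K x kh x kw x C; bias: K entries.
    """
    height, width = len(fmap), len(fmap[0])
    kh, kw = len(kernel[0]), len(kernel[0][0])
    out = []
    for i in range(height - kh + 1):
        row = []
        for j in range(width - kw + 1):
            vec = []
            for k, filt in enumerate(kernel):
                s = bias[k]
                for di in range(kh):
                    for dj in range(kw):
                        for c, w in enumerate(filt[di][dj]):
                            s += w * fmap[i + di][j + dj][c]
                vec.append(s)
            row.append(vec)
        out.append(row)
    return out


# Toy 2 x 2 feature map with 3 channels and a dense layer with 2 outputs.
fmap = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [1, 0, 1]]]
weight = [[1, 0, -1], [2, 1, 0]]
bias = [0, 1]

# Reshape the K x C dense weights into a K x 1 x 1 x C convolution kernel.
kernel = [[[list(row)]] for row in weight]

per_location = [[dense(pixel, weight, bias) for pixel in row] for row in fmap]
print(conv2d(fmap, kernel, bias) == per_location)
```

Because the 1×1 kernel reuses the dense weights at every location, the same parameters can serve a classification branch at any scale of the feature pyramid.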

Furthermore, rather than capturing the conceptual class lineage, the dense layers can be modeled after a different contextual relationship such as spatial or compositional consistency (via can-coexist-with [23] or is-part-of [4, 40] relations). These relationship graphs, e.g., a parsing graph [39], can be either precomputed or learned for the particular task at hand.

We believe this study will encourage researchers to conduct further research on structured prediction within CNN architectures.




References

  1. Google vision.
  2. Amazon rekognition. (2016)
  3. Bengio, S., Weston, J., Grangier, D.: Label embedding trees for large multi-class tasks. In: NIPS (2010)
  4. Brust, C., Denzler, J.: Integrating domain knowledge: using hierarchies to improve deep classifiers. CoRR abs/1811.07125 (2018)
  5. Cao, Q., Liang, X., Li, B., Li, G., Lin, L.: Visual question reasoning on general dependency tree. CoRR abs/1804.00105 (2018)
  6. Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587 (2017)
  7. Chen, L., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. CoRR abs/1802.02611 (2018)
  8. Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1800–1807 (2017)
  9. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017)
  10. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR (2009)
  11. Deng, J., Berg, A.C., Li, K., Fei-Fei, L.: What does classifying more than 10,000 image categories tell us? In: ECCV (2010)
  12. Deng, J., Ding, N., Jia, Y., Frome, A., Murphy, K., Bengio, S., Li, Y., Neven, H., Adam, H.: Large-scale object classification using label relation graphs. In: ECCV (2014)
  13. Deng, J., Satheesh, S., Berg, A.C., Li, F.: Fast and balanced: Efficient label tree learning for large scale object recognition. In: NIPS (2011)
  14. Ding, N., Deng, J., Murphy, K., Neven, H.: Probabilistic label relation graphs with ising models. In: ICCV (2015)
  15. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision 88(2), 303–338 (Jun 2010)
  16. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. Language, Speech, and Communication, MIT Press, Cambridge, MA (1998)
  17. Fergus, R., Bernal, H., Weiss, Y., Torralba, A.: Semantic label sharing for learning with many categories. In: ECCV (2010)
  18. Guo, Y., Liu, Y., Bakker, E.M., Guo, Y., Lew, M.S.: Cnn-rnn: a large-scale hierarchical image classification framework. Multimedia Tools and Applications 77, 10251–10271 (2017)
  19. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: ICCV. pp. 2980–2988 (2017)
  20. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015)
  21. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. CoRR abs/1603.05027 (2016)
  22. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861 (2017)
  23. Hu, H., Zhou, G., Deng, Z., Liao, Z., Mori, G.: Learning structured inference neural networks with label relations. CoRR abs/1511.05616 (2015)
  24. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018)
  25. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  26. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 25 (2012)
  27. Liang, X.: Learning personalized modular network guided by structured knowledge. In: CVPR (June 2019)
  28. Lin, T., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)
  29. Lin, T.Y., Goyal, P., Girshick, R.B., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV (2017)
  30. Liu, C., Zoph, B., Shlens, J., Hua, W., Li, L., Fei-Fei, L., Yuille, A.L., Huang, J., Murphy, K.: Progressive neural architecture search. CoRR abs/1712.00559 (2017)
  31. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.E., Fu, C., Berg, A.C.: SSD: single shot multibox detector. CoRR abs/1512.02325 (2015)
  32. Pham, H., Guan, M., Zoph, B., Le, Q., Dean, J.: Efficient neural architecture search via parameters sharing. In: ICML (2018)
  33. Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: Inverted residuals and linear bottlenecks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4510–4520 (2018)
  34. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
  35. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: ICLR 2016 Workshop (2016)
  36. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. CoRR abs/1512.00567 (2015)
  37. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: ICML (2019)
  38. Torralba, A., Murphy, K.P., Freeman, W.T.: Sharing features: efficient boosting procedures for multiclass object detection. In: CVPR (2004)
  39. Tu, Z., Chen, X., Yuille, A.L., Zhu, S.C.: Image parsing: Unifying segmentation, detection, and recognition. Int. J. Comput. Vision 63(2), 113–140 (Jul 2005)
  40. Wang, P., Wu, Q., Shen, C., van den Hengel, A., Dick, A.R.: FVQA: fact-based visual question answering. CoRR abs/1606.05433 (2016)
  41. Wu, J., Rehg, J.M., Mullin, M.D.: Learning a rare event detection cascade by direct feature selection. In: NIPS (2004)
  42. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  43. Yan, Z., Zhang, H., Piramuthu, R., Jagadeesh, V., DeCoste, D., Di, W., Yu, Y.: HD-CNN: Hierarchical deep convolutional neural networks for large scale visual recognition. In: ICCV. pp. 2740–2748 (2015)
  44. Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018)
  45. Zhao, B., Li, F., Xing, E.P.: Large-scale category structure aware image categorization. In: NIPS (2011)
  46. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. CoRR abs/1612.01105 (2016)
  47. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. CoRR abs/1707.07012 (2017)