# Learning Structured Inference Neural Networks with Label Relations

###### Abstract

Images of scenes have various objects as well as abundant attributes, and diverse levels of visual categorization are possible. A natural image could be assigned fine-grained labels that describe major components, coarse-grained labels that depict high-level abstraction, or a set of labels that reveal attributes. Such categorization at different concept layers can be modeled with label graphs encoding label information. In this paper, we exploit this rich information with a state-of-the-art deep learning framework, and propose a generic structured model that leverages diverse label relations to improve image classification performance. Our approach employs a novel stacked label prediction neural network, capturing both inter-level and intra-level label semantics. We evaluate our method on benchmark image datasets, and empirical results illustrate the efficacy of our model.

## 1 Introduction

Standard image classification is a fundamental problem in computer vision -- assigning category labels to images. It can serve as a building block for many different computer vision tasks including object detection, visual segmentation, and scene parsing. Recent progress in deep learning [18, 29, 30, 31] has significantly improved classification performance on large-scale image datasets [25, 37, 1, 20]. Approaches typically assume image labels to be semantically independent and adopt either a multi-class or binary classifier to label images. In recent work [2, 5], deep learning methods that take advantage of label relations have been proposed to improve image classification performance.

However, in realistic settings, these label relationships could form a complicated graph structure. Take Figure 1 as an example. Various levels of interpretation could be formed to represent such an image. This image of a baseball scene could be described as an outdoor image at coarse level, or with a more concrete concept such as sports field, or with even more fine-grained labels such as batter’s box and objects such as grass, bat, person.

Models that incorporate semantic label relationships could be utilized to generate better classification results. The desiderata for these models include the ability to model label-label relations such as positive or negative correlation, to respect multiple concept layers obtainable from sources such as WordNet, and to handle partially observed label data -- given a subset of accurate labels for an image, infer the remaining missing labels.

The contribution of this paper is in developing a structured inference neural network that permits modeling complex relations between labels, ranging from hierarchical to within-layer dependencies. We do this by defining a network in which a node is activated if its corresponding label is present in an image. We introduce stacked layers among these label nodes. These encode layer-wise connectivity among label classification scores, representing dependency from top-level coarse labels to bottom-level fine-grained labels. Activations are propagated bidirectionally and asynchronously on the label relation graph, passing information about the labels within or across concept layers to refine the labeling for the entire image.

## 2 Related Work

Multi-level labeling of images has been addressed in a number of frameworks. In this section we review relevant work within probabilistic, max-margin, multi-task, and deep learning.

Structured label prediction with external knowledge: Structured prediction approaches exist [32, 34], in which a set of class labels are predicted jointly under a fixed loss function. Traditional approaches learn graph structure as well as associated weights that best explain the training data (e.g., [3]). When external knowledge of label relations (e.g., a taxonomy) is available, it is beneficial to integrate this knowledge to guide the traditional supervised learning systems. For example, Grauman et al. [8] and Hwang et al. [11] took the WordNet category taxonomy to improve visual recognition. Johnson et al. [14] and McAuley and Leskovec [22] used metadata from a social network to improve image classification. Ordonez et al. [24] leveraged associated image captions (words of "naturalness") to estimate entry-level labels of visual objects.

Multi-label classification with label relations: Traditional multi-label classification cannot avoid predicting an image as both cat and dog, or an image as carnation but not flower. Using external knowledge of label relations, Deng et al. [2] proposed a representation, the HEX graph, to express and enforce exclusion, inclusion and overlap relations between labels in multi-label classification. This model was further extended for "soft" label relations using the Ising model by Ding et al. [5].

Structured model with convolutional neural networks (CNNs): Structured deep models extend traditional CNNs to applications of structured label prediction, for which the CNN model is found insufficient to learn implicit constraints or structures between labels. Structured deep learning therefore jointly learns a structured model with the CNN framework. For example, for human pose estimation, Tompson et al. [33] take the CNN predictions as unary potentials for body parts and feed them to an MRF-like spatial model, which further learns pairwise potentials of part relations. Schwing and Urtasun [28] proposed a structured deep network by concatenating a densely connected MRF model to a CNN for semantic image segmentation, in which the CNN provides unary potentials while the MRF model imposes smoothness. Deng et al. [4] proposed a recurrent network that jointly performs message passing-style inference and learns graph structure for group activity recognition.

Our work combines these lines of work. We take the WordNet taxonomy as our external knowledge, expressing it as a label relation graph, and learning the structured labels within a deep network framework. Our contribution is in proposing a learning and inference algorithm that facilitates knowledge passing in the deep network based on label relations.

Multi-task joint learning: Multi-task learning follows the same spirit as structured label prediction, with the distinction that the outputs of multiple (different but related) tasks are estimated. Common jointly modeled tasks include segmentation and detection [19, 35], segmentation and pose estimation [16], or segmentation and object classification [21]. An emerging topic of joint learning is in image understanding and text generation by leveraging inter-modal correspondences between visual data and human language [17, 15].

Our work can be naturally extended to multi-task learning, for which each layer of our model represents one task and the labels do not necessarily form a layered structure. Notably, we can improve existing multi-task learning methods by importing knowledge of intra-task label relations.

## 3 Method

Our model jointly classifies images in a layered label space with external label relations. The goal is to leverage the label relations to improve inference over the layered visual concepts.

We build our model on top of a state-of-the-art deep learning platform: given an image, we first extract CNN features from Krizhevsky et al. [18] as visual activations at each concept layer. Concept layers are stacked from fine-grained to coarser levels. Label relations are defined between consecutive layers and form a layered graph. Inference over the label relation graph is inspired by the recent success of Recurrent Neural Networks (RNNs) [10, 27], where we treat each concept layer as a timestep of an RNN. We connect neighboring timesteps to reflect the inter-layer label relations, while capturing intra-layer relations within each timestep. The label activations are propagated bidirectionally and asynchronously in the label relation graph to refine labeling for the given image. Figure 2 shows an overview of our classification pipeline.

We denote the collection of training images as $\{I^i\}_{i=1}^{N}$, each with a ground-truth label in every concept layer. We denote the labels of image $I^i$ as $\{y^i_t\}_{t=1}^{m}$, where $m$ is the total number of concept layers, and each concept layer $t$ has $n_t$ labels. The CNN framework of Krizhevsky et al. [18] transforms each image $I^i$ into a 4096-dimensional feature vector, denoted as $\mathrm{CNN}(I^i)$.

### 3.1 Learning Framework

It is straightforward to build an image classification model by adding a loss layer on top of the CNN features for each concept layer. Specifically, given the CNN feature vector $\mathrm{CNN}(I^i)$ of image $I^i$, the activations on concept layer $t$ are computed as

$$x^i_t = W_t \cdot \mathrm{CNN}(I^i) + b_t \tag{1}$$

where $W_t$ and $b_t$ are linear transformation parameters and biases to classify the $n_t$ labels at concept layer $t$. Note that $x^i_t$ provides visual activation depending purely on the image $I^i$. To generate label-specific probabilities, we can simply apply a sigmoid function (i.e., $\sigma(z) = 1 / (1 + e^{-z})$) on the elements of $x^i_t$.
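As a minimal sketch of Eq. (1), the snippet below computes one vector of visual activations per concept layer from a single feature vector and converts them to per-label probabilities. The function names, the random weights, and the reduced feature size are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def visual_activations(feature, weights, biases):
    """Eq. (1): one linear map per concept layer, x_t = W_t f + b_t."""
    return [W @ feature + b for W, b in zip(weights, biases)]

rng = np.random.default_rng(0)
feature_dim = 8                  # stands in for the 4096-d CNN feature
layer_sizes = [3, 5, 7]          # hypothetical label counts per concept layer
feature = rng.normal(size=feature_dim)
weights = [rng.normal(size=(n, feature_dim)) for n in layer_sizes]
biases = [np.zeros(n) for n in layer_sizes]

x = visual_activations(feature, weights, biases)
probs = [sigmoid(x_t) for x_t in x]   # independent per-label probabilities
```

Each layer is classified independently here, which is exactly the limitation the inference models in the following subsections address.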

This classification model does not accommodate label relations within or across concept layers. To leverage the benefit of label relations, we adopt an RNN-like inference framework. In the following, we first describe a top-down inference model, then a bidirectional inference model, and lastly propose our Structured Inference Neural Network, the SINN.

#### 3.1.1 Top-Down Inference Neural Network

Our model is inspired by the recent success of RNNs [9, 26], which make use of dynamic sequential information in learning. RNNs are called recurrent models because they perform the same computation at every timestep, with each output depending on the current input and the previous outputs. We apply a similar idea to our layered label prediction problem: we consider each concept layer as an individual timestep, and model the label relations within and across concept layers in the recurrent learning framework.

Specifically, at each timestep $t$, we compute the activations $a^i_t$ of image $I^i$ based on two terms: $a^i_{t-1}$, which are the activations from the previous timestep $t-1$, and $x^i_t$, which are the activations from Eq. (1). The message passing process is defined as:

$$a^i_t = V_{t-1,t} \cdot a^i_{t-1} + H_t \cdot x^i_t + b_t \tag{2}$$

where $V_{t-1,t}$ are the inter-layer model parameters capturing the label relations between two consecutive concept layers in top-down order, $H_t$ are the intra-layer model parameters to account for the label relations within each concept layer, and $b_t$ are the model biases. A sigmoid function can be applied to $a^i_t$ to obtain label-specific prediction probabilities for image $I^i$.

Note that the inference process in Eq. (2) is different from standard RNN learning: Eq. (2) unties $V$ and $H$ in each timestep, while standard RNNs learn the same $V$ and $H$ parameters over and over on all timesteps.
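The top-down recursion can be sketched as follows, with a separate (untied) pair of weight matrices per concept layer. How the first layer is handled (no incoming inter-layer message), the variable names, and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def top_down_inference(x, V, H, b):
    """Propagate activations from the first concept layer to the last.

    x[t]: visual activations of layer t (as in Eq. (1)).
    V[t]: inter-layer weights mapping layer t-1 to layer t (V[0] is unused).
    H[t]: intra-layer weights of layer t; b[t]: bias of layer t.
    """
    a = []
    for t in range(len(x)):
        a_t = H[t] @ x[t] + b[t]
        if t > 0:                       # assumed: the first layer has no predecessor
            a_t = a_t + V[t] @ a[t - 1]
        a.append(a_t)
    return a

sizes = [2, 3, 4]                       # hypothetical label counts per layer
rng = np.random.default_rng(0)
x = [rng.normal(size=n) for n in sizes]
H = [rng.normal(size=(n, n)) for n in sizes]
b = [np.zeros(n) for n in sizes]
V = [None] + [rng.normal(size=(sizes[t], sizes[t - 1])) for t in (1, 2)]
a = top_down_inference(x, V, H, b)
```

Because every layer has its own `V[t]` and `H[t]`, layers with different numbers of labels can be chained, which a weight-tied RNN would not allow.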

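To learn the model parameters $V$'s and $H$'s, we apply a sigmoid function $\sigma$ on the activations $a^i_t$, and minimize the logistic cross-entropy loss with respect to $V$'s and $H$'s:

$$E(a^i_t) = -\sum_{j=1}^{n_t} \Big[ \mathbb{1}(y^i_{t,j} = 1) \log \sigma(a^i_{t,j}) + \mathbb{1}(y^i_{t,j} = 0) \log\big(1 - \sigma(a^i_{t,j})\big) \Big] \tag{3}$$

where $\mathbb{1}(z)$ is an indicator function which returns 1 if $z$ is true and 0 otherwise, $y^i_{t,j}$ is the ground-truth value of the $j$-th label of image $I^i$ in concept layer $t$, and $n_t$ is the number of labels in that layer.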
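For concreteness, the logistic cross-entropy of Eq. (3) for one concept layer can be sketched as below. Computing it through `np.logaddexp` is a numerical-stability choice of this sketch, not something the text prescribes, and the example values are made up.

```python
import numpy as np

def logistic_cross_entropy(a, y):
    """Summed binary cross-entropy between sigmoid(a) and 0/1 labels y."""
    # log(sigmoid(a)) = -log(1 + exp(-a)); logaddexp avoids overflow
    log_p = -np.logaddexp(0.0, -a)
    # log(1 - sigmoid(a)) = -log(1 + exp(a))
    log_not_p = -np.logaddexp(0.0, a)
    return -np.sum(y * log_p + (1 - y) * log_not_p)

a = np.array([2.0, -1.0, 0.0])   # activations of three labels
y = np.array([1, 0, 1])          # ground-truth labels
loss = logistic_cross_entropy(a, y)
```

The loss is summed over the labels of a layer; summing it again over layers and images gives the training objective.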

#### 3.1.2 BINN: Bidirectional Inference Neural Network

A concept layer is equally related to both of its neighboring layers, so it makes more sense to model inference in both directions. Therefore, we adopt the idea of the bidirectional recurrent neural network [27], and propose the following bidirectional inference model:

$$\overrightarrow{a}^i_t = \overrightarrow{V}_{t-1,t} \cdot \overrightarrow{a}^i_{t-1} + \overrightarrow{H}_t \cdot x^i_t + \overrightarrow{b}_t \tag{4}$$

$$\overleftarrow{a}^i_t = \overleftarrow{V}_{t+1,t} \cdot \overleftarrow{a}^i_{t+1} + \overleftarrow{H}_t \cdot x^i_t + \overleftarrow{b}_t \tag{5}$$

$$a^i_t = \overrightarrow{U}_t \cdot \overrightarrow{a}^i_t + \overleftarrow{U}_t \cdot \overleftarrow{a}^i_t + b_{a,t} \tag{6}$$

where Eqs. (4) and (5) proceed as top-down propagation and bottom-up propagation, respectively, and Eq. (6) aggregates the top-down and bottom-up messages into final activations for label prediction. Here $\overrightarrow{U}_t$ and $\overleftarrow{U}_t$ are aggregation model parameters (with bias $b_{a,t}$), and we use the arrows $\rightarrow$ and $\leftarrow$ to indicate the directions of label propagation.

As in the top-down inference model, the bidirectional inference model captures both inter-layer and intra-layer label relations in the model parameters $V$'s and $H$'s. For inter-layer relations, we connect a label in one concept layer to any label in its neighboring concept layers. For intra-layer relations, we model fully-connected relations within each concept layer. The model parameters $V$'s, $H$'s and $U$'s are learned by minimizing the cross-entropy loss defined in Eq. (3).
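The two passes and the aggregation step can be sketched as below. The dictionary layout of the parameters, the boundary handling at the first and last layers, and the toy values are assumptions of this sketch.

```python
import numpy as np

def bidirectional_inference(x, fwd_params, bwd_params, agg_params):
    """Top-down pass, bottom-up pass, then per-layer aggregation.

    fwd_params / bwd_params: dicts of per-layer lists "V", "H", "b".
    agg_params: dict of per-layer lists "U_fwd", "U_bwd", "b".
    """
    m = len(x)
    fwd, bwd = [None] * m, [None] * m
    for t in range(m):                                  # top-down, Eq. (4)
        fwd[t] = fwd_params["H"][t] @ x[t] + fwd_params["b"][t]
        if t > 0:
            fwd[t] = fwd[t] + fwd_params["V"][t] @ fwd[t - 1]
    for t in reversed(range(m)):                        # bottom-up, Eq. (5)
        bwd[t] = bwd_params["H"][t] @ x[t] + bwd_params["b"][t]
        if t < m - 1:
            bwd[t] = bwd[t] + bwd_params["V"][t] @ bwd[t + 1]
    return [agg_params["U_fwd"][t] @ fwd[t] + agg_params["U_bwd"][t] @ bwd[t]
            + agg_params["b"][t] for t in range(m)]     # aggregation, Eq. (6)

sizes = [2, 3]                                          # hypothetical layer sizes
rng = np.random.default_rng(0)
x = [rng.normal(size=n) for n in sizes]
eye = [np.eye(n) for n in sizes]
zero_b = [np.zeros(n) for n in sizes]
fwd_params = {"V": [None, rng.normal(size=(3, 2))], "H": eye, "b": zero_b}
bwd_params = {"V": [rng.normal(size=(2, 3)), None], "H": eye, "b": zero_b}
agg_params = {"U_fwd": eye, "U_bwd": eye, "b": zero_b}
a = bidirectional_inference(x, fwd_params, bwd_params, agg_params)
```

With identity intra-layer and aggregation weights, each output is simply twice the visual activation plus the message received from the neighboring layer, which makes the data flow easy to check by hand.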

#### 3.1.3 SINN: Structured Inference Neural Network

The fully connected bidirectional model is capable of representing all types of label relations. In practice, however, it is hard to train a model on limited data due to the large number of free parameters. To avoid this problem, we use a structured label relation graph to restrict information propagation.

We use structured label relations of positive correlation and negative correlation as prior knowledge to refine the model. Here is the intuition: since we know that office is an indoor scene, beach is an outdoor scene, and indoor and outdoor are mutually exclusive, a high score on indoor should increase the probability of label office and decrease the probability of label beach. Labels that are not semantically related, e.g. motorcycle and shoebox, should not affect each other. The structured label relations can be obtained from semantic taxonomies, or by parsing WordNet relations [23]. We describe the details of extracting label relations in Section 4.

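We introduce the notation $V^{+}_{t-1,t}$, $V^{-}_{t-1,t}$, $H^{+}_{t}$ and $H^{-}_{t}$ to explicitly capture structured label relations between and within concept layers, where the superscripts $+$ and $-$ indicate positive and negative correlation, respectively. These model parameters are masked matrices capturing the label relations. Instead of learning the fully parameterized matrices $\overrightarrow{V}_{t-1,t}$, $\overleftarrow{V}_{t+1,t}$, $\overrightarrow{H}_t$ and $\overleftarrow{H}_t$, we freeze some elements to be zero if there is no semantic relation between the corresponding labels. For example, $V^{+}_{t-1,t}$ models the positive correlation between two concept layers: only the label pairs that have positive correlation have learnable model parameters, while the rest are zeroed out to remove potential noise. A similar setting applies to $V^{-}_{t-1,t}$, $H^{+}_{t}$ and $H^{-}_{t}$. Figure 3 shows an example positive correlation graph and a negative correlation graph between two layers.

To implement the positive and negative label correlation, we propose the following structured message passing process:

$$\overrightarrow{a}^i_t = \gamma\big(\overrightarrow{V}^{+}_{t-1,t} \cdot \overrightarrow{a}^i_{t-1}\big) + \gamma\big(\overrightarrow{H}^{+}_{t} \cdot x^i_t\big) - \gamma\big(\overrightarrow{V}^{-}_{t-1,t} \cdot \overrightarrow{a}^i_{t-1}\big) - \gamma\big(\overrightarrow{H}^{-}_{t} \cdot x^i_t\big) + \overrightarrow{b}_t \tag{7}$$

$$\overleftarrow{a}^i_t = \gamma\big(\overleftarrow{V}^{+}_{t+1,t} \cdot \overleftarrow{a}^i_{t+1}\big) + \gamma\big(\overleftarrow{H}^{+}_{t} \cdot x^i_t\big) - \gamma\big(\overleftarrow{V}^{-}_{t+1,t} \cdot \overleftarrow{a}^i_{t+1}\big) - \gamma\big(\overleftarrow{H}^{-}_{t} \cdot x^i_t\big) + \overleftarrow{b}_t \tag{8}$$

$$a^i_t = \overrightarrow{U}_t \cdot \overrightarrow{a}^i_t + \overleftarrow{U}_t \cdot \overleftarrow{a}^i_t + b_{a,t} \tag{9}$$

Here $\gamma(\cdot)$ stands for a ReLU activation function. It is essential for SINN, as it enforces that activations from positive correlation always make a positive contribution to the output activation, and keeps activations from negative correlation as a negative contribution (notice the minus signs in Eqs. (7) and (8)). To learn the model parameters $V$'s, $H$'s, and $U$'s, we optimize the cross-entropy loss in Eq. (3).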
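The effect of masking and of the ReLU can be seen on the indoor/outdoor intuition given earlier. The sketch below runs one directional step on a toy two-layer graph; the all-ones weights, the masks, and the function names are hypothetical stand-ins for learned parameters and a real label graph.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sinn_step(a_prev, x_t, V_pos, V_neg, H_pos, H_neg, b_t):
    """One directional SINN step; all weight matrices are already masked."""
    positive = relu(V_pos @ a_prev) + relu(H_pos @ x_t)
    negative = relu(V_neg @ a_prev) + relu(H_neg @ x_t)
    return positive - negative + b_t

# Toy graph. Rows: [office, beach]; columns: [indoor, outdoor].
pos_mask = np.array([[1.0, 0.0], [0.0, 1.0]])   # office-indoor, beach-outdoor
neg_mask = np.array([[0.0, 1.0], [1.0, 0.0]])   # office-outdoor, beach-indoor
weights = np.ones((2, 2))                        # stand-in for learned weights
V_pos, V_neg = weights * pos_mask, weights * neg_mask
no_intra = np.zeros((2, 2))                      # no intra-layer relations here

a_prev = np.array([3.0, 0.0])                    # strong "indoor" activation
a_t = sinn_step(a_prev, np.zeros(2), V_pos, V_neg, no_intra, no_intra, np.zeros(2))
# office is pushed up and beach is pushed down: a_t == [3., -3.]
```

Multiplying a weight matrix by a fixed 0/1 mask is one simple way to keep the frozen entries at zero during training, since their gradients are masked as well.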

### 3.2 Label Prediction

Now we introduce the method of predicting labels in test images with our model. As the model is trained with multiple concept layers, it is straightforward to recognize a label at each concept layer for the provided test image. This mechanism is called label prediction without observation (the default pipeline shown in Figure 2).

A more interesting application is to make predictions with partial observations -- we want to predict labels in one concept layer given labels in other concept layers. Figure 4 illustrates the idea. Given the image shown on the left side of Figure 4, we have more confidence in predicting it as batter’s box once we know it is an outdoor image with the attribute sports field.

To make use of the partially observed labels in our SINN framework, we need to transform the observed binary labels into soft activation scores for SINN to improve the label prediction on the target concept layers. Recall that SINN minimizes a cross-entropy loss which applies sigmoid functions on activations to generate label confidences. Thus, we reverse this process by applying the inverse sigmoid function on the binary ground-truth labels to obtain activations. Formally, we define the activation $a$ obtained from a ground-truth label $y \in \{0, 1\}$ as:

$$a = \sigma^{-1}(\tilde{y}) = \log \frac{\tilde{y}}{1 - \tilde{y}} \tag{10}$$

where $\tilde{y}$ is the label $y$ after a small perturbation $\epsilon$ that keeps it strictly between 0 and 1. Note that we put this perturbation on the ground-truth label for numerical stability. In our experiments, $\epsilon$ is set to a fixed small value.
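A minimal sketch of this label-to-activation conversion is given below. Clamping the label into `[eps, 1 - eps]` is one way to apply the perturbation, and the default `eps=1e-6` is an assumed value chosen only for illustration; neither is taken from the text.

```python
import math

def label_to_activation(y, eps=1e-6):
    """Inverse sigmoid of a binary label nudged away from 0 and 1 by eps.

    eps is an assumed value; without it, log(0) or division by zero occurs.
    """
    y_soft = min(max(float(y), eps), 1.0 - eps)
    return math.log(y_soft / (1.0 - y_soft))

observed = [1, 0, 1]                                  # observed binary labels
activations = [label_to_activation(y) for y in observed]
```

An observed positive label becomes a large positive activation and an observed negative label a large negative one, so the observed layer can be fed into message passing in place of predicted activations.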

### 3.3 Implementation Details

To optimize our learning objective, we use stochastic gradient descent with a mini-batch size of 50 images and a momentum of 0.9. For all training runs, we apply a two-stage policy as follows. In the first stage, we fix the pre-trained CNN and train our SINN with a learning rate of 0.01 and a fixed-size decay step. In the second stage, we set the learning rate to 0.0001 and fine-tune the CNN together with our SINN. We set the gradient clipping threshold to 25 to prevent gradient explosion. The weight decay value for our training procedure is set to 0.0005.
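The update rule implied by these hyperparameters can be sketched as a single SGD step. The text does not say whether clipping is by norm or by value, nor in which order clipping and weight decay are applied, so clipping by L2 norm before adding weight decay is an assumption of this sketch.

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm does not exceed threshold."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad

def sgd_step(w, velocity, grad, lr, momentum=0.9, weight_decay=0.0005, clip=25.0):
    """One momentum-SGD update with gradient clipping and L2 weight decay."""
    grad = clip_by_norm(grad, clip) + weight_decay * w
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w = np.array([1.0, 0.0])
velocity = np.zeros(2)
grad = np.array([30.0, 40.0])            # norm 50, so it is clipped to norm 25
w, velocity = sgd_step(w, velocity, grad, lr=0.01)   # first-stage learning rate
```

In the second stage, the same step would be applied with `lr=0.0001` to both the SINN and the CNN parameters.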

## 4 Experiments

Table 1: Methods compared at each concept layer on AwA.

Concept Layer | Method
---|---
28 taxonomy terms | CNN + Logistics
28 taxonomy terms | CNN + BINN
28 taxonomy terms | CNN + SINN
50 animal classes | USE [12] + DECAF [6]
50 animal classes | CNN + Logistics
50 animal classes | CNN + BINN
50 animal classes | CNN + SINN
85 attributes | CNN + Logistics
85 attributes | CNN + BINN
85 attributes | CNN + SINN

We tested our method on three large-scale benchmark image datasets: the Animals with Attributes dataset (AwA) [20], the NUS-WIDE dataset [1], and the SUN397 dataset [37]. Each dataset has different concept layers and label relation graphs. Experimental results show that (1) our method effectively boosts classification performance using the label relation graphs; (2) our SINN model consistently outperforms baseline classifiers and related methods in all experiments; and (3) particularly, the SINN model achieves significant performance gain with partial human labels.

Dataset and label relation generation. The AwA dataset contains an 85-attribute layer, a 50-animal-category layer and a 28-taxonomy-term layer. We extract the label relations from the WordNet taxonomy knowledge graph [8, 11, 12]. The NUS-WIDE dataset is composed of Flickr images with 81 object category labels, 698 image group labels from image metadata, and 1000 noisy tags collected from users. We parse WordNet to obtain label similarity, and threshold the soft similarity values into positive and negative correlation for the label graph. The SUN397 dataset has a typical hierarchical structure in label space, with 397 fine-grained scene categories on the bottom layer, 16 general scene categories on the middle layer, and 3 coarsest categories on the top layer. Here the label relations are also extracted from WordNet.
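The thresholding of soft similarities into a positive and a negative label graph can be sketched as follows. The similarity values and both thresholds are hypothetical; the text does not state which thresholds were used.

```python
import numpy as np

def relation_masks(similarity, pos_threshold, neg_threshold):
    """Split a label-similarity matrix into 0/1 positive and negative masks."""
    positive = (similarity >= pos_threshold).astype(float)
    negative = (similarity <= neg_threshold).astype(float)
    return positive, negative

# Hypothetical similarities between 2 coarse labels (rows) and 3 fine labels (columns).
similarity = np.array([[0.9, 0.1, 0.5],
                       [0.2, 0.8, 0.5]])
positive, negative = relation_masks(similarity, pos_threshold=0.7, neg_threshold=0.3)
```

Label pairs whose similarity falls between the two thresholds end up in neither mask, so they carry no relation in the label graph.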

Baseline. For each experiment, we compare our full method (CNN + SINN) with the baseline method: CNN + logistic regression. With further specifications, we may have extra baseline methods, such as CNN + BINN, CNN + logistic regression + extra tags, etc. We also compare our method with related state-of-the-art methods.

Evaluation metrics. We measure classification performance by mean average precision ($mAP$) in all comparisons. $mAP$ is a widely used metric for label-based retrieval and ranking. It measures the averaged performance over all label categories. In addition to $mAP$, we also adopt various metrics for special cases.

In the case of NUS-WIDE, the task is multi-label classification. We adopt the setting of [14] and report mAP per label ($mAP_L$) and mAP per image ($mAP_I$) for easy comparison. For comparison with related works [22, 7, 14] on NUS-WIDE, we also compute the per-image and per-label precisions and recalls. We abbreviate these metrics as $Prec_L$ for precision per label, $Prec_I$ for precision per image, $Rec_L$ for recall per label, and $Rec_I$ for recall per image.

For AwA and SUN397, we also compute the multi-class accuracy ($MC\ Acc$) and the intersection-over-union accuracy ($IoU\ Acc$). $MC\ Acc$ is a standard measurement for image classification problems. It averages per-class accuracies as the final result. $IoU\ Acc$ is a common prediction measurement for multi-label classification, based on the Hamming distance of predicted labels to ground-truth labels.
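As a small worked example of mean average precision per label, the sketch below ranks images by score for each label and averages the precision at every positive. The non-interpolated AP variant and the toy scores are assumptions; the text does not specify which AP definition was used.

```python
def average_precision(scores, labels):
    """AP for one label: mean precision at the rank of each positive image."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(score_columns, label_columns):
    """Average the per-label APs over all labels."""
    aps = [average_precision(s, l) for s, l in zip(score_columns, label_columns)]
    return sum(aps) / len(aps)

# Two labels scored on four images (made-up numbers).
scores = [[0.9, 0.8, 0.3, 0.1], [0.2, 0.7, 0.4, 0.6]]
labels = [[1, 0, 1, 0], [0, 1, 0, 1]]
map_per_label = mean_average_precision(scores, labels)
```

For the first label the positives sit at ranks 1 and 3, giving an AP of (1 + 2/3) / 2; the second label is ranked perfectly. Swapping the roles of images and labels gives the per-image variant.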

### 4.1 AwA: Layered Prediction with Label Relations

This experiment demonstrates the label prediction capability of our SINN model and the effectiveness of adding structured label relations for label prediction. We run each method five times with five random splits -- 60% for training and 40% for test. We report the average performance as well as the standard deviation of each performance measure.

Note that there is very little related work with layered label prediction on AwA. The most relevant one is work by Hwang and Sigal [12] on unified semantic embedding (USE). The comparison is not strictly fair, as the train/test splits are different. Further, we include our BINN model without specifying the label relation graphs (see Section 3.1.2) as a baseline method in this experiment, as it can verify the performance gain in our model from including structure. The results are in Table 1.

Results. Table 1 shows that our method outperforms the baseline methods (CNN + Logistics and CNN + BINN variants) as well as the USE method, in terms of each concept layer and each performance metric. It validates the efficacy of our proposed model for image classification. Note that for the results in Table 1, we did not fine-tune the first seven layers of CNN [18] for fairer comparison with Hwang and Sigal [12] (which only makes use of DECAF features [6]). Fine-tuning the first seven CNN layers further improves at each concept layer to (28 taxonomy terms), (50 animal classes), (85 attributes), and to (28 taxonomy terms), (50 animal classes), (85 attributes), respectively.

### 4.2 NUS-WIDE: Multi-label Classification with Partial Human Labels of Tags and Groups

Table 2: Methods compared on NUS-WIDE.

Method
---
Graphical Model [22]
CNN + WARP [7]
5k tags + Logistics [14]
Tag neighbors + 5k tags [14]
CNN + Logistics
1k tags + Logistics
1k tags + Groups + Logistics
1k tags + Groups + CNN + Logistics
1k tags + CNN + SINN
1k tags + Groups + CNN + SINN

This experiment shows our model’s capability to use noisy tags and structured tag-label relations to improve multi-label classification. The original NUS-WIDE dataset consists of 269,648 images collected from Flickr with 81 ground-truth concepts. As previous work used various evaluation metrics and experiment settings, and there are no fixed train/test splits, it is hard to make direct comparisons. Also note that a fraction of previously used images are unavailable now due to Flickr copyright.

In order to make our results as comparable as possible, we tried to set up the experiments according to previous work. We collected all available images and discarded images with missing labels, as previous work did [14, 7], and obtained 168,240 images of the original dataset. To make our results comparable with [14], we use 5 random splits with the same train/test ratio as [14] -- there are 132,575 training images and 35,665 test images in each split.
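The split protocol can be sketched as below: five independent shuffles of the 168,240 image indices, each cut into 132,575 training and 35,665 test images. The seeding scheme and the helper name `random_split` are assumptions for illustration.

```python
import random

def random_split(n_items, n_train, seed):
    """Shuffle item indices with a fixed seed and cut them into train/test."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]

N_IMAGES, N_TRAIN = 168_240, 132_575
splits = [random_split(N_IMAGES, N_TRAIN, seed) for seed in range(5)]
train_indices, test_indices = splits[0]
```

Using one seed per split keeps each split reproducible, so every method can be evaluated on the same five partitions.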

To compare our method with [22, 14], we also used the tags and metadata groups in our experiment. Unlike their settings, instead of augmenting images with 5000 tags, we used only 1000 tags, and augmented the images with 698 group labels obtained from image metadata to form a three-layer group-concept-tag graph. Instead of using the tags as sparse binary input features (as in [22, 14]), we convert them to observed labels and feed them to our model.

The baselines for comparison are as follows. As our usual baseline, we extract features from a CNN pretrained on ImageNet [25] and train a logistic classifier on top of it. In addition, we set up a group of baselines that make use of the groups and tags as binary indicator feature vectors for logistic regression. These baselines serve as the control group to evaluate the quality of the metadata we used in SINN. Next, we evaluated a stronger baseline that uses both the CNN output and the metadata vector with a logistic classifier. This method has a similar setting to that of the state-of-the-art method by Johnson et al. [14], differing in the visual feature (CNN on the image in our method versus CNN on the image neighborhood) and the tag feature (1k tag vector versus 5k tag vector).

We report our results on this dataset with two settings for our SINN. The first uses 1k tags as the only observations at the bottom level of the relation graph. This method provides a good comparison to the tag neighborhood + tag vector method [14], as we did not use extra information other than tags. In the second setting, we make both group and tag levels observable to our SINN, which achieves the best performance. We also compared our results with those of McAuley and Leskovec [22] and Gong et al. [7]. The results are summarized in Table 2. Note that we did not report our performance with fine-tuning the first seven layers of the CNN in this table, so as to make direct comparison of structured inference on SINN with our baseline method CNN + Logistics. Fine-tuned CNN with SINN improves to and to .

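Table 3: Methods compared at each concept layer on SUN397.

Concept Layer | Method
---|---
3 coarse scene categories | CNN + Logistics
3 coarse scene categories | CNN + BINN
3 coarse scene categories | CNN + SINN
16 general scene categories | CNN + Logistics
16 general scene categories | CNN + BINN
16 general scene categories | CNN + SINN
397 fine-grained scene categories | Image features + SVM [37, 36]
397 fine-grained scene categories | CNN + Logistics
397 fine-grained scene categories | CNN + BINN
397 fine-grained scene categories | CNN + SINN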

Table 4: Methods compared on the 397 fine-grained scene categories of SUN397, with and without partial labels.

Method
---
Image features + SVM [37, 36]
CNN + Logistics
CNN + BINN
CNN + SINN
CNN + Logistics + Partial Labels
CNN + SINN + Partial Labels

Results. Table 2 shows that our proposed method outperforms all baseline methods and existing approaches (e.g., [14, 7, 22]) by a large margin. Note that the results are not directly comparable due to different settings in train/test splits. However, the results show that, by modeling label relations between tags, groups and concepts, our model achieves dramatic improvement on visual prediction.

We visualize some results in Figure 5 showing exemplars on which our method improves over baseline predictions.

### 4.3 SUN397: Improving Scene Recognition with and without Partially Observed Labels

We conducted two experiments on the SUN397 dataset. The first experiment is similar to the study on AwA: we applied our model to layered image classification with label relations, and compared our model with the CNN + Logistics and CNN + BINN baselines, as well as a state-of-the-art approach [37, 36]. For fair comparison, we used the same train/test split ratio as [37, 36], where we have 50 training and 50 test images in each of the 397 scene categories. To mitigate the randomness in sampling, we also repeat the experiment 5 times and report the average performance as well as the standard deviations. The results are summarized in Table 3, showing that our proposed method again achieves a considerable performance gain over all the compared methods.

In the second experiment, we considered partially observed labels from the top (coarsest) scene layer as input to our inference framework. In other words, we assume we know whether an image is indoor, outdoor man-made, or outdoor natural. We compare the 397 fine-grained scene recognition performance in Table 4. We compare to a set of baselines, including CNN + Logistics + Partial Labels, which treats the partial labels as an extra binary indicator feature vector for logistic regression. Results show that our method combined with partial labels (i.e., CNN + SINN + Partial Labels) improves over baselines, exceeding the second best by 4% and 6% on the two metrics reported in Table 4.
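The CNN + Logistics + Partial Labels baseline can be pictured as a simple feature concatenation, sketched below. The helper name, the zero-valued placeholder feature, and the ordering of the three coarse categories are assumptions for illustration.

```python
import numpy as np

def append_partial_labels(feature, observed_index, n_labels):
    """Concatenate a one-hot indicator of the observed label to the feature."""
    indicator = np.zeros(n_labels)
    indicator[observed_index] = 1.0
    return np.concatenate([feature, indicator])

COARSE = ["indoor", "outdoor man-made", "outdoor natural"]
feature = np.zeros(4096)                       # placeholder CNN feature
augmented = append_partial_labels(feature, COARSE.index("indoor"), len(COARSE))
```

The logistic classifier then sees a 4099-dimensional input. SINN instead converts the same observation into activations on the top concept layer and propagates them through the label graph.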

## 5 Conclusion

We have presented a structured inference neural network (SINN) for layered label prediction. Our model makes use of label relation graphs and concept layers to augment inference of semantic image labels. Beyond this, our model can be flexibly extended to consider partially observed human labels. We borrow the idea of RNNs to implement our SINN model, and combine it organically with an underlying CNN visual output. Experiments on three benchmark image datasets show the effectiveness of the proposed method in standard image classification tasks. Moreover, we also demonstrate empirically that label prediction is further improved once partially observed human labels are fed into the SINN.

## Acknowledgements

This work was supported by grants from NSERC and Nokia.

## References

- [1] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng. Nus-wide: A real-world web image database from national university of singapore. In CIVR, 2009.
- [2] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam. Large-scale object classification using label relation graphs. In ECCV, 2014.
- [3] J. Deng, S. Satheesh, A. C. Berg, and F. Li. Fast and balanced: Efficient label tree learning for large scale object recognition. In NIPS, 2011.
- [4] Z. Deng, A. Vahdat, H. Hu, and G. Mori. Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition. 2016.
- [5] N. Ding, J. Deng, K. Murphy, and H. Neven. Probabilistic label relation graphs with ising models. ICCV, 2015.
- [6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. ICML, 2013.
- [7] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe. Deep convolutional ranking for multilabel image annotation. ICLR, 2014.
- [8] K. Grauman, F. Sha, and S. J. Hwang. Learning a tree of metrics with disjoint visual features. In NIPS, 2011.
- [9] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS, 2008.
- [10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation (NC), 9(8):1735--1780, 1997.
- [11] S. J. Hwang, K. Grauman, and F. Sha. Semantic kernel forests from multiple taxonomies. In NIPS, 2012.
- [12] S. J. Hwang and L. Sigal. A unified semantic embedding: Relating taxonomies and attributes. In NIPS, 2014.
- [13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
- [14] J. Johnson, L. Ballan, and F.-F. Li. Love thy neighbors: Image annotation by exploiting image metadata. ICCV, 2015.
- [15] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR, 2015.
- [16] P. Kohli, J. Rihan, M. Bray, and P. H. Torr. Simultaneous segmentation and pose estimation of humans using dynamic graph cuts. International Journal of Computer Vision (IJCV), 79(3):285--298, 2008.
- [17] C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler. What are you talking about? text-to-image coreference. In CVPR, 2014.
- [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- [19] M. Kumar, P. Ton, and A. Zisserman. Obj cut. In CVPR, 2005.
- [20] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(3):453--465, 2014.
- [21] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In ECCV Workshop, 2004.
- [22] J. McAuley and J. Leskovec. Image labeling on a network: using social-network metadata for image classification. In ECCV, 2012.
- [23] G. A. Miller. Wordnet: a lexical database for english. Communications of the ACM (CACM), 38(11):39--41, 1995.
- [24] V. Ordonez, J. Deng, Y. Choi, A. C. Berg, and T. Berg. From large scale image categorization to entry-level categories. In ICCV, 2013.
- [25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), pages 1--42, 2015.
- [26] H. Sak, A. W. Senior, and F. Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Interspeech, 2014.
- [27] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing (TSP), 45(11):2673--2681, 1997.
- [28] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv preprint arXiv:1503.02351, 2015.
- [29] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. ICLR, 2014.
- [30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- [32] B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In NIPS, 2003.
- [33] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
- [34] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6:1453--1484, 2005.
- [35] B. Wu and R. Nevatia. Simultaneous object detection and segmentation by boosting local shape feature based classifier. In CVPR, 2007.
- [36] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva. Sun database: Exploring a large collection of scene categories. International Journal of Computer Vision (IJCV), pages 1--20, 2014.
- [37] J. Xiao, J. Hays, K. Ehinger, A. Oliva, A. Torralba, et al. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
- [38] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.