Logic Tensor Networks for
Semantic Image Interpretation
Abstract
Semantic Image Interpretation (SII) is the task of extracting structured semantic descriptions from images. It is widely agreed that the combined use of visual data and background knowledge is of great importance for SII. Recently, Statistical Relational Learning (SRL) approaches have been developed for reasoning under uncertainty and learning in the presence of data and rich knowledge. Logic Tensor Networks (LTNs) are an SRL framework which integrates neural networks with first-order fuzzy logic to allow (i) efficient learning from noisy data in the presence of logical constraints, and (ii) reasoning with logical formulas describing general properties of the data. In this paper, we develop and apply LTNs to two of the main tasks of SII, namely, the classification of an image’s bounding boxes and the detection of the relevant part-of relations between objects. To the best of our knowledge, this is the first successful application of SRL to such SII tasks. The proposed approach is evaluated on a standard image processing benchmark. Experiments show that the use of background knowledge in the form of logical constraints can improve the performance of purely data-driven approaches, including the state-of-the-art Fast Region-based Convolutional Neural Networks (Fast R-CNN). Moreover, we show that the use of logical background knowledge adds robustness to the learning system when errors are present in the labels of the training data.
1 Introduction
Semantic Image Interpretation (SII) is the task of generating a structured semantic description of the content of an image. This structured description can be represented as a labelled directed graph, where each vertex corresponds to a bounding box of an object in the image, and each edge represents a relation between a pair of objects; vertices are labelled with a set of object types and edges are labelled by the binary relations. Such a graph is also called a scene graph in [15].
A major obstacle to be overcome by SII is the so-called semantic gap [19], that is, the lack of a direct correspondence between low-level features of the image and high-level semantic descriptions. To tackle this problem, a system for SII must learn the latent correlations that may exist between the numerical features that can be observed in an image and the semantic concepts associated with the objects. It is in this learning process that the availability of relational background knowledge can be of great help. Thus, recent SII systems have sought to combine, or even integrate, visual features obtained from data and symbolic knowledge in the form of logical axioms [30, 4, 8].
The area of Statistical Relational Learning (SRL), or Statistical Artificial Intelligence (StarAI), seeks to combine data-driven learning, in the presence of uncertainty, with symbolic knowledge [29, 2, 13, 7, 26, 23]. However, only very few SRL systems have been applied to SII tasks (cf. Section 2) due to the high complexity associated with image learning. Most systems for solving SII tasks have been based, instead, on deep learning and neural network models. These, on the other hand, do not in general offer a well-founded way of learning from data in the presence of relational logical constraints, requiring the neural models to be highly engineered from scratch.
In this paper, we develop and apply, for the first time, the SRL framework called Logic Tensor Networks (LTNs) to computationally challenging SII tasks. LTNs combine learning in deep networks with relational logical constraints [27]. They use a First-order Logic (FOL) syntax interpreted in the real numbers, which is implemented as a deep tensor network. Logical terms are interpreted as feature vectors in a real-valued $n$-dimensional space. Function symbols are interpreted as real-valued functions, and predicate symbols as fuzzy logic relations. This syntax and semantics, called real semantics, allow LTNs to learn efficiently in hybrid domains, where elements are composed of both numerical and relational information.
We argue, therefore, that LTNs are a good candidate for learning SII because they can express relational knowledge in FOL, which serves as constraints on the data-driven learning within tensor networks. Since LTN is a logic, it provides a notion of logical consequence, which forms the basis for learning within LTNs, defined as best satisfiability (cf. Section 4). Solving the best satisfiability problem amounts to finding the latent correlations that may exist between relational background knowledge and numerical data attributes. This formulation enables the specification of learning as reasoning, a unique characteristic of LTNs, which is seen as highly relevant for SII.
This paper specifies SII within LTNs, evaluating it on two important tasks: (i) the classification of bounding boxes, and (ii) the detection of the part-of relation between any two bounding boxes. Both tasks are evaluated using the PASCAL-Part dataset [5]. It is shown that LTNs improve the performance of the state-of-the-art object classifier Fast R-CNN [11] on the bounding box classification task. LTNs also outperform a rule-based heuristic (which uses the inclusion ratio of two bounding boxes) in the detection of part-of relations between objects. Finally, LTNs are evaluated on their ability to handle errors, specifically misclassifications of objects and part-of relations. Very large visual recognition datasets now exist which are noisy [24], and it is important for learning systems to become robust to noise. LTNs were trained systematically on progressively noisier datasets, with results on both SII tasks showing that LTNs’ logical constraints are capable of adding robustness to the system, in the presence of errors in the labels of the training data.
The paper is organized as follows: Section 2 contrasts the LTN approach with related work that integrates visual features and background knowledge for SII. Section 3 specifies LTNs in the context of SII. Section 4 defines the best satisfiability problem in this context, which enables the use of LTNs for SII. Section 5 describes in detail the comparative evaluations of LTNs on the SII tasks. Section 6 concludes the paper and discusses directions for future work.
2 Related Work
The idea of exploiting logical background knowledge to improve SII tasks dates back to the early days of AI. In what follows, we review the most recent results in the area in comparison with LTNs.
Logic-based approaches have used Description Logics (DL), where the basic components of the scene are all assumed to have been already discovered (e.g. simple object types or spatial relations). Then, with logical reasoning, new facts can be derived in the scene from these basic components [19, 21]. Other logic-based approaches have used fuzzy DL to tackle uncertainty in the basic components [14, 6, 1]. These approaches have limited themselves to spatial relations or to refining the labels of the objects detected. In [8], the scene interpretation is created by combining image features with constraints defined using DL, but the method is tailored to the part-of relation and cannot be extended easily to account for other relations. LTNs, on the other hand, should be able to handle any semantic relation. In [18, 10], a symbolic knowledge base is used to improve object detection, but only the subsumption relation is explored and it is not possible to inject more complex knowledge using logical axioms.
A second group of approaches seeks to encode background knowledge and visual features within probabilistic graphical models. In [30, 20], visual features are combined with knowledge gathered from datasets, web resources or annotators, about object labels, properties such as shape, colour and size, and affordances, using Markov Logic Networks (MLNs) [25] to predict facts in unseen images. Due to the specific knowledge-base schema adopted, the effectiveness of MLNs in this domain is evaluated only for Horn clauses, although the language of MLNs is more general. As a result, it is not easy to evaluate how the approach may perform with more complex axioms. In [2], a probabilistic fuzzy logic is used, but not with real semantics. Clauses are weighted and universally quantified formulas are instantiated, as done by MLNs. This is different from LTNs where the universally quantified formulas are computed by using an aggregation operation, which avoids the need for instantiating all variables.
In other related work, [4, 16] encode background knowledge into a generic Conditional Random Field (CRF), where the nodes represent detected objects and the edges represent logical relationships between objects. The task is to find a correct labelling for this graph. In [4], the edges encode logical constraints on a knowledge base specified in DL. Although these ideas are close in spirit to the approach presented in this paper, they are not formalised as in LTNs, which use a deep tensor network and first-order logic, rather than CRFs or DL. In general, the logical theory behind the functions to be defined in the CRF is unclear. In [16], potential functions are defined as text priors such as co-occurrence of terms found in the image descriptions of Flickr.
In a final group of approaches, here called language-priors, background knowledge is taken from linguistic models [22, 17]. In [22], a neural network is built integrating visual features and a linguistic model to predict semantic relationships between bounding boxes. The linguistic model is a set of rules derived from WordNet [9], stating which types of semantic relationships occur between a subject and an object. In [17], a similar neural network is proposed for the same task but with a more sophisticated language model, embedding in the same vector space triples of the form subject-relation-object, such that semantically similar triples are mapped closely together in the embedding space. In this way, even if no examples exist of some triples in the data, the relations can be inferred from similarity to more frequent triples. A drawback, however, is the possibility of inferring inconsistent triples, such as man-eats-chair, due to the embedding. LTNs avoid this problem with a logic-based approach (in the above example, with an axiom to the effect that chairs are not normally edible). LTNs can also handle exceptions, offering a system capable of dealing with crisp axioms and real-valued data, as specified in what follows.
3 Logic Tensor Networks
Let $\mathcal{L}$ be a first-order logic language, whose signature is composed of three disjoint sets $\mathcal{C}$, $\mathcal{F}$ and $\mathcal{P}$, denoting constants, functions and predicate symbols, respectively. For any function or predicate symbol $s$, let $\alpha(s)$ denote its arity. Logical formulas in $\mathcal{L}$ allow one to specify relational knowledge, e.g. the atomic formula $\mathsf{partOf}(o_1, o_2)$, stating that object $o_1$ is a part of object $o_2$, the formula $\forall x y\,(\mathsf{partOf}(x,y) \rightarrow \neg\mathsf{partOf}(y,x))$, stating that the relation $\mathsf{partOf}$ is asymmetric, or $\forall x\,(\mathsf{Cat}(x) \rightarrow \exists y\,(\mathsf{partOf}(y,x) \wedge \mathsf{Tail}(y)))$, stating that every cat should have a tail. In addition, exceptions are handled by allowing formulas to be interpreted in fuzzy logic, such that in the presence of an example of, say, a tailless cat, the above formula can be interpreted naturally as "normally, every cat has a tail"; this will be exemplified later.
Semantics of $\mathcal{L}$: We define the interpretation domain as a subset of $\mathbb{R}^n$, i.e. every object in the domain is associated with an $n$-dimensional vector of real numbers. Intuitively, this tuple represents numerical features of an object, e.g. in the case of a person, their name in ASCII, height, weight, social security number, etc. Functions are interpreted as real-valued functions, and predicates are interpreted as fuzzy relations on real vectors. To emphasise the fact that we interpret symbols as real numbers, we use the term grounding instead of interpretation^1 in the following definition of semantics.
^1 In logic, the term grounding indicates the operation of replacing the variables of a term or formula with constants or terms that do not contain other variables. To avoid any confusion, we use the synonym instantiation for this purpose. It is worth noting that in LTN, differently from MLNs, the instantiation of every first-order formula is not required.
Definition 1
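Let $n > 0$. An $n$-grounding, or simply grounding, $\mathcal{G}$ for a FOL $\mathcal{L}$ is a function defined on the signature of $\mathcal{L}$ satisfying the following conditions:
1. $\mathcal{G}(c) \in \mathbb{R}^n$ for every constant symbol $c \in \mathcal{C}$;
2. $\mathcal{G}(f): \mathbb{R}^{n \cdot \alpha(f)} \rightarrow \mathbb{R}^n$ for every $f \in \mathcal{F}$;
3. $\mathcal{G}(P): \mathbb{R}^{n \cdot \alpha(P)} \rightarrow [0,1]$ for every $P \in \mathcal{P}$.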
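Given a grounding $\mathcal{G}$, the semantics of closed terms and atomic formulas is defined as follows:
\[ \mathcal{G}(f(t_1,\dots,t_m)) = \mathcal{G}(f)(\mathcal{G}(t_1),\dots,\mathcal{G}(t_m)) \]
\[ \mathcal{G}(P(t_1,\dots,t_m)) = \mathcal{G}(P)(\mathcal{G}(t_1),\dots,\mathcal{G}(t_m)) \]
The semantics for connectives is defined according to fuzzy logic, using for instance the Lukasiewicz t-norm^2:
\[ \mathcal{G}(\neg\phi) = 1 - \mathcal{G}(\phi) \]
\[ \mathcal{G}(\phi \wedge \psi) = \max(0, \mathcal{G}(\phi) + \mathcal{G}(\psi) - 1) \]
\[ \mathcal{G}(\phi \vee \psi) = \min(1, \mathcal{G}(\phi) + \mathcal{G}(\psi)) \]
\[ \mathcal{G}(\phi \rightarrow \psi) = \min(1, 1 - \mathcal{G}(\phi) + \mathcal{G}(\psi)) \]
^2 Examples of t-norms include Lukasiewicz, product and Gödel. The Lukasiewicz t-norm is $T(x,y) = \max(0, x + y - 1)$, the product t-norm is $T(x,y) = x \cdot y$, and the Gödel t-norm is $T(x,y) = \min(x, y)$. See [3] for details.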
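As a minimal sketch of the fuzzy connectives under the Lukasiewicz t-norm, the following Python functions operate on truth degrees in [0, 1]. The function names and the example truth degrees are illustrative assumptions and are not taken from the LTN implementation.

```python
def l_not(a):
    """Negation: 1 - a."""
    return 1.0 - a


def l_and(a, b):
    """Lukasiewicz t-norm: max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)


def l_or(a, b):
    """Lukasiewicz t-conorm: min(1, a + b)."""
    return min(1.0, a + b)


def l_implies(a, b):
    """Residuum of the Lukasiewicz t-norm: min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)


# Hypothetical truth degrees of Cat(b) and Dog(b) for one bounding box b.
cat_b, dog_b = 0.9, 0.3

# Truth degree of the clause Cat(b) -> not Dog(b).
rule = l_implies(cat_b, l_not(dog_b))
```

A clause is only fully true (degree 1) when the consequent is at least as true as the antecedent; otherwise the degree drops linearly with the gap between the two.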
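The LTN semantics for $\forall$ is defined in [27] using the $\min$ operator, that is, $\mathcal{G}(\forall x\,\phi(x)) = \min_{t \in term(\mathcal{L})} \mathcal{G}(\phi(t))$, where $term(\mathcal{L})$ is the set of instantiated terms of $\mathcal{L}$. This, however, is inadequate for our purposes as it does not tolerate exceptions well (the presence of a single exception to a universally quantified formula, such as a cat without a tail, would falsify the formula). Instead, our intention in SII is that the more examples there are that satisfy a formula $\phi(x)$, the higher the truth-value of $\forall x\,\phi(x)$ should be. To capture this, we use for the semantics of $\forall$ a mean operator, as follows:
\[ \mathcal{G}(\forall x\,\phi(x)) = \lim_{|T| \rightarrow |term(\mathcal{L})|} mean_p\bigl(\mathcal{G}(\phi(t)) \mid t \in T\bigr) \]
where $mean_p(x_1,\dots,x_d) = \bigl(\frac{1}{d}\sum_{i=1}^{d} x_i^p\bigr)^{1/p}$ for $p \in \mathbb{Z}$.^3
^3 The popular mean operators, arithmetic, geometric and harmonic mean, are obtained by setting $p = 1$, $p \rightarrow 0$ (as a limit) and $p = -1$, respectively.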
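To illustrate why a mean-based aggregation tolerates exceptions better than `min`, here is a small Python sketch of the generalized (power) mean. The function name `mean_p`, the handling of zero truth values, and the example truth degrees are assumptions made for illustration.

```python
import math


def mean_p(values, p):
    """Generalized mean of truth degrees; p = 0 is read as the geometric-mean limit."""
    d = len(values)
    if p <= 0 and any(v == 0 for v in values):
        # Geometric and harmonic-style means collapse to 0 if any value is 0.
        return 0.0
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / d)
    return (sum(v ** p for v in values) / d) ** (1.0 / p)


# Hypothetical truth degrees of Cat(t) -> hasTail(t) for ten cats,
# nine of which clearly have a tail and one of which is an exception.
truth = [1.0] * 9 + [0.1]

by_min = min(truth)               # the exception dominates
arithmetic = mean_p(truth, 1)     # p = 1
geometric = mean_p(truth, 0)      # limit p -> 0
harmonic = mean_p(truth, -1)      # p = -1
```

With these values the minimum is 0.1, while the arithmetic mean stays at 0.91; the harmonic mean (10/19) is the strictest of the three means but still well above the minimum.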
Finally, the classical semantics of $\exists$ is uniquely determined by the semantics of $\forall$, by making $\exists$ equivalent to $\neg\forall\neg$. This approach, however, has a drawback too when it comes to SII: if we adopt, for instance, the arithmetic mean for the semantics of $\forall$ then $\mathcal{G}(\forall x\,\phi(x)) = \mathcal{G}(\exists x\,\phi(x))$. Therefore, we shall interpret existential quantification via Skolemization: every formula of the form $\forall x_1,\dots,x_n(\dots \exists y\,\phi(x_1,\dots,x_n,y))$ is rewritten as $\forall x_1,\dots,x_n(\dots \phi(x_1,\dots,x_n,f(x_1,\dots,x_n)))$, by introducing a new $n$-ary function symbol $f$, called Skolem function. In this way, existential quantifiers can be eliminated from the language by introducing Skolem functions.
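The collapse of the two quantifiers under the arithmetic mean can be checked in one line. The derivation below assumes a finite set of $d$ instantiated terms $t_1,\dots,t_d$ and the negation $\mathcal{G}(\neg\phi) = 1 - \mathcal{G}(\phi)$; the notation is ours.

```latex
\begin{align*}
\mathcal{G}(\exists x\,\phi(x))
  &= \mathcal{G}(\neg\forall x\,\neg\phi(x))
   = 1 - \frac{1}{d}\sum_{i=1}^{d}\bigl(1 - \mathcal{G}(\phi(t_i))\bigr) \\
  &= \frac{1}{d}\sum_{i=1}^{d}\mathcal{G}(\phi(t_i))
   = \mathcal{G}(\forall x\,\phi(x)).
\end{align*}
```

Since "some" and "all" would then receive the same truth degree, the existential quantifier carries no extra information, which motivates the Skolemization step.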
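Formalizing SII in LTNs: To specify the SII problem, as defined in the introduction, we consider a signature $\Sigma_{SII} = \langle \mathcal{C}, \mathcal{F}, \mathcal{P} \rangle$, where $\mathcal{C} = \bigcup_{p \in Pics} b(p)$ is the set of identifiers for all the bounding boxes in all the images, $\mathcal{F} = \emptyset$, and $\mathcal{P} = \mathcal{P}_1 \cup \mathcal{P}_2$, where $\mathcal{P}_1$ is a set of unary predicates, one for each object type, e.g. $\mathsf{Cat}$, $\mathsf{Dog}$, $\mathsf{Tail}$, and $\mathcal{P}_2$ is a set of binary predicates representing relations between objects. Since in our experiments we focus on the part-of relation, $\mathcal{P}_2 = \{\mathsf{partOf}\}$. The FOL formulas based on this signature can specify (i) simple facts, e.g. the fact that bounding box $b$ contains a cat, written $\mathsf{Cat}(b)$, the fact that $b$ contains either a cat or a dog, written $\mathsf{Cat}(b) \vee \mathsf{Dog}(b)$, etc., and (ii) general rules such as $\forall x\,(\mathsf{Cat}(x) \rightarrow \exists y\,(\mathsf{partOf}(y,x) \wedge \mathsf{Tail}(y)))$.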
A grounding for $\Sigma_{SII}$ can be defined as follows: each constant $b$, denoting a bounding box, can be associated with a set of geometric features and a set of semantic features obtained from the output of a bounding box detector. Specifically, each bounding box is associated with geometric features describing the position and the dimension of the bounding box, and semantic features describing the classification score returned by the bounding box detector for each class. For example, for each bounding box $b \in \mathcal{C}$ and classes $C_i \in \mathcal{P}_1$, $\mathcal{G}(b)$ is the vector:
\[ \mathcal{G}(b) = \langle class(C_1, b), \dots, class(C_{|\mathcal{P}_1|}, b), x_0(b), y_0(b), x_1(b), y_1(b) \rangle \]
where the last four elements are the coordinates of the top-left and bottom-right corners of $b$, and $class(C_i, b) \in [0,1]$ is the classification score of the bounding box detector for $b$.
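The grounding of a bounding box described above amounts to concatenating the detector scores with the box corners. A minimal Python sketch follows; the function name `ground_box`, the class list, the assumption that scores lie in [0, 1], and the example numbers are all hypothetical.

```python
def ground_box(class_scores, corners):
    """Build the feature vector of a box: one detector score per class, then x0, y0, x1, y1."""
    x0, y0, x1, y1 = corners
    if x0 > x1 or y0 > y1:
        raise ValueError("corners must be given as top-left, then bottom-right")
    if not all(0.0 <= s <= 1.0 for s in class_scores):
        raise ValueError("classification scores are assumed to lie in [0, 1]")
    scores = [float(s) for s in class_scores]
    return scores + [float(x0), float(y0), float(x1), float(y1)]


# Hypothetical class list and detector output for one bounding box.
classes = ["Cat", "Dog", "Tail"]
g_b = ground_box([0.80, 0.15, 0.05], (10, 20, 110, 220))
```

The resulting vector has one entry per class plus four geometric entries, so every constant of the signature is mapped to a point in the same real-valued space.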
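An example of groundings for predicates can be defined by taking a one-vs-all multi-classifier approach, as follows. First, define the following grounding for each class $C_i \in \mathcal{P}_1$ (below, $\mathbf{x} = \langle x_1, \dots, x_{|\mathcal{P}_1|+4} \rangle$ is the vector corresponding to the grounding of a bounding box):
\[ \mathcal{G}(C_i)(\mathbf{x}) = \begin{cases} 1 & \text{if } i = \operatorname{argmax}_{1 \leq l \leq |\mathcal{P}_1|} x_l \\ 0 & \text{otherwise} \end{cases} \qquad (1) \]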
Then, a simple rule-based approach for defining a grounding for the $\mathsf{partOf}$ relation is based on the naïve assumption that the more a bounding box $b$ is contained within a bounding box $b'$, the higher the probability should be that $b$ is part of $b'$. Accordingly, one can define $\mathcal{G}(\mathsf{partOf}(b,b'))$ as the inclusion ratio $ir(b,b')$ of bounding box $b$, with grounding $\mathbf{x}$, into bounding box $b'$, with grounding $\mathbf{x}'$ (formally, $ir(b,b') = \frac{area(b \cap b')}{area(b)}$). A slightly more sophisticated rule-based grounding for $\mathsf{partOf}$ (used as baseline in the experiments to follow) takes into account also type compatibilities by multiplying the inclusion ratio by a factor $w_{ij}$. Hence, we define $\mathcal{G}(\mathsf{partOf}(b,b'))$ as follows:
\[ \mathcal{G}(\mathsf{partOf}(b,b')) = \begin{cases} 1 & \text{if } ir(b,b') \cdot \max_{i,j=1}^{|\mathcal{P}_1|} (w_{ij} \cdot x_i \cdot x'_j) \geq th_{ir} \\ 0 & \text{otherwise} \end{cases} \qquad (2) \]
for some threshold $th_{ir}$, and with $w_{ij} = 1$ if $C_i$ is a part of $C_j$, and $0$ otherwise. Given the above grounding, we can compute the grounding of any atomic formula, e.g. $\mathsf{Cat}(b_1)$, $\mathsf{Dog}(b_2)$, $\mathsf{Tail}(b_3)$, $\mathsf{partOf}(b_3, b_1)$, $\mathsf{partOf}(b_3, b_2)$, thus expressing the degree of truth of the formula. The rule-based groundings (Eqs. (1) and (2)) may not satisfy some of the constraints to be imposed. For example, the classification score may be wrong, a bounding box may include another which is not in the part-of relation, etc. Furthermore, in many situations, it is not possible to define a grounding a priori. Instead, groundings may need to be learned automatically from examples, by optimizing the truth-values of the formulas in the background knowledge. This is discussed next.
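The rule-based part-of baseline described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: boxes are `(x0, y0, x1, y1)` tuples, feature vectors hold the class scores followed by the four coordinates, the compatibility matrix `w`, the class order and the threshold value 0.5 are hypothetical, and the thresholded product is our reading of the baseline rather than the authors' code.

```python
def area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); empty boxes have area 0."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def inclusion_ratio(b, b2):
    """area(b intersect b2) / area(b): the fraction of b lying inside b2."""
    inter = (max(b[0], b2[0]), max(b[1], b2[1]), min(b[2], b2[2]), min(b[3], b2[3]))
    a = area(b)
    return area(inter) / a if a > 0 else 0.0


def part_of_baseline(x, x2, w, th_ir):
    """Return 1.0 if inclusion ratio times best type compatibility reaches th_ir, else 0.0."""
    n_cls = len(x) - 4
    ir = inclusion_ratio(x[-4:], x2[-4:])
    compat = max(w[i][j] * x[i] * x2[j] for i in range(n_cls) for j in range(n_cls))
    return 1.0 if ir * compat >= th_ir else 0.0


# Hypothetical setup with classes [Cat, Tail]; w[1][0] = 1 encodes "Tail is a part of Cat".
w = [[0, 0], [1, 0]]
tail = [0.10, 0.90, 50, 50, 70, 70]
cat = [0.95, 0.05, 0, 0, 100, 100]

tail_in_cat = part_of_baseline(tail, cat, w, th_ir=0.5)
cat_in_tail = part_of_baseline(cat, tail, w, th_ir=0.5)
```

Here the tail box lies entirely inside the cat box and the types are compatible, so the first call returns 1.0; in the reverse direction both the inclusion ratio and the type compatibility are small, so the second call returns 0.0.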
4 Learning as Best Satisfiability
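A partial grounding, denoted by $\hat{\mathcal{G}}$, is a grounding that is defined on a subset of the signature of $\mathcal{L}$. A grounding $\mathcal{G}$ is said to be a completion of $\hat{\mathcal{G}}$, if $\mathcal{G}$ is a grounding for $\mathcal{L}$ and coincides with $\hat{\mathcal{G}}$ on the symbols where $\hat{\mathcal{G}}$ is defined.
Definition 2
A grounded theory GT is a pair $\langle \mathcal{K}, \hat{\mathcal{G}} \rangle$ with a set $\mathcal{K}$ of closed formulas and a partial grounding $\hat{\mathcal{G}}$.
Definition 3
A grounding $\mathcal{G}$ satisfies a GT $\langle \mathcal{K}, \hat{\mathcal{G}} \rangle$ if $\mathcal{G}$ completes $\hat{\mathcal{G}}$ and $\mathcal{G}(\phi) = 1$ for all $\phi \in \mathcal{K}$. A GT $\langle \mathcal{K}, \hat{\mathcal{G}} \rangle$ is satisfiable if there exists a grounding $\mathcal{G}$ that satisfies $\langle \mathcal{K}, \hat{\mathcal{G}} \rangle$.
According to the previous definition, deciding the satisfiability of $\langle \mathcal{K}, \hat{\mathcal{G}} \rangle$ amounts to searching for a grounding $\mathcal{G}$ such that all the formulas of $\mathcal{K}$ are mapped to 1. Unlike classical satisfiability, when a GT is not satisfiable, we are interested in the best possible satisfaction that we can reach with a grounding. This is defined as follows.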
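Definition 4
Let $\langle \mathcal{K}, \hat{\mathcal{G}} \rangle$ be a grounded theory. We define the best satisfiability problem as the problem of finding a grounding $\mathcal{G}^*$ that maximizes the truth-value of the conjunction of all clauses $cl \in \mathcal{K}$, i.e.
\[ \mathcal{G}^* = \operatorname{argmax}_{\hat{\mathcal{G}} \subseteq \mathcal{G}} \; \mathcal{G}\Bigl(\bigwedge_{cl \in \mathcal{K}} cl\Bigr) \]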
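To make the idea of learning as best satisfiability concrete, the toy Python sketch below searches for the parameters of a one-dimensional predicate grounding that maximize the truth value of a small set of literals. Everything here is an assumption made for illustration: the predicate is a plain sigmoid of a single feature, the search is an exhaustive grid search rather than the gradient-based training used in the experiments, and the data points are invented.

```python
import math
from itertools import product


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def luk_and(values):
    """Lukasiewicz conjunction folded over a list of truth values."""
    total = 1.0
    for v in values:
        total = max(0.0, total + v - 1.0)
    return total


def satisfiability(w, c, positives, negatives):
    """Truth value of the conjunction of Cat(x) for positives and not Cat(x) for negatives."""
    def cat(x):
        return sigmoid(w * x + c)

    clauses = [cat(x) for x in positives] + [1.0 - cat(x) for x in negatives]
    return luk_and(clauses)


# Hypothetical 1-D features of boxes labelled Cat (positives) and not Cat (negatives).
positives, negatives = [2.0, 3.0], [-2.0]

# Candidate groundings: every (w, c) pair on a small grid.
grid = product([-2.0, -1.0, 0.0, 1.0, 2.0], [-1.0, 0.0, 1.0])
best_w, best_c = max(grid, key=lambda wc: satisfiability(wc[0], wc[1], positives, negatives))
best_sat = satisfiability(best_w, best_c, positives, negatives)
```

On this grid the search selects the steepest positive slope with zero offset. The knowledge base is not perfectly satisfied (the best truth value stays below 1), which is exactly the situation the best satisfiability problem is designed for.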
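Grounding $\mathcal{G}^*$ captures the latent correlation between the quantitative attributes of objects and their categorical and relational properties. Not all functions are suitable as a grounding; they should preserve some form of regularity. If $\mathcal{G}(\mathsf{Cat})(\mathbf{x}) \approx 1$ (the bounding box with feature vector $\mathbf{x}$ contains a cat) then for every $\mathbf{x}'$ close to $\mathbf{x}$ (i.e. for every bounding box with features similar to $\mathbf{x}$), one should have $\mathcal{G}(\mathsf{Cat})(\mathbf{x}') \approx 1$. In particular, we consider groundings of the following form:
Function symbols are grounded to linear transformations. If $f$ is an $m$-ary function symbol, then $\mathcal{G}(f)$ is of the form:
\[ \mathcal{G}(f)(\mathbf{v}) = M_f \mathbf{v} + N_f \]
where $\mathbf{v} = \langle \mathbf{v}_1^\top, \dots, \mathbf{v}_m^\top \rangle^\top$ is the $mn$-ary vector obtained by concatenating each $\mathbf{v}_i$. The parameters for $\mathcal{G}(f)$ are the $n \times mn$ real matrix $M_f$ and the $n$-vector $N_f$.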
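The grounding of an $m$-ary predicate $P$, namely $\mathcal{G}(P)$, is defined as a generalization of the neural tensor network (which has been shown effective at knowledge completion in the presence of simple logical constraints [28]), as a function from $\mathbb{R}^{mn}$ to $[0,1]$, as follows:
\[ \mathcal{G}(P)(\mathbf{v}) = \sigma\Bigl(u_P^\top \tanh\bigl(\mathbf{v}^\top W_P^{[1:k]} \mathbf{v} + V_P \mathbf{v} + b_P\bigr)\Bigr) \qquad (3) \]
with $\sigma$ the sigmoid function. The parameters for $P$ are: $W_P^{[1:k]}$, a 3-D tensor in $\mathbb{R}^{k \times mn \times mn}$, $V_P \in \mathbb{R}^{k \times mn}$, $b_P \in \mathbb{R}^k$ and $u_P \in \mathbb{R}^k$. This last parameter performs a linear combination of the quadratic features given by the tensor product. With this encoding, the grounding (i.e. truth-value) of a clause can be determined by a neural network which first computes the grounding of the literals contained in the clause, and then combines them using the specific t-norm.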
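The predicate grounding can be sketched in plain Python, assuming the neural tensor network form of [28]: a sigmoid applied to a weighted sum of tanh units, each combining a quadratic term, a linear term and a bias. The function and parameter names, the dimensions and the random initialisation below are illustrative assumptions; this is not the TensorFlow implementation used for the experiments.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ground_predicate(v, W, V, b, u):
    """Compute sigmoid(u . tanh(v^T W[l] v + V[l] . v + b[l])) over the k slices l.

    v: flat input of length d (concatenated groundings of the m arguments),
    W: k matrices of size d x d, V: k rows of length d, b and u: length k.
    """
    d = len(v)
    hidden = []
    for W_l, V_l, b_l in zip(W, V, b):
        quad = sum(v[i] * W_l[i][j] * v[j] for i in range(d) for j in range(d))
        lin = sum(V_l[i] * v[i] for i in range(d))
        hidden.append(math.tanh(quad + lin + b_l))
    return sigmoid(sum(u_l * h_l for u_l, h_l in zip(u, hidden)))


# Hypothetical binary predicate over two 2-dimensional groundings, with k = 2 slices.
rng = random.Random(0)
x, x2 = [0.9, 0.1], [0.2, 0.8]
v = x + x2                      # concatenation of the two arguments, d = 4
d, k = len(v), 2
W = [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)] for _ in range(k)]
V = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
b = [rng.uniform(-1, 1) for _ in range(k)]
u = [rng.uniform(-1, 1) for _ in range(k)]

truth = ground_predicate(v, W, V, b, u)
```

Because the output passes through a sigmoid it always lies strictly between 0 and 1, so it can be read directly as a fuzzy truth degree and combined with other literals by the chosen t-norm.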
In what follows, we describe how a suitable GT can be built for SII. Let $Pics^t \subseteq Pics$ be a set of bounding boxes of images correctly labelled with the classes that they belong to, and let each pair of bounding boxes be correctly labelled with the part-of relation. In machine learning terminology, $Pics^t$ is a training set without noise. In real semantics, a training set can be represented by a theory $\mathcal{T}_{expl} = \langle \mathcal{K}_{expl}, \hat{\mathcal{G}} \rangle$, where $\mathcal{K}_{expl}$ contains the set of closed literals $C_i(b)$ (resp. $\neg C_i(b)$) and $\mathsf{partOf}(b,b')$ (resp. $\neg\mathsf{partOf}(b,b')$), for every bounding box $b$ labelled (resp. not labelled) with $C_i$ and for every pair of bounding boxes $\langle b, b' \rangle$ connected (resp. not connected) by the $\mathsf{partOf}$ relation. The partial grounding $\hat{\mathcal{G}}$ is defined on all bounding boxes of all the images in $Pics$ where both the semantic features $class(C_i, b)$ and the bounding box coordinates are computed by the Fast R-CNN object detector [11]. $\hat{\mathcal{G}}$ is not defined for the predicate symbols in $\mathcal{P}$ and is to be learned. $\mathcal{T}_{expl}$ contains only assertional information about specific bounding boxes. This is the classical setting of machine learning where classifiers (i.e. the grounding of predicates) are inductively learned from positive examples (such as $\mathsf{partOf}(b,b')$) and negative examples ($\neg\mathsf{partOf}(b,b')$) of a classification. In this learning setting, mereological constraints such as “cats have no wheels” or “a tail is a part of a cat” are not taken into account. Examples of mereological constraints state, for instance, that the part-of relation is asymmetric ($\forall x y\,(\mathsf{partOf}(x,y) \rightarrow \neg\mathsf{partOf}(y,x))$), or list the several parts of an object (e.g. $\forall x y\,(\mathsf{Cat}(x) \wedge \mathsf{partOf}(y,x) \rightarrow \mathsf{Tail}(y) \vee \mathsf{Muzzle}(y))$), or even, for simplicity, that every whole object cannot be part of another object (e.g. $\forall x y\,(\mathsf{Cat}(x) \rightarrow \neg\mathsf{partOf}(x,y))$) and every part object cannot be divided further into parts (e.g. $\forall x y\,(\mathsf{Tail}(x) \rightarrow \neg\mathsf{partOf}(y,x))$). This general knowledge is available from online resources, such as WordNet [9], and can be retrieved by inheriting the meronymy relations for every concept corresponding to a whole object. A grounded theory that considers also mereological constraints as prior knowledge can be constructed by adding such axioms to $\mathcal{K}_{expl}$. More formally, we define $\mathcal{T}_{prior} = \langle \mathcal{K}_{prior}, \hat{\mathcal{G}} \rangle$, where $\mathcal{K}_{prior} = \mathcal{K}_{expl} \cup \mathcal{M}$, and $\mathcal{M}$ is the set of mereological axioms. To check the role of $\mathcal{M}$, we evaluate both theories and then compare results.
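As an illustration of how a mereological axiom contributes a truth value, the Python sketch below evaluates the asymmetry axiom of part-of over all ordered pairs of boxes, using the Lukasiewicz implication and the harmonic mean as aggregator. The dictionary-based grounding, the box names and the truth degrees are hypothetical.

```python
from itertools import permutations


def implies(a, b):
    """Lukasiewicz implication: min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)


def harmonic_mean(values):
    """Harmonic mean of truth degrees; 0 if any degree is 0."""
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def asymmetry_truth(part_of, boxes):
    """Truth of: for all x, y, partOf(x, y) -> not partOf(y, x), aggregated over ordered pairs."""
    degrees = [
        implies(part_of[(x, y)], 1.0 - part_of[(y, x)])
        for x, y in permutations(boxes, 2)
    ]
    return harmonic_mean(degrees)


boxes = ["tail_box", "cat_box"]

# A grounding that respects asymmetry fairly well ...
consistent = {("tail_box", "cat_box"): 0.9, ("cat_box", "tail_box"): 0.2}
# ... and one that claims part-of in both directions.
violating = {("tail_box", "cat_box"): 0.9, ("cat_box", "tail_box"): 0.9}

truth_consistent = asymmetry_truth(consistent, boxes)
truth_violating = asymmetry_truth(violating, boxes)
```

The consistent grounding yields a truth degree of 0.9 for the axiom, the violating one only 0.2. Adding such axioms to the theory therefore penalises groundings that fit the examples but contradict the background knowledge.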
5 Experimental Evaluation
We evaluate the performance of our approach for SII^4 on two tasks, namely, the classification of bounding boxes and the detection of relations between pairs of bounding boxes. In particular, we chose the part-of relation because both data (the PASCAL-Part dataset [5]) and ontologies (WordNet) are available on the part-of relation. In addition, part-of can be used to represent, via reification, a large class of relations [12] (e.g., the relation “a plant is lying on the table” can be reified in an object of type “lying event” whose parts are the plant and the table). However, it is worth noting that many other relations could have been included in this evaluation. The time complexity of LTN grows linearly with the number of axioms.
^4 LTN has been implemented as a Google TensorFlow library. Code, ontology, and dataset are available at https://gitlab.fbk.eu/donadello/LTN_IJCAI17
We also evaluate the robustness of our approach with respect to noisy data. It has been acknowledged by many that, with the vast growth in size of the training sets for visual recognition [15], many data annotations may be affected by noise such as missing or erroneous labels, non-localised objects, and disagreements between annotations, e.g. human annotators often mistake “part-of” for the “have” relation [24].
We use the PASCAL-Part dataset that contains 10103 images with bounding boxes annotated with object types and the part-of relation defined between pairs of bounding boxes. Labels are divided into three main groups: animals, vehicles and indoor objects, with their corresponding parts and “part-of” label. Whole objects inside the same group can share parts. Whole objects of different groups do not share any parts. Labels for parts are very specific, e.g. “left lower leg”. Thus, without loss of generality, we have merged the bounding boxes that referred to the same part into a single bounding box, e.g. bounding boxes labelled with “left lower leg” and “left upper leg” were merged into a single bounding box of type “leg”. In this way, we have limited our experiments to a dataset with 20 labels for whole objects and 39 labels for parts. In addition, we have removed from the dataset any bounding boxes with height or width smaller than 6 pixels. The images were then split into a training set with 80%, and a test set with 20% of the images, maintaining the same proportion of the number of bounding boxes for each label.
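Two of the preprocessing steps described above, merging the boxes of one part and discarding very small boxes, could look as follows in Python. This is a sketch under assumptions: boxes are `(x0, y0, x1, y1)` tuples, "merging" is read as taking the smallest enclosing box, and how annotations are grouped before merging is left to the caller.

```python
def union_box(boxes):
    """Smallest axis-aligned box enclosing all the given (x0, y0, x1, y1) boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def large_enough(box, min_side=6):
    """Keep a box only if both its width and its height are at least min_side pixels."""
    return (box[2] - box[0]) >= min_side and (box[3] - box[1]) >= min_side


# Hypothetical annotations of one leg, given as fine-grained part labels.
left_lower_leg = (10, 40, 20, 60)
left_upper_leg = (12, 20, 22, 42)

leg = union_box([left_lower_leg, left_upper_leg])   # single box of type "leg"
kept = [b for b in [leg, (0, 0, 5, 50)] if large_enough(b)]
```

The second box in the example is only 5 pixels wide and is therefore dropped, while the merged leg box is kept.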
Object Type Classification and Detection of the Part-Of Relation: Given a set of bounding boxes detected by an object detector (we use Fast R-CNN), the task of object classification is to assign to each bounding box an object type. The task of part-of detection is to decide, given two bounding boxes, if the object contained in the first is a part of the object contained in the second. We use LTN to resolve both tasks simultaneously. This is important because a bounding box type and the part-of relation are not independent. Their dependencies are specified in LTN using background knowledge in the form of logical axioms.
To show the effect of the logical axioms, we train two LTNs: the first containing only training examples of object types and part-of relations ($\mathcal{T}_{expl}$), and the second containing also logical axioms about types and part-of ($\mathcal{T}_{prior}$). The LTNs were set up with a tensor of $k$ layers and a regularization parameter $\lambda$. We chose Lukasiewicz’s t-norm ($\mu(a,b) = \max(0, a + b - 1)$) and use the harmonic mean as aggregation operator. We ran 1000 training epochs of the RMSProp learning algorithm available in TensorFlow. We compare results with the Fast R-CNN at object type classification (Eq. (1)), and the inclusion ratio baseline (Eq. (2)) at the part-of detection task^5. If $\mathcal{G}(\mathsf{partOf}(b,b'))$ is larger than a given threshold $th$ (in our experiments, $th = 0.7$) then the bounding boxes $b$ and $b'$ are said to be in the $\mathsf{partOf}$ relation. Every bounding box $b$ is classified into $C \in \mathcal{P}_1$ if $\mathcal{G}(C(b)) > th$. With this, a bounding box can be classified into more than one class. For each class, precision and recall are calculated in the usual way. Results for indoor objects are shown in Figure 3 where AUC is the area under the precision-recall curve. The results show that, for both object types and the part-of relation, the LTN trained with prior knowledge given by mereological axioms has better performance than the LTN trained with examples only. Moreover, prior knowledge allows LTN to improve the performance of the Fast R-CNN (FRCNN) object detector. Notice that the LTN is trained using the Fast R-CNN results as features. FRCNN assigns a bounding box to a class if the values of the corresponding semantic features exceed $th$. This is local to the specific semantic features. If such local features are very discriminative (which is the case in our experiments) then very good levels of precision can be achieved. Differently from FRCNN, LTNs make a global choice which takes into consideration all (semantic and geometric) features together. This should offer robustness to the LTN classifier at the price of a drop in precision. The logical axioms compensate for this drop. For the other object types (animals and vehicles), LTN has results comparable to FRCNN: FRCNN beats $\mathcal{T}_{prior}$ by 0.05 and 0.037 AUC, respectively, for animals and vehicles. Finally, we have performed an initial experiment on small data, on the assumption that the LTN axioms should be able to compensate for a reduction in training data. By removing 50% of the training data for indoor objects, a similar performance to $\mathcal{T}_{prior}$ with the full training set can be achieved: 0.767 AUC for object types and 0.623 AUC for the part-of relation, which shows an improvement in performance.
^5 A direct comparison with [4] is not possible because their code was not available.
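The thresholded, multi-label decision rule and the per-class precision and recall described above can be sketched in Python as follows. The threshold 0.7 is the value reported in the text; the function names, the convention for empty denominators and the example truth values are assumptions.

```python
def classify(truth_values, th=0.7):
    """Return every class whose truth value exceeds th (possibly several, possibly none)."""
    return {c for c, t in truth_values.items() if t > th}


def precision_recall(predicted, gold, cls):
    """Precision and recall of one class; predicted and gold are lists of label sets per box."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if cls in p and cls in g)
    fp = sum(1 for p, g in pairs if cls in p and cls not in g)
    fn = sum(1 for p, g in pairs if cls not in p and cls in g)
    # Convention assumed here: an empty denominator counts as a perfect score.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical truth values of Cat(b) and Dog(b) for three bounding boxes.
scores = [
    {"Cat": 0.90, "Dog": 0.75},
    {"Cat": 0.40, "Dog": 0.80},
    {"Cat": 0.72, "Dog": 0.10},
]
gold = [{"Cat"}, {"Dog"}, {"Dog"}]

predicted = [classify(s) for s in scores]
cat_pr = precision_recall(predicted, gold, "Cat")
dog_pr = precision_recall(predicted, gold, "Dog")
```

The first box is assigned both labels, which shows how a single bounding box can fall into more than one class. Sweeping the threshold and recording the resulting precision and recall pairs gives the curve whose area is reported as AUC.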
Robustness to Noisy Training Data: In this evaluation, we show that logical axioms improve the robustness of LTNs in the presence of errors in the labels of the training data. We have added an increasing amount of noise to the PASCAL-Part dataset training data, and measured how performance degrades in the presence and absence of axioms. For each noise level $k$, we randomly select $k\%$ of the bounding boxes in the training data, and randomly change their classification labels. In addition, we randomly select $k\%$ of pairs of bounding boxes, and flip the value of the part-of relation’s label. For each value of $k$, we train LTNs $\mathcal{T}_{expl}$ and $\mathcal{T}_{prior}$ and evaluate results on both SII tasks as done before. As expected, adding too much noise to training labels leads to a large drop in performance. Figure 6 shows the AUC measures for indoor objects with increasing error $k$. Each pair of bars indicates the AUC of $\mathcal{T}_{prior}$ and $\mathcal{T}_{expl}$, for a given $k\%$ of errors.
Results indicate that the LTN axioms offer robustness to noise: in addition to the expected overall drop in performance, an increasing gap can be seen between the drop in performance of the LTN trained with examples only and the LTN trained including background knowledge.
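The label-corruption procedure used for this evaluation can be sketched in Python as follows. The sketch assumes that a corrupted class label is always replaced by a different class, that part-of labels are booleans, and that the number of corrupted items is rounded to the nearest integer; the function names and the toy data are hypothetical.

```python
import random


def corrupt_labels(labels, classes, k, rng):
    """Replace the class label of k% of the boxes with a different, randomly chosen class."""
    noisy = list(labels)
    n = round(len(labels) * k / 100)
    for i in rng.sample(range(len(labels)), n):
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy


def flip_relations(part_of, k, rng):
    """Flip the boolean part-of label of k% of the ordered box pairs."""
    noisy = dict(part_of)
    pairs = sorted(noisy)
    n = round(len(pairs) * k / 100)
    for pair in rng.sample(pairs, n):
        noisy[pair] = not noisy[pair]
    return noisy


rng = random.Random(0)  # fixed seed so the corruption is reproducible

labels = ["Cat"] * 50 + ["Dog"] * 50
noisy_labels = corrupt_labels(labels, ["Cat", "Dog", "Tail"], k=20, rng=rng)

part_of = {(i, j): False for i in range(5) for j in range(5) if i != j}
noisy_part_of = flip_relations(part_of, k=10, rng=rng)
```

With 100 boxes and k = 20, exactly 20 labels differ from the originals; with 20 ordered pairs and k = 10, exactly 2 part-of labels are flipped. The original annotations are left untouched, so the same clean data can be corrupted at several noise levels.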
6 Conclusion and Future Work
SII systems are required to address the semantic gap problem: combining visual low-level features with high-level concepts. We argue that the problem can be addressed by the integration of numerical and logical representations in deep learning. LTNs learn from numerical data and logical constraints, enabling approximate reasoning on unseen data to predict new facts. In this paper, LTNs were shown to improve on the state-of-the-art method Fast R-CNN for bounding box classification, and to outperform a rule-based method at learning part-of relations in the PASCAL-Part dataset. Moreover, LTNs were evaluated on how to handle noisy data through the systematic creation of training sets with errors in the labels. Results indicate that relational knowledge can add robustness to neural systems. As future work, we shall apply LTNs to larger datasets such as Visual Genome, and continue to compare the various instances of LTN with SRL, deep learning and other neural-symbolic approaches on such challenging visual intelligence tasks.
References
 [1] J. Atif, C. Hudelot, and I. Bloch. Explanatory reasoning for image understanding using formal concept analysis and description logics. Systems, Man, and Cybernetics: Systems, IEEE Transactions on, 44(5):552–570, May 2014.
 [2] S. H. Bach, M. Broecheler, B. Huang, and L. Getoor. Hinge-loss Markov random fields and probabilistic soft logic. CoRR, abs/1505.04406, 2015.
 [3] M. Bergmann. An Introduction to Many-Valued and Fuzzy Logic: Semantics, Algebras, and Derivation Systems. Cambridge University Press, 2008.
 [4] N. Chen, Q.-Y. Zhou, and V. Prasanna. Understanding web images by object relation network. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, pages 291–300, New York, NY, USA, 2012. ACM.
 [5] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
 [6] S. Dasiopoulou, Y. Kompatsiaris, and M. G. Strintzis. Applying fuzzy DLs in the extraction of image semantics. J. Data Semantics, 14:105–132, 2009.
 [7] M. Diligenti, M. Gori, and C. Saccà. Semantic-based regularization for learning and inference. Artificial Intelligence, 2015.
 [8] I. Donadello and L. Serafini. Integration of numeric and symbolic information for semantic image interpretation. Intelligenza Artificiale, 10(1):33–47, 2016.
 [9] C. Fellbaum, editor. WordNet: an electronic lexical database. MIT Press, 1998.
 [10] G. Forestier, C. Wemmert, and A. Puissant. Coastal image interpretation using background knowledge and semantics. Computers & Geosciences, 54:88–96, 2013.
 [11] R. Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.
 [12] N. Guarino and G. Guizzardi. On the reification of relationships. In 24th Italian Symp. on Advanced Database Sys., pages 350–357, 2016.
 [13] B. Gutmann, M. Jaeger, and L. De Raedt. Extending ProbLog with continuous distributions. In Proc. ILP, pages 76–91. Springer, 2010.
 [14] C. Hudelot, J. Atif, and I. Bloch. Fuzzy spatial relation ontology for image interpretation. Fuzzy Sets and Systems, 159(15):1929–1951, 2008.
 [15] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations, 2016.
 [16] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In CVPR, 2011.
 [17] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In ECCV, pages 852–869, 2016.
 [18] M. Marszalek and C. Schmid. Semantic Hierarchies for Visual Object Recognition. In CVPR, 2007.
 [19] B. Neumann and R. Möller. On scene interpretation with description logics. Image and Vision Computing, 26(1):82–101, 2008. Cognitive Vision Special Issue.
 [20] D. Nyga, F. Balint-Benczedi, and M. Beetz. PR2 looking at things: ensemble learning for unstructured information processing with Markov logic networks. In IEEE Intl. Conf. on Robotics and Automation, pages 3916–3923, 2014.
 [21] I. S. Espinosa Peraldi, A. Kaya, and R. Möller. Formalizing multimedia interpretation based on abduction over description logic ABoxes. In Proc. of the 22nd Intl. Workshop on Description Logics, volume 477 of CEUR Workshop Proceedings. CEUR-WS.org, 2009.
 [22] V. Ramanathan, C. Li, J. Deng, W. Han, Z. Li, K. Gu, Y. Song, S. Bengio, C. Rosenberg, and L. Fei-Fei. Learning semantic relationships for better action retrieval in images. In CVPR, 2015.
 [23] I. Ravkic, J. Ramon, and J. Davis. Learning relational dependency networks in hybrid domains. Machine Learning, 100(2-3):217–254, 2015.
 [24] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural networks on noisy labels with bootstrapping. CoRR, abs/1412.6596, 2014.
 [25] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2):107–136, 2006.
 [26] T. Rocktäschel, S. Singh, and S. Riedel. Injecting logical background knowledge into embeddings for relation extraction. In NAACL, 2015.
 [27] L. Serafini and A. S. d’Avila Garcez. Learning and reasoning with logic tensor networks. In Proc. AI*IA, pages 334–348, 2016.
 [28] R. Socher, D. Chen, C. D. Manning, and A. Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926–934, 2013.
 [29] J. Wang and P. Domingos. Hybrid markov logic networks. In AAAI, volume 8, pages 1106–1111, 2008.
 [30] Y. Zhu, A. Fathi, and L. Fei-Fei. Reasoning about object affordances in a knowledge base representation. In ECCV, pages 408–424, 2014.