What is (missing or wrong) in the scene? A Hybrid Deep Boltzmann Machine For Contextualized Scene Modeling
Abstract
Scene models allow robots to reason about what is in the scene, what else should be in it, and what should not be in it. In this paper, we propose a hybrid Boltzmann Machine (BM) for scene modeling in which relations between objects are integrated. To do so, we extend the BM to include tri-way edges between visible (object) nodes and make the network share the relations across different objects. We evaluate our method against several baseline models (Deep Boltzmann Machines and Restricted Boltzmann Machines) on a scene classification dataset, and show that it performs better in several scene reasoning tasks.
I Introduction
Modeling (representing) their environments is crucial for cognitive as well as artificial agents. For a robot, scene modeling pertains to representing a scene in such a way that the robot can reason about the scene and the objects in it in an efficient manner. A scene model should allow the robot to check, for example, (i) whether there is a certain object in the scene and where it is, (ii) whether it is in the right place in the scene, or (iii) whether there is something redundant in the scene that should be moved somewhere else.
Although there are many studies on scene modeling in robotics and computer vision (e.g., [1, 2, 3, 4]), to the best of our knowledge, ours is the first that uses multi-way Boltzmann Machines for the task.
Boltzmann Machines (BMs) [5] are stochastic generative models that offer many benefits for various modeling problems. These benefits include (among others) the presence of latent nodes, which function as context variables modulating the object activations; the ease of extending the model to meet the requirements of scene modeling; and their generative capability. Although BMs existed beforehand, they regained popularity with extensions to deep architectures and restricted connectivity (i.e., Restricted Boltzmann Machines).
I-A Related Work
Scene Modeling: Many models have been proposed for scene modeling in computer vision and robotics using probabilistic models such as Markov or Conditional Random Fields [6, 4, 1, 2, 7], Bayesian Networks [8, 3], Latent Dirichlet Allocation variants [9, 10], Dirichlet and Beta processes [11], chain graphs [12], predicate logic [13, 14], and Scene Graphs [15]. There have also been many attempts at ontology-based scene modeling, where objects and various types of relations are modeled [14, 16, 17].
Among these, [1, 2, 3, 4] use context either in representing the scene or in solving a task using the scene model for a robotics problem. These studies model context via local interactions between visible variables, except for [2], which proposed using Latent Dirichlet Allocation for modeling context.
Relation Estimation and Reasoning: Early studies on integrating relations into scene modeling and analysis tasks were rule-based. These approaches defined relations using rules based on 2D/3D distances between objects, e.g., [18]. With advances in probabilistic graphical modeling, many approaches used models such as Markov Random Fields [6, 19], Conditional Random Fields [7], Implicit Shape Models [20], and latent generative models [11]. Many studies have also proposed formulating relation detection as a classification problem, e.g., using logistic regression [21] or deep learning [22].
I-B Contributions
The main contributions of our work are the following:

Deep Boltzmann Machines for Scene Modeling: We use Deep Boltzmann Machines (DBM) for modeling a scene in terms of objects and the relations between the objects. To the best of our knowledge, this is the first study that uses DBM with relations for the task.

A Hybrid Tri-way Model - DBM with relations: Adding relations to a DBM is not straightforward, since there may be different relations between objects, and the same relation between different objects should represent the same thing. This leads to two extensions: (i) tri-way edges to represent relations in the DBM, and (ii) weight-sharing between the weights of relation nodes to enforce that relations between different objects represent the same relation.
We evaluate our extended DBM model on many practical robot problems: determining (i) what is missing in a scene, (ii) the relations between objects, (iii) what should not be in a scene, and (iv) random scene generation given some objects or relations from the to-be-generated scene. We compare our model (Tri-way BM) against DBM [23] with 2-way relations (GBM), and Restricted Boltzmann Machines (RBM) [24].
II Background: General, Restricted and Higher-order Boltzmann Machines
A Boltzmann Machine (BM) [5] is a graphical model composed of visible nodes v = (v_1, ..., v_m) and hidden nodes h = (h_1, ..., h_n) – see also Figure 2. (Although this is textbook material, it is essential for us to be able to describe our extensions.) In a BM, hidden nodes are connected to other hidden nodes with bidirectional weights J; visible nodes to other visible nodes with weights L; and hidden nodes to visible nodes with weights W. With these connections, a BM tries to obtain an estimation of the data distribution P(v) from a sample of the training data.
For a BM, one can define a scalar energy, representing the negative harmony between the nodes given the current weights:

E(v, h) = -\left( \sum_{i<j} L_{ij} v_i v_j + \sum_{i,k} W_{ik} v_i h_k + \sum_{k<l} J_{kl} h_k h_l \right)    (1)
A BM is inspired by physical systems, which favor states with lower energies; therefore, the probability of being in a certain state (v, h) is linked to the energy of the system via the Boltzmann distribution:

P(v, h) = \frac{e^{-E(v, h)/T}}{Z}    (2)
where Z = \sum_{v, h} e^{-E(v, h)/T} is called the partition function. Since the partition function is intractable to compute for real problems, P(v, h) is iteratively estimated by stochastically activating nodes in the network, with a probability based on the change in the energy of the system after the update:

p(s_i = 1) = \frac{1}{1 + e^{-\Delta E_i / T}}    (3)
where s_i is a visible or a hidden node; \Delta E_i is the change in the energy of the system if s_i is turned on; and T is the temperature of the system, gradually decreased (annealed) to a low value, controlling how stochastic the updates are.
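The stochastic update rule in Equation 3 can be illustrated with a minimal sketch (illustrative `weights` and `bias` arrays; the weight matrix is assumed symmetric with a zero diagonal, so the energy gap of node i reduces to its bias plus weighted input):

```python
import numpy as np

def sample_node(states, weights, bias, i, T):
    """Stochastically update node i of a Boltzmann Machine.

    dE is the energy gap for turning node i on: bias[i] plus the
    weighted input from all other nodes (diagonal assumed zero).
    The node is switched on with probability 1 / (1 + exp(-dE / T)).
    """
    dE = bias[i] + weights[i] @ states
    p_on = 1.0 / (1.0 + np.exp(-dE / T))
    states[i] = 1.0 if np.random.rand() < p_on else 0.0
    return states

def anneal(states, weights, bias, T_start=10.0, T_end=1.0, steps=100):
    """Run stochastic updates while gradually lowering the temperature."""
    for T in np.linspace(T_start, T_end, steps):
        for i in range(len(states)):
            sample_node(states, weights, bias, i, T)
    return states
```

At high temperature the updates are nearly random (p_on close to 0.5); as T is annealed toward a low value, the updates become increasingly deterministic.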
Since training is rather slow and limiting in BMs, a restricted version (Restricted Boltzmann Machines), with connections only between hidden and visible nodes, has been proposed [24]. In a Deep Boltzmann Machine [23], on the other hand, there are layers of hidden nodes. See Figure 2 for a schematic comparison of the alternative models.
Some problems require edges that combine more than two nodes at once, which has led to Higher-order Boltzmann Machines (HBM) [25]. With an HBM, one can introduce edges of any order to link multiple nodes together.
II-A Training a Boltzmann Machine
Training a BM minimizes the Kullback-Leibler divergence between P^+(v), the distribution over the visible nodes v when data is clamped on the visible nodes (called the positive phase), and P^-(v), the distribution obtained when the network is run freely (called the negative phase). Taking the gradient of the divergence with respect to the weights leads to the following update rule:

\Delta w_{ij} = \eta \left( \langle s_i s_j \rangle^+ - \langle s_i s_j \rangle^- \right)    (4)
where \langle s_i s_j \rangle^+ and \langle s_i s_j \rangle^- are the expected joint activations of nodes s_i and s_j during the positive phase and the negative phase, respectively; and \eta is a learning rate.
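The update rule of Equation 4 can be sketched as follows, with the expectations estimated over batches of sampled node states from the two phases (array shapes are illustrative):

```python
import numpy as np

def bm_weight_update(pos_states, neg_states, lr=0.5):
    """One Boltzmann Machine learning step (Equation 4).

    pos_states: (N, D) binary node states collected in the positive
    (data-clamped) phase; neg_states: same for the free-running negative
    phase. Returns dW = lr * (<s_i s_j>+ - <s_i s_j>-), with the
    expectations estimated as batch averages of the joint activations.
    """
    pos_corr = pos_states.T @ pos_states / len(pos_states)  # <s_i s_j>+
    neg_corr = neg_states.T @ neg_states / len(neg_states)  # <s_i s_j>-
    return lr * (pos_corr - neg_corr)
```

When the free-running statistics match the data statistics, the update vanishes, which is exactly the fixed point the KL-divergence minimization seeks.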
III A Tri-way Hybrid Boltzmann Machine for Scene Modeling
As shown in Figure 1, we extend Boltzmann Machines by adding relational (visible) nodes that (i) are shared across objects, and (ii) link two objects together with a single tri-way edge. In other words, a relation r_k connects two objects, o_i and o_j, with a shared weight w^r_k. The overall energy of the hybrid BM is then updated as follows:

E(v, h) = -\left( \sum_{i<j} L_{ij} v_i v_j + \sum_{i,k} W_{ik} v_i h_k + \sum_{k<l} J_{kl} h_k h_l + \sum_{k} \sum_{(i,j)} w^r_k \, o_i o_j r^k_{ij} \right)    (5)
where the last term is the change compared to the energy definition in Equation 1. Note that the definition in Equation 5 uses tri-way edges with the relation nodes, and that relations (in fact, the weights of relations) are shared across objects.
Weight-sharing means that, e.g., a left relation between o_1 and o_2 and a left relation between o_3 and o_4 (with o_1 \neq o_3, o_2 \neq o_4) represent the same relation. To enforce this, the gradients on the weight of a relation r coming from all pairs of objects in the scene are aggregated:

\Delta w^r = \eta \sum_{(i,j) \in S_r} \left( \langle o_i o_j r_{ij} \rangle^+ - \langle o_i o_j r_{ij} \rangle^- \right)    (6)
where S_r is the set of object tuples connected by relation r.
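The aggregation in Equation 6 can be sketched as follows (a minimal sketch; `pairs`, `rel_pos`, and `rel_neg` are illustrative names, not from the paper):

```python
import numpy as np

def shared_relation_update(obj_pos, rel_pos, obj_neg, rel_neg, pairs, lr=0.5):
    """Aggregate the tri-way gradient for one shared relation weight (Eq. 6).

    pairs: list of (i, j) object index tuples connected by this relation
    (the set S_r). rel_pos / rel_neg map each pair to the state of its
    relation node in the positive / negative phase; obj_pos / obj_neg hold
    the corresponding object node states.
    """
    grad = 0.0
    for (i, j) in pairs:
        grad += obj_pos[i] * obj_pos[j] * rel_pos[(i, j)]  # positive phase
        grad -= obj_neg[i] * obj_neg[j] * rel_neg[(i, j)]  # negative phase
    return lr * grad
```

Because every pair in S_r contributes to the same scalar w^r, a "left" relation learned between one object pair directly benefits every other pair connected by "left".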
III-A Training and Inference
In order to make training faster, we dropped the connections between the hidden neurons.
For training our Tri-way Hybrid BM, in the positive phase, as usual, we clamp the visible units with the objects and the relations between the objects, and calculate \langle s_i s_j \rangle^+ for every edge in the network.
In the negative phase, the object units are first sampled with a two-step Gibbs sampling using the activations of the hidden units and the relation units. In this way, the relation units also contribute to the activation of the object units, in addition to the hidden units. Then, the relation units are sampled from the freshly sampled object units and the hidden units. We calculate \langle s_i s_j \rangle^- for every edge in the network in these two steps.
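The two-step negative phase can be sketched as follows (a simplified sketch with a single shared relation weight `w_r` and illustrative names; the hidden-unit contribution to the relation units in step 2 is omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_phase(h, o, r, W_oh, w_r, rng):
    """Two-step Gibbs sampling for the negative phase (a sketch).

    h: hidden states; o: object states (mutable); r: dict mapping object
    pairs (a, b) to relation node states; W_oh: hidden-to-object weights.
    """
    # Step 1: sample object units from hidden units AND relation units.
    for i in range(len(o)):
        inp = W_oh[:, i] @ h                  # hidden-to-object input
        for (a, b), rel in r.items():         # tri-way contribution
            if a == i:
                inp += w_r * o[b] * rel
            elif b == i:
                inp += w_r * o[a] * rel
        o[i] = float(rng.random() < sigmoid(inp))
    # Step 2: resample relation units from the freshly sampled objects
    # (the full model also uses the hidden units here).
    for (a, b) in r:
        r[(a, b)] = float(rng.random() < sigmoid(w_r * o[a] * o[b]))
    return o, r
```

The key point of the ordering is that the relation nodes seen in step 1 still carry the previous iteration's values, so objects and relations are updated in alternating blocks, as in standard block Gibbs sampling.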
For training the networks, we used gradient descent with a batch size of 32 and early stopping (i.e., training is stopped when the validation accuracy begins to decrease). The learning rate and temperature are empirically set to 0.5 and 1, respectively, and two hidden layers are used, with 200 hidden units in the bottom layer and 100 in the top layer.
For inference, we use Gibbs sampling [26], a Markov Chain Monte Carlo (MCMC) method, to approximate the true data distribution. We prefer an MCMC method over variational inference since our dataset is relatively small (3,485 samples in total) and the input vectors are very sparse (i.e., only a small number of relation nodes are active). Therefore, we need the precise inference that MCMC methods can provide but variational inference cannot [27].
III-B Dataset Collection
There are two datasets with labeled spatial relations: [28] and CLEVR [22]. However, both datasets are simulated, and therefore we collected a real dataset with relations.
We use a subset of 3,485 samples (the ones acquired with the newest depth sensors) of the SUN RGB-D dataset [29]. Misspelled and redundant object labels were merged, reducing the total number of objects to 417. We extended the original dataset by manually annotating eight spatial relations (left, right, front, behind, on-top, under, above, below) among the annotated objects. A relation may hold between any object pair; therefore, the total number of relations that can be estimated is 4 × 417 × 417 = 695,556 (after merging opposite relations, as described below).
Let us denote the dataset by X = \{x^{(n)}\}, where n indexes the samples. Each x^{(n)} is a vector representing the presence of objects and the relations among them in the scene: active objects and relations have value 1, and 0 otherwise. Opposite relations (e.g., left and right) can be represented by a single relation in BMs, since if object o_i is to the left of object o_j, then o_j is to the right of o_i. As a result, each sample is represented by a binary vector of length 695,973 (417 object nodes plus 695,556 relation nodes).
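Under the assumption that the relation block is laid out as rel × 417² + i × 417 + j (the exact index scheme is not specified in the paper), the encoding can be sketched as follows; note that the resulting vector length matches the stated 695,973:

```python
import numpy as np

N_OBJ = 417   # object vocabulary size after label cleaning
N_REL = 4     # opposite relations merged (e.g., left/right -> one node)

def encode_scene(objects, relations):
    """Encode a scene as a binary presence vector.

    objects:   iterable of object indices in [0, N_OBJ).
    relations: iterable of (rel, i, j) triples with rel in [0, N_REL);
               assumed index: N_OBJ + rel * N_OBJ**2 + i * N_OBJ + j.
    Vector length: N_OBJ + N_REL * N_OBJ**2 = 695,973.
    """
    x = np.zeros(N_OBJ + N_REL * N_OBJ ** 2, dtype=np.uint8)
    for i in objects:
        x[i] = 1                                          # object block
    for rel, i, j in relations:
        x[N_OBJ + rel * N_OBJ ** 2 + i * N_OBJ + j] = 1   # relation block
    return x
```

The sparsity mentioned in the text is visible here: a typical scene activates only a handful of the roughly 700 thousand entries.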
The dataset covers 33 indoor scene types (kitchen, dining room, etc.) in which robots can be used for a variety of tasks instead of humans. We split the dataset into three parts for training, testing, and validation; all sets include samples from each scene category.
IV Experiments and Results
In this section, we evaluate and compare the methods on several tasks.
IV-A Network Training Performance
We calculated an error based on the difference between the original data clamped to the visible units and the reconstructed visible states sampled in the negative phase, and observed how it changes during training:

err = \frac{1}{N} \sum_{n=1}^{N} \sum_i \left| p_i^{+,(n)} - p_i^{-,(n)} \right|    (7)
where p_i^{+,(n)} is the activation probability of visible node i for the original data clamped to the visible units at the beginning of the positive phase, and p_i^{-,(n)} is the activation probability of visible node i at the end of the negative phase. The cumulative sum over all samples is normalized by the total number of samples N.
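Equation 7 can be computed with a minimal sketch (`p_pos` and `p_neg` hold the per-sample activation probabilities from the two phases):

```python
import numpy as np

def reconstruction_error(p_pos, p_neg):
    """Average absolute difference between clamped activation probabilities
    (positive phase) and reconstructed ones (end of the negative phase),
    summed over visible nodes and normalized by the number of samples N."""
    p_pos = np.asarray(p_pos, dtype=float)   # shape (N, D)
    p_neg = np.asarray(p_neg, dtype=float)
    return np.abs(p_pos - p_neg).sum() / len(p_pos)
```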
We look at the error separately for the objects and the relations. Figure 3 displays the error on the training and validation datasets. We see that the network consistently decreases the error, and learns to represent objects and the relations between them. However, relations are learned much faster. This difference arises because the space of all possible relations is much larger than the object set, and very sparse: the network quickly learns to estimate 0 (zero) for relations, which leads to a sudden decrease in the loss.
IV-B Task 1: Relation Estimation
Our model can estimate possible relations among the active objects in the scene. For testing, the active objects in the scene are clamped to the visible object nodes and the model is run freely. Initially, the model sees the environment as a "bag of objects" and samples the hidden units (i.e., the context); the context is thus determined from the objects in the scene, without the relations among them. Then, the relation nodes are sampled from the context and the objects. For this task, we define accuracy as the percentage of relations correctly estimated with respect to the labeled relations in the test dataset.
RBM      GBM      Tri-way BM   Chance level
12.06%   14.18%   23.35%       -
We evaluate this task with RBM, General BM and our hybrid model, as shown in Table I. We see that our model provides the highest accuracy. We do not consider relation nodes that are inactive in the original data, since the network has already learned which relation nodes should be inactive.
We provide some visual examples in Figure 4, where we see that our model nicely finds out how to place a set of objects together. The chance level of activating a single relation node is listed in Table I.
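The accuracy measure used in Task 1 can be sketched as follows (an illustrative helper, not from the paper: accuracy is the fraction of relations active in the ground truth that the model also predicts as active):

```python
def relation_accuracy(pred, truth):
    """Task 1 accuracy: percentage of labeled (active) relations in the
    test data that the model also predicts as active. Inactive relation
    nodes in the ground truth are not considered, as stated in the text."""
    truth_idx = {i for i, t in enumerate(truth) if t}
    hit = sum(1 for i in truth_idx if pred[i])
    return 100.0 * hit / len(truth_idx)
```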
IV-C Task 2: What is missing in the scene?
In this task, we randomly deactivate an object in the scene and expect the network to find the missing object. For this task, we define accuracy as the percentage of missing objects found correctly in the test dataset.
As shown in Table II, our hybrid model performs better than RBM and GBM. See also Figure 5, which shows some visual examples of the most likely objects found for a target position in the scene.
RBM      GBM      Tri-way BM   Chance level
35.12%   40.94%   43.28%       -
IV-D Task 3: What is out of context in the scene?
Next, we evaluate how well the methods find an object that is out of context in the scene. To this end, we select scenes, randomly remove an object, and randomly add another object not in the scene. Of course, during this process, the network might disrupt other objects in the scene as well. To take this into account, we define the error measure for a sample as the number of nodes that are incorrectly sampled or changed. Let x be the current scene representation, where x_o is the randomly selected active object, and let \hat{x} be the scene representation in which x_o is removed (set to zero). After \hat{x} is clamped, the network settles to x', where object o is hopefully recovered, but there might be other unwanted changes on nodes other than x_o. We define our measure formally as follows:

\epsilon = \sum_i \left| x_i - x'_i \right|    (8)
where |\cdot| is the absolute value function. Table III compares the methods and shows that our hybrid model produces the lowest error. See also Figure 6 for some visual examples.
In this task, the models may assign higher contextual importance to particular objects in different scenes (e.g., a "dishwasher" is a dominant object for the "kitchen" context and provides more contextual information than a "chair" does). Therefore, they may remove objects that carry less contextual information and thereby corrupt the original input data.
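The error measure of Equation 8 reduces to a Hamming-style distance between the original and settled binary scene vectors; a minimal sketch:

```python
import numpy as np

def out_of_context_error(x, x_settled):
    """Count nodes that differ between the original scene vector x and the
    state x' that the network settled to (Equation 8): sum_i |x_i - x'_i|."""
    return int(np.abs(np.asarray(x) - np.asarray(x_settled)).sum())
```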
RBM      GBM      Tri-way BM   Chance level
0.6446   0.1404   0.0789       0.5
IV-E Task 4: Generate a scene
In this task, we demonstrate another generative ability of our Tri-way BM: we can select a hidden node (or several of them, leaving the other hidden units randomly initialized or set to zero) and sample visible nodes (including relations) that describe a scene. Figure 7 shows some visual examples.
V Conclusion
We have proposed a novel method based on Boltzmann Machines for contextualized scene modeling. To this end, we extended the BM by adding spatial relations between objects that are shared across different objects in the scene. We showed that, compared to RBM and DBM, our hybrid model performs better in several scene analysis and reasoning tasks, such as finding relations, missing objects, and out-of-context objects. Moreover, being generative, our model allows generating new scenes given a context or a part of the scene (as a set of objects).
Acknowledgment
This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) through the project "Context in Robots" (project no. 215E133). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.
References
 [1] H. Çelikkanat, G. Orhan, and S. Kalkan, “A probabilistic concept web on a humanoid robot,” IEEE Transactions on Autonomous Mental Development, vol. 7, no. 2, pp. 92–106, 2015.
 [2] H. Celikkanat, G. Orhan, N. Pugeault, F. Guerin, E. Şahin, and S. Kalkan, “Learning context on a humanoid robot using incremental latent dirichlet allocation,” IEEE Transactions on Cognitive and Developmental Systems, vol. 8, no. 1, pp. 42–59, 2016.
 [3] X. Li, J.F. Martínez, G. Rubio, and D. Gómez, “Context reasoning in underwater robots using mebn,” arXiv preprint arXiv:1706.07204, 2017.
 [4] A. Anand, H. Koppula, T. Joachims, and A. Saxena, “Contextually guided semantic labeling and search for 3d point clouds,” 2012.
 [5] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A learning algorithm for boltzmann machines,” Cognitive science, vol. 9, no. 1, pp. 147–169, 1985.
 [6] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena, “Contextually guided semantic labeling and search for threedimensional point clouds,” The International Journal of Robotics Research, vol. 32, no. 1, pp. 19–34, 2013.
 [7] D. Lin, S. Fidler, and R. Urtasun, “Holistic scene understanding for 3d object detection with rgbd cameras,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1417–1424.
 [8] Y. Sheikh and M. Shah, “Bayesian modeling of dynamic scenes for object detection,” IEEE PAMI, vol. 27, no. 11, pp. 1778–1792, 2005.
 [9] X. Wang and E. Grimson, “Spatial latent dirichlet allocation,” in NIPS, 2008, pp. 1577–1584.
 [10] J. Philbin, J. Sivic, and A. Zisserman, “Geometric LDA: A generative model for particular object discovery,” in BMVC, 2008.
 [11] D. Joho, G. D. Tipaldi, N. Engelhard, C. Stachniss, and W. Burgard, “Nonparametric bayesian models for unsupervised scene analysis and reconstruction,” Robotics, p. 161, 2013.
 [12] A. Pronobis and P. Jensfelt, “Largescale semantic mapping and reasoning with heterogeneous modalities,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2012, pp. 3515–3522.
 [13] F. Mastrogiovanni, A. Scalmato, A. Sgorbissa, and R. Zaccaria, “Robots and intelligent environments: Knowledge representation and distributed context assessment,” Automatika, vol. 52, no. 3, pp. 256–268, 2011.
 [14] W. Hwang, J. Park, H. Suh, H. Kim, and I. H. Suh, “Ontologybased framework of robot context modeling and reasoning for object recognition,” in Int. Conf. on Fuzzy Systems and Knowledge Discovery, 2006.
 [15] S. Blumenthal and H. Bruyninckx, “Towards a domain specific language for a scene graph based robotic world model,” arXiv preprint arXiv:1408.0200, 2014.
 [16] A. Saxena, A. Jain, O. Sener, A. Jami, D. K. Misra, and H. S. Koppula, “Robobrain: Largescale knowledge engine for robots,” arXiv preprint arXiv:1412.0691, 2014.
 [17] M. Tenorth and M. Beetz, “KnowRob: knowledge processing for autonomous personal robots,” in IEEE/RSJ IROS, 2009.
 [18] E. Stopp, K.P. Gapp, G. Herzog, T. Laengle, and T. C. Lueth, “Utilizing spatial relations for natural language access to an autonomous mobile robot,” vol. 861. Springer Science & Business Media, 1994, p. 39.
 [19] H. Celikkanat, E. Şahin, and S. Kalkan, “Integrating spatial concepts into a probabilistic concept web,” in International Conference on Advanced Robotics (ICAR). IEEE, 2015, pp. 259–264.
 [20] P. Meissner, R. Reckling, R. Jakel, S. R. SchmidtRohr, and R. Dillmann, “Recognizing scenes with hierarchical implicit shape models based on spatial object relations for programming by demonstration,” in International Conference on Advanced Robotics (ICAR). IEEE, 2013, pp. 1–6.
 [21] S. Guadarrama, L. Riano, D. Golland, D. Gouhring, Y. Jia, D. Klein, P. Abbeel, and T. Darrell, “Grounding spatial relations for humanrobot interaction,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2013), 2013, pp. 1640–1647.
 [22] J. Johnson, B. Hariharan, L. van der Maaten, L. FeiFei, C. L. Zitnick, and R. Girshick, “Clevr: A diagnostic dataset for compositional language and elementary visual reasoning,” arXiv preprint arXiv:1612.06890, 2016.
 [23] R. Salakhutdinov and G. Hinton, “Deep boltzmann machines,” in Artificial Intelligence and Statistics, 2009, pp. 448–455.
 [24] R. Salakhutdinov, A. Mnih, and G. Hinton, “Restricted boltzmann machines for collaborative filtering,” in Proceedings of the 24th international conference on Machine learning. ACM, 2007, pp. 791–798.
 [25] T. J. Sejnowski, “Higherorder boltzmann machines,” in AIP Conference Proceedings, vol. 151, no. 1. AIP, 1986, pp. 398–403.
 [26] S. Geman and D. Geman, “Stochastic relaxation, gibbs distributions, and the bayesian restoration of images,” IEEE Transactions on pattern analysis and machine intelligence, no. 6, pp. 721–741, 1984.
 [27] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: A review for statisticians,” Journal of the American Statistical Association, no. justaccepted, 2017.
 [28] D. Golland, P. Liang, and D. Klein, “A gametheoretic approach to generating spatial descriptions,” in Conference on empirical methods in natural language processing, 2010, pp. 410–419.
 [29] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgbd: A rgbd scene understanding benchmark suite,” in IEEE CVPR, 2015, pp. 567–576.