Not Only Look But Observe: Variational Observation Model of Scene-Level 3D Multi-Object Understanding for Probabilistic SLAM


We present NOLBO, a variational observation model for scene-level 3D multi-object understanding from a single 2D shot. Previous probabilistic instance-level methods mainly consider single-object images rather than a single shot containing multiple objects; relations between objects and the entire scene are outside their focus. The objectness of each observation is also rarely incorporated into their models. We therefore propose a method to approximate the Bayesian observation model of scene-level 3D multi-object understanding. By exploiting a variational auto-encoder (VAE), we estimate latent variables from the entire scene that follow tractable distributions and concurrently imply full 3D shape and pose. To perform object-oriented data association and probabilistic simultaneous localization and mapping (SLAM), our observation model can easily be adopted for probabilistic inference by replacing object-oriented features with latent variables.


1 Introduction

Object-oriented features serve various purposes such as semantic scene understanding, task planning and autonomous driving [39]. Real-time object detection and high-level feature estimation are essential in applications such as object recognition, data association and object-oriented simultaneous localization and mapping (SLAM). Recently, a plethora of real-time multi-object detection methods have been developed that go beyond category classification of a single-object image [38, 37, 24]. In some of these existing multi-object detection methods, however, the estimation results are still bound to object categories and locations on the image (bounding boxes), and hardly concern other details.

For data association and object-oriented SLAM, it is better to exploit various object representations such as complete 3D shape as well as categories. However, reconstruction methods such as [29, 50] and [49] are extremely challenging to run in real-time on multiple objects, since they require scanning each object from various viewpoints. Therefore, estimation methods which disentangle representations such as 3D shape or viewpoint orientation from a 2D image have been developed [54, 31]. With the emergence of deep learning, direct inference methods have also been developed [41, 33, 52, 57]. Furthermore, by using multi-object detectors or their network structures, estimation methods for instance-level understanding of multiple objects have also been studied [60, 21, 31].

Figure 1: Overview of the proposed method. We train a VAE to estimate the joint distribution of multiple objects in a single scene. Since objects are jointly related to each other and to the scene, variational likelihoods for each observation are estimated from the entire scene simultaneously. Our model involves the object uncertainty, so objectness can be reflected in the probabilistic SLAM. 3D reconstruction can also be achieved by decoding the latent variables in parallel.

These approaches, however, mainly focus on directly obtaining the disentangled features via network modeling, not on a probabilistic model. Hence data association in a Bayesian manner for probabilistic SLAM becomes challenging. Even when outputs of hidden layers of the network are used as object features, it is still challenging to perform Bayesian inference since the features follow intractable distributions [53, 52]. As a result, in most cases there is little choice but to look at the outputs of well-designed networks; they merely look at the objects, but do not observe them in the true sense.

In order to approximate the intractable object observation model, [58] and [59] adopt the evidence lower bound (ELBO). However, these methods mainly focus on a single object and hardly consider multi-object observation. For a single scene with multiple objects, most instance-level understanding methods [11, 52, 54] and object-oriented SLAM methods [59, 1, 31, 16] must collect cropped regions of interest (RoIs) using an additional multi-object detector, and then feed each image into their single-object models. Therefore, relations between objects, and between objects and the single scene, are out of their concern. As the object detector and the observation model are separated from each other, the objectness from the multi-object detector is also hardly considered jointly in their models.

To this end, we propose a method for scene-level 3D multi-object understanding from a single shot, and a SLAM framework based on it. We estimate the joint distribution of multiple objects by using a variational auto-encoder (VAE), considering the relations between objects and the single scene. The complex joint probability of multiple objects can be captured in factorized form by leveraging the latent space. Latent variables are used instead of object-oriented features for SLAM in a Bayesian manner, so frame-level understanding is easily reflected in the SLAM formulation. Since our model possesses latent variables for objectness, object uncertainty for each observation is also considered in probabilistic data association. Fast data association is crucial for real-time SLAM optimization, so we devise a generative model that reduces the dimensionality of latent variables. An overview of our method is shown in Fig. 1.

Our contributions are two-fold: First, we mathematically show that the multi-object observation model considering relations between objects and the single scene can be captured by exploiting an existing multi-object detector structure. Second, we introduce probabilistic SLAM with our model, so that frame-level understanding and object uncertainty are seamlessly reflected in the data association.

2 Related work

With the recent advent of neural networks, a number of single-object classification and detection methods with high performance have been proposed [19, 14, 40]. Beyond obtaining one feature vector from one image for an object, several multi-object detection techniques from a single shot have been developed by introducing new network structures [38, 36, 37, 24, 22]. In particular, some of these methods can be applied to various real-time tasks since the whole detection network is composed of a single network pipeline.

Various studies have also been conducted to understand the instance-level representation from 2D images such as object shape, orientation or bounding box. [44, 27, 47] and [21] estimate the orientation of the object by viewpoint classification with discretized bins. In addition, 3D bounding box regression has been carried out to obtain the object location and orientation [42, 46, 32, 28]. In order to estimate the distinct 3D shape of objects, [54] aligns the prior shape to a single object image through key point matching and estimates its 3D shape and orientation together. [33] estimates the 3D mesh with linear combination of parameterized prior shapes. In [52, 11, 57, 51], they have actively utilized non-linear regression and latent variables of neural networks for 3D reconstruction from 2D.

Through multi-object detection and instance-level understanding altogether, learning the disentangled representation of multiple objects becomes achievable. [46] exploits the YOLOv2 structure [37] to estimate the 3D bounding boxes and centers of multiple objects and obtains their orientations. In [21], they estimate the 3D shape rendering and orientation under the Faster R-CNN structure [38]. They obtain the shape rendering via a weighted sum of the parameterized prior shapes with PCL. Orientations are estimated by classifying bins which indicate the discretized object pose. Similarly, in [31], they design an object observation factor to perform data association for pose SLAM. RoIs for multiple objects are obtained by [37].

These studies are efficient because they mainly concern direct and accurate estimation of object characteristics through network modeling; on the other hand, the probabilistic observation models are relatively less considered. Although they exploit neural networks for nonlinear regression, approximating the intractable distribution is rarely addressed. Therefore, Bayesian inference with the obtained features is challenging; for example, data association for SLAM is considered only in the front-end, and additional algorithms are necessary to perform loop closing and place recognition [39, 31].

To handle the intractable target distribution, latent variables can be adopted [8, 45, 17, 12]. In order to understand and utilize the latent space, [20, 34] have studied the relations between latent variables and object visualization by using VAE [17]. However, it is still challenging to apply their method to probabilistic model approximation, as it mainly concentrates on interpretable graphic codes. To approximate the observation probability, entropy and variational likelihood are exploited in the field of active vision [6, 30, 3]. Using VAE, [58, 59] have proposed methods to approximate the observation model of 3D objects for Bayesian inference. Based on the ELBO which approximates the observation model, they have shown how probabilistic SLAM with data association can be performed with the expectation-maximization (EM) algorithm.

However, the methods above only concern the observation model for a single object, so it is inevitable to use a multi-object detector to obtain single-object images from the scene. They hardly estimate the object observation model based on the entire scene. The relations between the objects and the scene are also barely considered. It is also challenging to include object uncertainty in their model, since the objectness is determined by the multi-object detector. Therefore, we introduce a generative story of scene-level multi-object understanding for Bayesian inference, which is, to the best of our knowledge, the first of its kind.

Figure 2: Overview of the proposed Bayesian graphical model for the object generative model. (a) The object label and the orientation relative to the observer throw a Bayesian dice to generate the full 3D shape of the object in the scene. For the objectness of an observed image, , which generates , can be involved. (b) We exploit latent variables to approximate the target distribution. Here, , and are for the bounding box, the basic 3D shape and the orientation of the object, respectively. For the prior distributions of , the parameter is learned simultaneously with and , which are the parameters of the encoder and the decoder, respectively. (c) We assume that a single scene is generated by the 3D objects and their bounding boxes. Therefore, the variational likelihoods of each observation are estimated from ; relations between objects, and between objects and the single shot, can be considered.

3 Multi-object Observation Model

3.1 Evidence Lower Bound and Encoder

Suppose we select regions in a single scene and observe the full 3D shape of the arbitrary structure in each region. The ’th area of the scene can be defined with RoIs [38], grids of fixed size [36, 37], or grids of various sizes [24, 22]. Typical multi-object detection methods [38, 37, 24, 22] mainly focus on real-time detection and category inference. For the generative story of an object, however, any type of disentangled representation can be involved, such as 3D shape or pose. Let be the th observed full 3D shape. Similar to [7, 59], we assume that the label and the viewpoint orientation cast the Bayesian dice to generate the 3D shape as shown in Fig. 2(a): is the class or instance label of the 3D shape, and denotes the orientation of the shape relative to the observer, which can be represented in Euler angles. To address the object location in the scene, the bounding box is also included in our story.

When observing a single scene image , objects are jointly related to each other since the objects and their locations determine . To capture the intractable observation model, the joint probability can be addressed. Since our main concern is object-oriented features, we solely focus on the 3D shape of objects, excluding background in the scene. For object and background discrimination, a latent variable for objectness can be added to the Bayesian graph model as shown in Fig. 2(a). The joint probability can then be factorized as follows:


We let be a constant; that is, the 3D shape can be observed from any arbitrary viewpoint.

To learn the complex probability distribution (1) using VAE, we first need to find its lower bound. The joint probability in (1) can be denoted as . Here, we assume that the prior is a uniform distribution; that is, any kind of structure can be detected at an arbitrary location of a scene. The lower bound of can then be represented as:


In our work, we assume that the entire scene is generated by objects and their locations; therefore, in (2) and in the following, we let the variational likelihood be estimated from in order to consider the correlations between objects and the entire scene.
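When both the variational likelihood and the prior are diagonal Gaussians, the KL terms appearing in such a lower bound have a closed form. The following is a minimal numpy sketch of that closed form; the function name and the diagonal-covariance assumption are illustrative, not taken from the paper.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians:
    0.5 * sum( var_q/var_p + (mu_p - mu_q)^2/var_p - 1 + ln var_p - ln var_q )."""
    return 0.5 * np.sum(var_q / var_p + (mu_p - mu_q) ** 2 / var_p
                        - 1.0 + np.log(var_p) - np.log(var_q))
```

In training, such terms are summed over the latent variables of every region alongside the expectation (reconstruction) terms.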

Similarly, we can obtain the ELBO of in (1) as follows:


The graphical model including the latent variable in (3) is shown in Fig. 2(b). With (2) and (3), the lower bound of (1) can be achieved; however, the formulation is composed of the joint probability of latent variables for all objects, which is still intractable.

To relax the problem, we can obtain the lower bound in factorized form for each object by adopting mean field inference [17]. We assume that all elements of the latent variables and are independent of each other. That is, the lower bound can be factorized with , which are for the ’th region in . The lower bound of (1) can then be represented as:


where . and are the object and non-object (background) label sets, respectively. In (4), the first row can be viewed as the lower bound for objectness, the second for 3D shape reconstruction, and the third for bounding box regression. The KL and expectation terms can be learned using the encoding and decoding parts of the VAE, respectively.
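The three rows of (4) can be sketched as a per-grid training loss. The sketch below is illustrative only: it assumes a Bernoulli objectness, stands in for the shape/pose KL terms with a precomputed scalar, and uses a squared-error bounding-box term; all names are hypothetical.

```python
import numpy as np

def nolbo_grid_loss(obj_logit, is_object, kl_shape_pose, bbox_pred, bbox_gt):
    """Per-grid negative lower bound, split into the three rows of (4):
    objectness, 3D shape/pose KL, and bounding-box regression.
    Inputs are illustrative placeholders, not the paper's exact terms."""
    p = 1.0 / (1.0 + np.exp(-obj_logit))          # Bernoulli objectness
    l_obj = -(is_object * np.log(p + 1e-9)
              + (1.0 - is_object) * np.log(1.0 - p + 1e-9))
    # shape/pose and bbox terms only count for grids that hold an object
    l_shape = is_object * kl_shape_pose
    l_bbox = is_object * np.sum((bbox_pred - bbox_gt) ** 2)
    return l_obj + l_shape + l_bbox
```

The total loss would then be the sum of this quantity over all grids of the scene.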

In this manner, the joint observation model for multiple objects can be captured in factorized form in the latent space. Each is for each observation but estimated from the entire scene , so relations between objects are naturally taken into account. To implement a network for each observation, encoders are required. However, since the variational likelihoods of each observation share the parameter and are all estimated from , the encoders can be combined into a single encoder which estimates all likelihoods simultaneously. In our work, we exploit a YOLOv2-like structure as the encoder, which enables real-time performance and an end-to-end learning scheme. The graphical model with variational likelihood estimation is depicted in Fig. 2(c).

3.2 Low-Dimensional Latent Variables and Decoder

In order to learn the posterior for 3D shape reconstruction in (4), we can use a 3D decoder which outputs the rotated 3D shape according to the observed viewpoint. When an algorithmic prior operation such as a rotation transform exists, however, separating such an arithmetically trivial operation from the non-linear regression can relieve the whole network and enable efficient learning [2, 18, 15, 35]. For the shape reconstruction term of (4), we can let , where is a function that arithmetically rotates the basic-orientation shape with a rotation matrix. Then we have:


where is the rotation matrix according to the shape . is the rotation matrix computed from , so we let be the trigonometric values of the Euler angles as in [59]. Since we choose a binary voxelized grid to represent the 3D shape, in (5) is assumed to be a Bernoulli distribution. For the orientation, we let follow a von Mises–Fisher distribution [26, 13].
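Separating the rotation from the network requires only an arithmetic rotation of the basic-orientation shape. Below is a minimal sketch of building a rotation matrix from azimuth/elevation/in-plane angles and applying it to voxel-center coordinates; the Z-Y-X composition order is an assumption for illustration, not necessarily the paper's convention.

```python
import numpy as np

def euler_to_rot(az, el, th):
    """Rotation matrix from azimuth/elevation/in-plane angles (radians).
    Z-Y-X composition order is assumed for illustration."""
    ca, sa = np.cos(az), np.sin(az)
    ce, se = np.cos(el), np.sin(el)
    ct, st = np.cos(th), np.sin(th)
    Rz = np.array([[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[ce, 0.0, se], [0.0, 1.0, 0.0], [-se, 0.0, ce]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, ct, -st], [0.0, st, ct]])
    return Rz @ Ry @ Rx

def rotate_points(R, pts):
    """Apply the rotation to voxel-center coordinates (N x 3)."""
    return pts @ R.T
```

In practice the rotated coordinates would be resampled back onto the voxel grid; that interpolation step is omitted here.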

With (5), the second row of (4) can be expressed as the following:


For the tractable prior distribution of , we assume . In other words, we let be a Gaussian mixture model (GMM) as in [48, 59]. is obtained from a prior network with parameter , which is trained simultaneously with the VAE. The variational likelihoods except for are assumed to be isotropic Gaussians. For more details of each distribution, see Appendix I.
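The GMM prior can be evaluated with a numerically stable log-sum-exp. This is a sketch assuming isotropic components with a shared variance; the helper name and the fixed `sigma2` are illustrative assumptions.

```python
import numpy as np

def log_gmm_prior(z, means, log_weights, sigma2=1.0):
    """log p(z) under a GMM prior with isotropic components,
    computed via log-sum-exp over the K components for stability."""
    d = z.shape[-1]
    sq = np.sum((z[None, :] - means) ** 2, axis=-1)          # (K,)
    log_comp = (log_weights - 0.5 * sq / sigma2
                - 0.5 * d * np.log(2.0 * np.pi * sigma2))
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))
```

In the paper's setting the component means would come from the prior network; here they are passed in directly.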

Our decoder estimates only the basic shape without considering the orientation relative to the observer. To complete the observed shape inference, a rotation transform on should be performed subsequently. In this way, we can relieve the burden of our network and reduce its parameters; we empirically found that the dimension of the latent variable can be decreased from to , which is crucial for SLAM performance. As described in the next section, latent variables can replace object-oriented features, and metric operations between features are inevitable for data association. Reduced-dimensional latent variables thus make the SLAM optimization process up to 5 times faster. Additionally, with low-dimensional Gaussian latent variables, the bubble effect is relaxed [4] and becomes distinct according to the label . Hence, robust data association is achieved.

4 Latent Variables and Probabilistic SLAM

Consider the localization and mapping problem with object-oriented features. Let be the pose of an observer, and assume we have a collection of landmarks.

Now suppose that the observer navigates around the area and obtains a set of observations for keyframes. Here, the ’th observation can be about either an object or background. Previous works on object-oriented SLAM [1, 31, 59, 16] lean on existing multi-object detection methods to filter out non-object detections; frame-level joint probability and objectness are hardly considered when formulating the SLAM optimization. In our method, since objectness joins our single-scene understanding and thus naturally affects the data association, all observations obtained from the regions of can be seamlessly used for SLAM.

Adopting the lower bound (4) as the approximated observation model, the optimal and for probabilistic SLAM are obtained from the maximization step of the EM formulation:


where and .

Before calculating (7), in the expectation step, the similarity weight is calculated in consideration of the objectness of observations. Note that the object-oriented feature is replaced with the encoded variables and for each ’th grid, even though we start with the joint probability of the ’th keyframe. Also, for the likelihood of the object-oriented feature, the tractable latent priors and are used, which are isotropic Gaussians. Therefore, with simple derivations, optimal solutions can be achieved even if inaccurate observations are made, as the objectness of the multiple observations is concurrently considered. Details of the EM formulation can be found in Appendix II.
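The expectation-step weighting can be sketched as a Gaussian-likelihood similarity scaled by the observation's objectness. This is a toy version of the E-step, not the paper's full derivation; landmark priors are taken as isotropic Gaussians as stated above, and all names are hypothetical.

```python
import numpy as np

def association_weights(z, landmark_means, landmark_var, p_obj):
    """E-step similarity weights: Gaussian likelihood of the encoded
    latent z under each landmark prior, scaled by the observation's
    objectness p_obj and normalized over landmarks."""
    sq = np.sum((z[None, :] - landmark_means) ** 2, axis=-1)
    lik = np.exp(-0.5 * sq / landmark_var)
    return p_obj * lik / (np.sum(lik) + 1e-12)
```

A low-objectness (likely background) observation thus contributes little to every landmark, instead of being hard-rejected by a separate detector.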

Figure 3: Proposed network architecture for variational 3D object observation. We use darknet-19 in YOLOv2 as the core network of the encoder. The generator in 3D-GAN is adopted for the decoder. We set the dimension of the latent variable to , since the rotation task is separated from the decoder. The prior network consists of fully connected (FC) layers, and is trained simultaneously with the encoder and the decoder. When the encoder receives a single scene and infers latent variables for each RoI, we only select the latent variables of objects; therefore, the decoder gets as many latent variables as there are objects in the scene. In other words, the batch size varies between the encoder and the decoder during both training and testing. Our network can be trained in an end-to-end manner.

5 Implementation

To implement the proposed observation network, we use the darknet-19 structure [37] as the encoder core (or backbone). We construct the encoder by adding 3 convolutional layers with 1024 filters, followed by one convolutional layer, on top of the core network. A predictor (encoder) predicts for each grid. In other words, the predictor infers for objectness, for latent variables implying the full shape, for viewpoint inference, and for the bounding box. The decoder follows the generator structure of [52], except for the input dimension, which in our case is set to 16. A prior network consists of dense layers to represent the GMM prior distribution . As in [59], the prior network is trained simultaneously with the VAE.

Similar to [37], during training we consider the one predictor that predicts the highest IoU with the ground-truth bounding box as the responsible predictor for an object. After selecting the predictors observing objects in the grids, shape estimation is performed by feeding the latent variables obtained from those predictors to the decoder. Therefore, during both training and testing, the input batch size varies between the encoder and decoder; when a single scene enters the encoder, the decoder receives as many latent variables as the number of objects in the scene. The proposed network structure is displayed in Fig. 3.
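Selecting the responsible predictor reduces to an argmax over IoU with the ground-truth box. A minimal sketch with axis-aligned boxes in corner format; the helper names are ours.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def responsible_predictor(pred_boxes, gt_box):
    """Index of the predictor whose box best overlaps the ground truth."""
    return int(np.argmax([iou(b, gt_box) for b in pred_boxes]))
```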

6 Training details

The proposed network estimates various representations of multiple objects in a single scene with probabilistic distributions. The negative lower bound from (4), which is composed of KL divergence and expectation terms of various distributions, is used as the training loss. The network thus easily diverges without a sophisticated training strategy. For a stable optimization procedure, we replace the objectness term in (4) with in actual training. The objectness loss then becomes equal to the conventional binary cross-entropy loss.

We also found that two-stage pretraining stabilizes the main training: pre-pretraining of 2D-3D understanding, and pretraining of NOLBO for a single object. We first pre-pretrain the encoder core on the ImageNet dataset [5] for object classification and the Render for CNN dataset [44] for viewpoint classification, sequentially. The decoder can fall into a local minimum when learning a small number of object instances, since it only infers the basic 3D shape without considering the rotation transform. To alleviate this limitation, the decoder is also pre-pretrained on the ModelNet40 dataset [53], which contains 40 classes and about 300 instances per class. We construct a 3D VAE with our decoder, the 3D encoder in [52] and the prior network for this pre-pretraining.

NOLBO approximates the observation model by learning 3D reconstruction and pose estimation for multiple objects in 2D scenes. Therefore, we use the Pascal3D+ [56] and ObjectNet3D [55] training datasets, which comprise 2D-3D aligned annotations with 3D shape pose. Object orientation is expressed as azimuth, elevation and in-plane rotation angles. Since these datasets contain 100 classes in total, we manually select 40 classes. Prior to the NOLBO multi-object training, we pretrain NOLBO for a single object [58, 59] on these datasets with the pre-pretrained encoder core, decoder and a fresh prior network. As in [37], the networks are trained on multi-scale images. Since the networks converge slowly when trained on multi-resolution images from the beginning, we first fix the image resolution to , and change it to when the accuracy of 3D reconstruction reaches about 70% mAP. Once the accuracy reaches 70% again, multi-scale training is started.

The networks for NOLBO multi-object are constructed from the encoder core, decoder and prior network of the pretrained NOLBO single-object. Training starts with the multi-resolution images. We use the Adam optimizer with a starting learning rate of for the first epoch, and increase it to afterwards. For the first 20 epochs, we freeze the decoder and prior network since they have already learned the shape distribution. This allows the network to learn how to infer the 3D shape distribution from a 2D scene without diverging. Similar to [37, 59], Gaussian blur, HSV saturation, RGB inversion and random brightness are applied for 2D scene data augmentation. Random translation and scaling are also used.

All of our code and pre-trained models are available at

Figure 4: Comparison of the latent spaces of (a) TLNet, (b) 2D-3D auto-encoder, (c) vanilla VAE and (d) NOLBO. We also plot the latent space of the prior distribution in (e), which is trained simultaneously. To visualize the prior distribution, we sample latent variables from and plot them. Latent variables are colorized according to their respective object categories. Some of the latent variables in (a-c) are distributed separately even within the same category. In the case of NOLBO, latent variables tend to group according to their categories, as the KL-divergence terms in (4) force the latent variables to follow the prior distribution displayed in (e).
              aero  bike  boat  bottle bus   car   chair table mbike sofa  train tv    mean
Acc    [47]   0.81  0.77  0.59  0.93   0.98  0.89  0.80  0.62  0.88  0.82  0.80  0.80  0.81
Acc    [28]   0.78  0.83  0.57  0.93   0.94  0.90  0.80  0.68  0.86  0.82  0.82  0.85  0.81
Acc    Ours   0.83  0.86  0.70  0.90   0.95  0.96  0.95  0.83  0.83  0.98  0.94  0.91  0.88
MedErr [47]   13.8  17.7  21.3  12.9   5.8   9.1   14.8  15.2  14.7  13.7  8.7   15.4  13.6
MedErr [28]   13.6  12.5  22.8  8.3    3.1   5.8   11.9  12.5  12.3  12.8  6.3   11.9  11.1
MedErr Ours   14.5  16.5  17.8  10.5   10.1  8.6   11.4  13.7  16.9  10.7  9.2   14.1  11.7
Table 1: Comparison of the Viewpoint Estimations with Ground Truth Bounding Box on Pascal3D+ test dataset
Figure 5: Precision-recall curve of 3D reconstructions on Pascal3D+ and Objectnet3D test dataset.
Figure 6: Examples of 3D shape estimations of multiple objects and classifications using MLE. We display several reconstruction results for objects in each 2D scene. As shown in the first row, instance-level 3D shape estimation is achievable.
Figure 7: Several trajectory estimation results. We mark the important loop closing regions with red boxes. (a) Ground truth. (b) Visual odometry with known scale. (c) Observed objects. SLAM results using (d) NOLBOMulti, (e) NOLBOSingle and (f) vanilla VAE.

7 Experiments

We evaluate the proposed method in various aspects: disentangled representations and SLAM application. The main purpose of NOLBO is to approximate the object observation model. When the model is applied to Bayesian inference, latent variables and their prior distributions are used as object-oriented features and their observation models. Therefore, how objects are located or projected onto the latent space as latent variables is important. For comparison of latent spaces, we construct a 2D-3D auto-encoder (AE), vanilla VAE (vVAE) [17] and TL-Network (TLNet) [11] using our pretrained encoder and decoder structures. We also compare the SLAM results using object-oriented features from the above networks.

7.1 Object Pose Estimation

Since NOLBO approximates the observation model, it is possible to perform category classification and viewpoint estimation using MLE (see Appendix III). The latent variables for orientation become the orientation itself, so the viewpoint orientation can be inferred directly from the encoder. To report quantitative results of our pose estimation method, we train NOLBO with 2D images of fixed size. The comparison of viewpoint estimation is shown in Table 1. Our method shows competitive results relative to previous works.
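MLE classification with the learned priors amounts to picking the class whose prior component gives the encoded latent the highest likelihood. Below is a toy sketch assuming one isotropic Gaussian component per class; in the paper the prior is a GMM learned by the prior network, so this is a simplification.

```python
import numpy as np

def classify_mle(z, class_means, sigma2=1.0):
    """MLE category: the class whose isotropic Gaussian prior gives the
    encoded latent z the highest likelihood (equivalently, the nearest
    class mean under a shared variance). Illustrative sketch only."""
    sq = np.sum((z[None, :] - class_means) ** 2, axis=-1)
    return int(np.argmin(sq / sigma2))
```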

7.2 Object Observation and Latent Space

For data association or MLE, NOLBO has the advantage that latent variables follow distinct prior distributions for each class or instance. In order to display how objects are projected into the latent space according to their respective classes, we estimate the latent variables from NOLBO and the other methods. We use the outputs of the 2D encoders of the other, non-probabilistic methods as latent variables. The dimensions of the latent variables are set to 16, the same as NOLBO. We display the latent variables obtained from each network in Fig. 4, using the t-SNE method [25] for latent-space dimension reduction in order to plot in 2D. For clarity, we randomly choose 10 classes and show the results. Since NOLBO forces the latent variables to follow the prior distributions which are learned together, the latent variables are projected to particular distributions according to their categories, unlike the other methods.

      NOLBOMulti16  NOLBOMulti128  NOLBOSingle16  AE16   TLNet16  vVAE16  VO
ATE   25.93         28.66          52.79          37.88  62.56    64.78   30.75
RPE   29.88         31.71          65.03          40.83  73.02    72.07   35.40
Table 2: Trajectory estimation errors (ATE/RPE) evaluated on the KITTI dataset

7.3 Object 3D Shape Reconstruction

For comparison of 3D reconstruction results, we additionally train a 3D VAE-GAN. Since this method trains one 3D-GAN per object category, it is hard to apply the algorithm to 40 categories. Therefore, we manually choose classes with shapes similar to one another for both training and testing: bench, chair and sofa. The rest of the methods are evaluated on 40 classes. The precision-recall curves of 3D shape reconstruction are depicted in Fig. 5.

To verify the reconstruction results in various environments, we evaluate NOLBO on the MS-COCO [23], TUM [43] and KITTI [10] datasets. Some of the multi-object 3D reconstructions are displayed in Fig. 6. To clearly show the reconstructed 3D shapes, objects are aligned in an arbitrary viewpoint. For more quantitative results of multi-object 3D reconstruction, see our supplementary material.

7.4 Multi-object Observation and SLAM

To verify probabilistic SLAM using our observation model, we choose the KITTI dataset for autonomous driving. Among the KITTI sequences, 00, 05, 06 and 07 are used, which have various loop closing spots. To evaluate the data association aspect, we compare our SLAM results with those using latent variables from the other object understanding networks. Results using the 128-dimensional version of NOLBO are also compared. Trajectory estimation errors are reported in Table 2, and some of the results are displayed in Fig. 7. For SLAM, any visual odometry can be used for the initial trajectory estimate. However, to compare data association results more clearly, we use a low-performance visual odometry based on a simple visual feature matching procedure with known scale, as shown in Fig. 7(b). Results using our model, with objectness estimated from the entire single scene, show better loop closing performance than those using the single-object model. Even though our method is trained on indoor datasets, it achieves robust object observation in the wild and can handle the various loop closing points. As shown in Table 2, with high-dimensional (128) latent variables, data association occasionally fails due to the bubble effect. For more results, refer to our supplementary material.

8 Conclusion

We have proposed an observation model approximation for 3D multi-object understanding in a single 2D scene. Since 3D objects in a shot are related to each other and follow a complex joint distribution, object-oriented probabilistic SLAM poses a challenge. Therefore, we approximate the joint distribution of the multiple objects present in a single 2D scene using a VAE. The jointly correlated multi-object observation is represented with factorized tractable distributions in the latent space. Each observation can be estimated in consideration of the entire scene by exploiting an existing multi-object detector structure. Since our generative story involves objectness, observation uncertainty naturally affects the probabilistic data association for SLAM. The network is essentially an auto-encoder, so it can also be used to estimate the 3D shape and pose of multiple objects in a single shot.


This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2017R1A2B2002608), in part by the Automation and Systems Research Institute (ASRI), and in part by the Brain Korea 21 Plus Project.

9 Appendix

In this section, we present detailed formulations for probabilistic simultaneous localization and mapping (SLAM). In Appendices I to III, we specify the probability distributions for our observation model and introduce formulations for several Bayesian inference methods: in Appendix I, we describe the distributions for the latent variables in detail; in Appendix II, we show how our approximated observation model can be adopted for probabilistic semantic SLAM, which can be performed using the expectation-maximization (EM) method; examples of maximum likelihood estimation (MLE) with our model are presented in Appendix III.

Appendix I : Lower Bound and Probability Distributions for Observation Model

Combining Eqn. (4) and (6) of the paper, the lower bound of can be represented as the following:


The probability representing the objectness according to the label is defined as:


where . For convenience, we let . Here, is assumed to be the Bernoulli distribution of objectness.
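The Bernoulli objectness term corresponds to a standard binary log-likelihood; a minimal sketch of evaluating it for a single observation (the function name and the clamping epsilon are our own):

```python
import math

def objectness_log_prob(o, p, eps=1e-7):
    """Bernoulli log-likelihood: log p(o) = o*log(p) + (1-o)*log(1-p).

    o: objectness label in {0, 1}; p: predicted objectness probability.
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return o * math.log(p) + (1.0 - o) * math.log(1.0 - p)
```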

For the tractable prior distribution of , we assume . In other words, we let be a Gaussian mixture model (GMM). is obtained from a prior network with parameter , which is trained simultaneously with the VAE. To simplify the network structure, the variance is assumed to be . Similarly, the variational likelihood is assumed to be a multivariate Gaussian; .
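A minimal sketch of the two ingredients above, assuming equally weighted mixture components with a shared isotropic variance and a diagonal variational Gaussian sampled via the reparameterization trick (all names and the NumPy formulation are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar):
    """Draw z ~ N(mu, diag(exp(logvar))) via the reparameterization trick."""
    eps = rng.standard_normal(np.asarray(mu).shape)
    return mu + np.exp(0.5 * np.asarray(logvar)) * eps

def gmm_prior_log_prob(z, class_means, var=1.0, weights=None):
    """log p(z) under an equally weighted GMM with shared isotropic variance."""
    z = np.asarray(z, float)
    k, d = class_means.shape
    if weights is None:
        weights = np.full(k, 1.0 / k)
    # log N(z; mu_k, var*I) for each component
    sq = np.sum((z - class_means) ** 2, axis=1)
    log_comp = -0.5 * (d * np.log(2 * np.pi * var) + sq / var)
    m = log_comp.max()  # log-sum-exp for numerical stability
    return m + np.log(np.sum(weights * np.exp(log_comp - m)))
```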

To estimate the orientation of objects, classifying discretized angles is easier to train than continuous regression [27, 21]. For probabilistic modeling and observation uncertainty, however, it is more useful to assume a noise model rather than a multinoulli distribution. Since an angle represented in radians naturally follows a Gaussian distribution, we can let the latent variable be the angle directly by assuming . However, as the direct estimation of the angle is challenging for the network [7], we let the trigonometric function values of the angle be the latent variables rather than the angle itself. In other words, we let be the prior of the latent variable related to the orientation [26, 13]. Assuming and are i.i.d., , , and can be represented as follows [26]:


where . We set for to [59], because learning easily diverges when a lower precision is used. Similar to the prior, the variational likelihood for the orientation is assumed to be a wrapped normal distribution with trigonometric functions: . Therefore, for we first infer and from the network, and then calculate the mean and variance similarly to (10). In practice, we impose an additional constraint to account for the fact that the trigonometric functions are not independent. Since the inferred latent variables are the trigonometric values of the pose, the rotation matrix related to can be computed directly.
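For a Gaussian angle θ ~ N(μ, σ²), the trigonometric means follow from the characteristic function: E[cos θ] = e^{-σ²/2} cos μ and E[sin θ] = e^{-σ²/2} sin μ. A small sketch, with an atan2-based recovery of the mean angle from the trig latents (the function names are ours):

```python
import math

def trig_moments(mu, sigma2):
    """Means of cos(theta) and sin(theta) for theta ~ N(mu, sigma2)."""
    s = math.exp(-0.5 * sigma2)  # shared shrinkage factor e^{-sigma2/2}
    return s * math.cos(mu), s * math.sin(mu)

def angle_from_trig(mean_cos, mean_sin):
    """Recover the mean angle from the trig latents via atan2;
    the shared shrinkage factor cancels in the ratio."""
    return math.atan2(mean_sin, mean_cos)
```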

Appendix II : Object-oriented Probabilistic Semantic SLAM with Approximated Observation Model

A. EM Formulation of Probabilistic Semantic SLAM

Classical SLAM methods usually divide the problem into two parts: data association in the front-end and pose optimization in the back-end [?, 39, ?, 31]. With these approaches, a large error is inevitable when a false data association occurs in the front-end, since incorrectly determined data associations are rarely corrected and thus have a highly detrimental effect on the pose estimation in the back-end.

To avoid this limitation, a complete SLAM formulation with probabilistic data association can be achieved using the expectation-maximization (EM) method [1, 58, 59]; both pose and data association can be optimized simultaneously. Consider the localization and mapping problem with object-oriented features. Our observer (a camera or drone) collects a set of single shots as keyframes. Let be the pose of an observer that navigates the unknown area, where and represent the 3D position and orientation of the observer, respectively. Assume we have a collection of landmarks; stands for the category or instance-level label, and and denote the 3D position and orientation of the landmark in the global coordinate frame, respectively. Now suppose that the observer navigates around the area and obtains a set of 3D object observations . Here, the ’th observation consists of the 3D position and the full shape considering the object’s orientation relative to the observer. The EM formulation for probabilistic SLAM is expressed as follows [1, 58]:


is the set of all possible data associations representing that the object detection of landmark was obtained from the observer state . Also, is the set of all possible data associations such that the ’th detection is assigned to the ’th landmark.
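In the E-step, the soft data-association weights can be obtained by normalizing each detection's landmark log-likelihoods; a minimal softmax-style sketch under that assumption (the function name and array layout are illustrative, not the paper's exact formulation):

```python
import numpy as np

def association_weights(log_lik):
    """E-step: normalize per-detection landmark log-likelihoods into
    soft data-association weights (rows: detections, cols: landmarks)."""
    log_lik = np.asarray(log_lik, float)
    m = log_lik.max(axis=1, keepdims=True)  # subtract max for stability
    w = np.exp(log_lik - m)
    return w / w.sum(axis=1, keepdims=True)
```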

Now suppose the data associations for the ’th keyframe are fixed; then we have:


where and . Since we assume that is generated by and , the observation probability (13) can be split as follows:


where , and . For the first factorized term in (14), we can say that the set of features’ 3D positions is determined by the set of observer positions and the observed objects’ positions . The term in (14) denotes that the observer with states observes the 3D shapes of objects with label and pose . Equivalently, the observer observes the object placed according to the local orientation relative to the observer; the relation between and can be represented as:


where denotes the rotation matrix corresponding to the pose . Therefore, without loss of generality, we can rewrite (14) as follows:


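The relation between a landmark's global orientation and the local orientation seen from the observer can be sketched as a rotation composition, here using z-axis rotations for illustration (the function names are ours, and orthonormal rotation matrices are assumed):

```python
import numpy as np

def rot_z(yaw):
    """Rotation matrix about the z-axis by `yaw` radians."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def local_orientation(R_observer, R_global):
    """Express a landmark's global orientation in the observer frame:
    R_local = R_observer^T @ R_global (transpose = inverse for rotations)."""
    return R_observer.T @ R_global
```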
Substituting (16) into (11) and (12) finally yields:


B. Variational Observation Model Approximation and EM for Probabilistic SLAM

Expectation Step

The evidence lower bound (ELBO) nearly reaches the target distribution when the variational auto-encoder (VAE) has converged correctly. Therefore, we can let the lower bound in (8) approximately represent the joint distribution for the 3D shape observation probability in (16) as follows:


Taking the exponential of both sides, we have:


In (20), is a constant, and

Note that although the original joint probability for a single scene is intractable and extremely challenging to express in factorized form, we can obtain a factorized form such as (20) that follows tractable Gaussians by leveraging the latent space.

For , we have:


Since our main concern is object-oriented SLAM, it is necessary to focus on the object observations. To achieve this, we only calculate (21) for the prior where . Then the cases (a) and (b) of (21) remain, which indicate that if the objectness of an observation is close to true, (21) approaches 1; otherwise, it approaches 0. In other words, (21) can be seen as the probability of objectness when calculated only on for . In this manner, all observations for regions can be used naturally by letting the probability of objectness affect the weight calculation; we can reflect both ‘more object-like’ observations and ‘less object-like’ ones in the weights. There is no need to discard observations whose objectness falls below a certain threshold and assign equal weights to the rest.
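The objectness-weighted scheme described above, where each detection's association weights are scaled by its objectness probability rather than hard-thresholded, might be sketched as follows (an illustrative assumption, not the paper's exact weight formula):

```python
import numpy as np

def weighted_association(log_lik, objectness):
    """Scale each detection's normalized association weights by its
    objectness probability instead of discarding low-objectness detections."""
    log_lik = np.asarray(log_lik, float)
    m = log_lik.max(axis=1, keepdims=True)
    w = np.exp(log_lik - m)
    w /= w.sum(axis=1, keepdims=True)          # per-detection normalization
    return w * np.asarray(objectness, float)[:, None]  # soft objectness gate
```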

Substituting (20) into (17), we have:


We assume the 3D positions of the observations are i.i.d. Since and are independent of , we can reduce the fraction. Then (22) can be expressed as:


Meanwhile, the KL-divergence term of in (23) is expanded as:


The prior of is assumed to be a multivariate Gaussian: . The variational likelihood is also represented as . Note that and are the variables estimated from by the encoder; in other words, these variables are encoded features of the observed . With the prior and the variational likelihood, we have:


where is the normalization constant. With (25), we can express as:


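The KL terms between the diagonal Gaussian variational likelihood and the Gaussian prior admit the standard closed form; a minimal sketch (the function name is ours):

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag var_q) || N(mu_p, diag var_p) ):
    0.5 * sum( log(var_p/var_q) + (var_q + (mu_q - mu_p)^2)/var_p - 1 )."""
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )
```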
Meanwhile, we have assumed that for angles follows a Gaussian, and that the network is trained with the trigonometric values of . The encoder thus infers the trigonometric values of and the precision: , and . Using these values, we can calculate . Therefore, similarly to (26), in (23) can be represented as follows: