Neural Multisensory Scene Inference
Abstract
For embodied agents to infer representations of the underlying 3D physical world they inhabit, they should efficiently combine multisensory cues from numerous trials, e.g., by looking at and touching objects. Despite its importance, multisensory 3D scene representation learning has received less attention compared to the unimodal setting. In this paper, we propose the Generative Multisensory Network (GMN) for learning latent representations of 3D scenes which are partially observable through multiple sensory modalities. We also introduce a novel method, called the Amortized Product-of-Experts, to improve the computational efficiency and the robustness to unseen combinations of modalities at test time. Experimental results demonstrate that the proposed model can efficiently infer robust modality-invariant 3D-scene representations from arbitrary combinations of modalities and perform accurate cross-modal generation. To perform this exploration, we also develop the Multisensory Embodied 3D-Scene Environment (MESE).
1 Introduction
Learning a world model and its representation is an effective way of solving many challenging problems in machine learning and robotics, e.g., via model-based reinforcement learning (Silver et al., 2016). One characteristic aspect of learning the physical world is that it is inherently multifaceted and that we can perceive its complete characteristics only through our multisensory modalities. Thus, incorporating different physical aspects of the world via different modalities should help build a richer model and representation. One approach to learning such multisensory representations is to learn a modality-invariant representation as an abstract concept representation of the world. This is an idea well supported in both psychology and neuroscience. According to the grounded cognition perspective (Barsalou, 2008), abstract concepts such as objects and events can only be obtained through perceptual signals. For example, what represents a cup in our brain is its visual appearance, the sound it could make, the tactile sensation, etc. In neuroscience, the existence of concept cells (Quiroga, 2012) that respond only to a specific concept regardless of the modality sourcing the concept (e.g., showing a picture of Jennifer Aniston or listening to her name) can be considered biological evidence for the metamodal brain perspective (Pascual-Leone & Hamilton, 2001; Yildirim, 2014) and for modality-invariant representation.
An unanswered question from the computational perspective (our particular interest in this paper) is how to learn such a modality-invariant representation of the complex physical world (e.g., 3D scenes populated with objects). We argue that this is a particularly challenging problem because the following requirements need to be satisfied by the learned world model. First, the learned representation should reflect the 3D nature of the world. Although there have been some efforts in learning multimodal representations (see Section 3), those works do not consider this fundamental 3D aspect of the physical world. Second, the learned representation should also be able to model the intrinsic stochasticity of the world. Third, for the learned representation to generalize, be robust, and be practical in many applications, it should be inferable from experiences of any partial combination of modalities. It should also facilitate the generative modelling of other arbitrary combinations of modalities (Yildirim, 2014), supporting the metamodal brain hypothesis, for which human evidence can be found in the phantom limb phenomenon (Ramachandran & Hirstein, 1998). Fourth, even if there is evidence for a metamodal representation, there still exist modality-dependent brain regions, revealing a modal-to-metamodal hierarchical structure (Rohe & Noppeney, 2016). A learning model can also benefit from such a hierarchical representation, as shown by Hsu & Glass (2018). Lastly, the learning should be computationally efficient and scalable, e.g., with respect to the number of possible modalities.
Motivated by the above desiderata, we propose the Generative Multisensory Network (GMN) for neural multisensory scene inference and rendering. In GMN, from an arbitrary set of source modalities we infer a 3D representation of a scene that can be queried for generation via an arbitrary target modality set, a property we call generalized cross-modal generation. To this end, we formalize the problem as a probabilistic latent variable model based on the Generative Query Network (Eslami et al., 2018) framework and introduce the Amortized Product-of-Experts (APoE). The prior and posterior approximations using APoE make the model trainable with only a small subset of modality combinations, instead of the entire combination set. The APoE also resolves the inherent space-complexity problem of the traditional Product-of-Experts model and improves computational efficiency. As a result, the APoE allows the model to learn from a large number of modalities without tight coupling among the modalities, a desired property in many applications such as Cloud Robotics (Saha & Dasgupta, 2018) and Federated Learning (Konečný et al., 2016). In addition, with the APoE the modal-to-metamodal hierarchical structure is easily obtained. In experiments, we demonstrate the above properties of the proposed model on 3D scenes with blocks of various shapes and colors along with a human-like hand.
The contributions of the paper are as follows: (i) We introduce a formalization of modality-invariant multisensory 3D representation learning using a generative query network model and propose the Generative Multisensory Network (GMN). (ii) We introduce the Amortized Product-of-Experts network that allows for generalized cross-modal generation while resolving the problems in the GQN and traditional Product-of-Experts. (iii) Our model is the first to extend multisensory representation learning to 3D scene understanding with human-like sensory modalities (such as haptic information) and cross-modal generation. (iv) We also develop the Multisensory Embodied 3D-Scene Environment (MESE) used to develop and test the model.
2 Neural Multisensory Scene Inference
2.1 Problem Description
Our goal is to understand 3D scenes by learning a metamodal representation of the scene through the interaction of multiple sensory modalities such as vision, haptics, and auditory inputs. In particular, motivated by human multisensory processing (Deneve & Pouget, 2004; Shams & Seitz, 2008; Murray & Wallace, 2011), we consider a setting where the model infers a scene from experiences of one set of modalities and then generates another set of modalities given a query. For example, we can experience a 3D scene where a cup is on a table only by touching or grabbing it from some hand poses, and then ask whether we can visually imagine the appearance of the cup from an arbitrary query viewpoint (see Fig. 1). We begin this section with a formal definition of this problem.
A multisensory scene, simply a scene, consists of a context $C$ and an observation $O$. Given the set of all available modalities $\mathcal{M}$, the context and observation in a scene are obtained through the context modalities $\mathcal{M}_c \subseteq \mathcal{M}$ and the observation modalities $\mathcal{M}_o \subseteq \mathcal{M}$, respectively. In the following, we omit the scene index when the meaning is clear without it. Note that $\mathcal{M}_c$ and $\mathcal{M}_o$ are arbitrary subsets of $\mathcal{M}$ and may, for instance, overlap, be disjoint, or coincide. We also use $\mathcal{M}_S$ to denote all modalities available in a scene, $\mathcal{M}_S = \mathcal{M}_c \cup \mathcal{M}_o$.
The context and observation consist of sets of experience trials represented as query–sense pairs, i.e., $C = \{(\mathbf{v}^c_n, \mathbf{x}^c_n)\}_{n=1}^{N_c}$ and $O = \{(\mathbf{v}_n, \mathbf{x}_n)\}_{n=1}^{N_o}$. For convenience, we denote the set of queries and senses in the observation by $V$ and $X$, respectively, i.e., $O = (V, X)$. Each query and sense in the context consists of modality-wise queries and senses corresponding to each modality in the context modalities $\mathcal{M}_c$ (see Fig. S1). Similarly, each query and sense in the observation is constrained to the observation modalities $\mathcal{M}_o$. For example, for the image modality, a unimodal query can be the viewpoint and the sense is the image observed from the query viewpoint. Similarly, for the haptic modality, a unimodal query can be the hand pose, and the sense is the tactile and pressure readings obtained by a grab from the query hand pose. For a scene, we may, for example, have a haptic-only context and an image-only observation. For convenience, we also introduce the following notation: $C_m$ denotes the context corresponding only to a particular modality $m$, and $O_m$, $V_m$, and $X_m$ denote the modality-$m$ parts of $O$, $V$, and $X$, respectively.
Given the above definitions, we formalize the problem as learning a generative model of a scene that can generate senses corresponding to queries of a set of modalities, provided a context from other arbitrary modalities. Given scenes from the scene distribution, our training objective is to maximize the expected conditional log-likelihood $\mathbb{E}[\log P_\theta(X \mid V, C)]$, where $\theta$ denotes the model parameters to be learned.
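To make the notation concrete, the following minimal Python sketch (illustrative only; the dictionary layout, the 5-dimensional queries, and the 64×64 image size are our own assumptions, not part of any released code) shows one way a single scene with an image-only context and a haptic-only observation could be organized.

```python
import numpy as np

rng = np.random.default_rng(0)

# A scene is a pair (context C, observation O); each maps a modality name
# to a list of (query, sense) trials for that modality.
scene = {
    "context": {                      # M_c = {image}
        "image": [                    # query: camera viewpoint, sense: RGB image
            (rng.uniform(-1, 1, size=5), rng.uniform(0, 1, size=(64, 64, 3)))
            for _ in range(15)
        ],
    },
    "observation": {                  # M_o = {haptic}
        "haptic": [                   # query: wrist pose, sense: 132-d haptic vector
            (rng.uniform(-1, 1, size=5), rng.uniform(0, 1, size=132))
            for _ in range(15)
        ],
    },
}

# M_S: all modalities available in this scene (context or observation).
modalities_in_scene = set(scene["context"]) | set(scene["observation"])
print(modalities_in_scene)            # -> {'image', 'haptic'} (in some order)
```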
2.2 Generative Process
We formulate this problem as a probabilistic latent variable model in which we introduce the latent metamodal scene representation $\mathbf{z}$ drawn from a conditional prior $P_\theta(\mathbf{z} \mid C)$. The joint distribution of the generative process becomes:
$$
P_\theta(X, \mathbf{z} \mid V, C) = P_\theta(X \mid V, \mathbf{z})\, P_\theta(\mathbf{z} \mid C) = \prod_{n=1}^{N_o} P_\theta(\mathbf{x}_n \mid \mathbf{v}_n, \mathbf{z})\, P_\theta(\mathbf{z} \mid C) = \prod_{n=1}^{N_o} \prod_{m \in \mathcal{M}_o} P_{\theta_m}(\mathbf{x}^m_n \mid \mathbf{v}^m_n, \mathbf{z})\, P_\theta(\mathbf{z} \mid C).
$$
2.3 Prior for Multisensory Context
As the prior is conditioned on the context, we need an encoding mechanism for the context to obtain $P_\theta(\mathbf{z} \mid C)$. A simple way to do this is to follow the Generative Query Network (GQN) (Eslami et al., 2018) approach: each context query–sense pair is encoded and the encodings are summed (or averaged) to obtain a permutation-invariant context representation $\mathbf{r}$. A ConvDRAW module (Gregor et al., 2016) is then used to sample $\mathbf{z}$ given $\mathbf{r}$.
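As a rough illustration of this aggregation step (a simplified stand-in using a small MLP; the actual model uses the tower encoder and ConvDRAW modules described in Appendix B), the sketch below encodes each context pair independently and sums the encodings, so the result is invariant to the ordering of the context trials.

```python
import torch
import torch.nn as nn

class ContextAggregator(nn.Module):
    """Toy stand-in for the GQN-style context encoder: encode each
    (query, sense) pair independently, then sum over trials."""
    def __init__(self, query_dim, sense_dim, repr_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(query_dim + sense_dim, 256), nn.ReLU(),
            nn.Linear(256, repr_dim),
        )

    def forward(self, queries, senses):
        # queries: (N, query_dim), senses: (N, sense_dim)
        pair_codes = self.encoder(torch.cat([queries, senses], dim=-1))
        return pair_codes.sum(dim=0)   # permutation-invariant representation r

agg = ContextAggregator(query_dim=5, sense_dim=132, repr_dim=64)
r = agg(torch.randn(15, 5), torch.randn(15, 132))   # same r for any trial order
```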
In the multisensory setting, however, this approach cannot be directly adopted due to a few challenges. First, unlike GQN, the sense and query of each sensory modality have a different structure, and thus we cannot have a single, shared context encoder that handles all modalities. In our model, we therefore introduce a separate modality encoder for each modality $m$.
The second challenge stems from the fact that we want our model to be capable of generating from any context modality set to any observation modality set, a property we call generalized cross-modal generation (GCG). However, at test time we do not know which combinations of sensory modalities will be given as a context and as a target to generate. This would require collecting training data that contains all possible combinations of context–observation modalities $(\mathcal{M}_c, \mathcal{M}_o)$, i.e., the Cartesian product of the power sets of $\mathcal{M}$, $2^{\mathcal{M}} \times 2^{\mathcal{M}}$. This is a very expensive requirement, as the number of combinations grows exponentially with respect to the number of modalities.¹
¹The number of modalities or sensory input sources can be very large depending on the application. Even in the case of 'human-like' embodied learning, it is not only vision, haptics, audition, etc. For example, given a robotic hand, the context input sources can be only a part of the hand, e.g., some parts of some fingers, from which we humans can imagine the senses of other parts.
Although one might consider dropping out random modalities during training to achieve generalized cross-modal generation, this still assumes the availability of the full set of modalities from which to drop some. Moreover, it is unrealistic to assume that we always have access to the full set of modalities; to learn, we humans do not need to touch everything we see. Therefore, it is important to make the model learnable with only a small subset of all possible modality combinations while still achieving the GCG property. We call this the missing-modality problem.
To this end, we can model the conditional prior as a Product-of-Experts (PoE) network (Hinton, 2002) with one expert per sensory modality, each parameterized by $\theta_m$: $P_\theta(\mathbf{z} \mid C) = \prod_{m \in \mathcal{M}_c} P_{\theta_m}(\mathbf{z} \mid C_m)$. While this could achieve our goal at the functional level, it comes at the computational cost of increased space and time complexity with respect to the number of modalities. This is particularly problematic when we want to employ diverse sensory modalities (as in, e.g., robotics) or when each expert has to be a powerful (hence expensive in both computation and storage) model, as in the 3D-scene inference task (Eslami et al., 2018), where the powerful ConvDraw network is necessary to represent the complex 3D scene.
2.4 Amortized Product-of-Experts as Metamodal Representation
To deal with the limitations of PoE, we introduce the Amortized Product-of-Experts (APoE). For each modality $m$, we first obtain a modality-level representation $\mathbf{r}^m$ using the modality encoder. Note that this modality encoder is a much lighter module than the full ConvDraw network. Then, each modality encoding $\mathbf{r}^m$, along with its modality-id $m$, is fed into the expert-amortizer that is shared across all modality experts through the shared parameter $\psi$. In our case, this is implemented as a ConvDraw module (see Appendix B for the implementation details). We can write the APoE prior as follows:
$$
P(\mathbf{z} \mid C) = \prod_{m \in \mathcal{M}_c} P_\psi(\mathbf{z} \mid \mathbf{r}^m, m).
$$
We can extend this further to obtain a hierarchical representation model by treating $\mathbf{r}^m$ as a latent variable:
$$
P(\mathbf{z}, \{\mathbf{r}^m\} \mid C) = \prod_{m \in \mathcal{M}_c} P_\psi(\mathbf{z} \mid \mathbf{r}^m, m)\, P_{\theta_m}(\mathbf{r}^m \mid C_m),
$$
where $\mathbf{r}^m$ is the modality-level representation and $\mathbf{z}$ is the metamodal representation. Although we could train this hierarchical model with the reparameterization trick and Monte Carlo sampling, for simplicity in our experiments we use a deterministic function for the modality-level representation, i.e., $P_{\theta_m}(\mathbf{r}^m \mid C_m)$ is a Dirac delta centered at the encoder output. In this hierarchical version, the generative process becomes:
$$
P(X, \mathbf{z}, \{\mathbf{r}^m\} \mid V, C) = P_\theta(X \mid V, \mathbf{z}) \prod_{m \in \mathcal{M}_c} P_\psi(\mathbf{z} \mid \mathbf{r}^m, m)\, P_{\theta_m}(\mathbf{r}^m \mid C_m).
$$
An illustration of the generative process is provided in Fig. S2(b) in the Appendix. From the perspective of cognitive psychology, the APoE model can be considered a computational model of the metamodal brain hypothesis (Pascual-Leone & Hamilton, 2001), which posits the existence of metamodal brain areas (the expert-of-experts in our case) that perform a specific function not tied to a particular input sensory modality.
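The following sketch illustrates this structure under simplifying assumptions (MLPs in place of the tower encoders and the ConvDraw expert-amortizer, flat feature vectors instead of images; all layer sizes are ours): each modality has its own light-weight encoder producing $\mathbf{r}^m$, while a single amortizer, shared across modalities and conditioned on a one-hot modality id, maps each $\mathbf{r}^m$ to the parameters of a Gaussian expert over $\mathbf{z}$.

```python
import torch
import torch.nn as nn

class APoEPrior(nn.Module):
    """Sketch of the Amortized Product-of-Experts prior: per-modality encoders
    plus one shared expert-amortizer (the real model uses ConvDraw)."""
    def __init__(self, modalities, input_dims, repr_dim=64, z_dim=32):
        super().__init__()
        self.ids = {m: i for i, m in enumerate(modalities)}
        self.encoders = nn.ModuleDict({
            m: nn.Linear(input_dims[m], repr_dim) for m in modalities
        })
        # Shared amortizer: (r^m, one-hot modality id) -> (mu_m, log-precision_m)
        self.amortizer = nn.Sequential(
            nn.Linear(repr_dim + len(modalities), 256), nn.ReLU(),
            nn.Linear(256, 2 * z_dim),
        )

    def expert_params(self, modality, context_repr):
        one_hot = torch.zeros(len(self.ids))
        one_hot[self.ids[modality]] = 1.0
        r_m = self.encoders[modality](context_repr)            # modality-level r^m
        mu, log_prec = self.amortizer(torch.cat([r_m, one_hot])).chunk(2)
        # Per-expert parameters; combined across modalities by the Gaussian
        # product rule described in Section 2.5.
        return mu, log_prec.exp()

prior = APoEPrior(["image", "haptic"], {"image": 512, "haptic": 132})
mu_h, prec_h = prior.expert_params("haptic", torch.randn(132))
```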
2.5 Inference
Since the optimization of the aforementioned objective is intractable, we perform variational inference by maximizing the following evidence lower bound (ELBO) with the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014):
$$
\log P_\theta(X \mid V, C) \;\ge\; \mathbb{E}_{Q_\phi(\mathbf{z} \mid C, O)}\!\left[\log P_\theta(X \mid V, \mathbf{z})\right] \;-\; \mathrm{KL}\!\left[Q_\phi(\mathbf{z} \mid C, O)\,\|\,P_\theta(\mathbf{z} \mid C)\right],
$$
where $Q_\phi(\mathbf{z} \mid C, O)$ is the approximate posterior. This can be considered a cognitively plausible objective as, according to the "grounded cognition" perspective (Barsalou, 2008), the modality-invariant representation of an abstract concept, $\mathbf{z}$, can never be fully modality-independent.
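For concreteness, a one-sample version of this objective can be written as below (a minimal sketch assuming diagonal-Gaussian prior and posterior parameterized by precisions; the real model evaluates the reconstruction term with the ConvDraw renderers, and the dummy decoder here is purely illustrative):

```python
import torch
import torch.distributions as D

def elbo(decoder_log_prob, q_mu, q_prec, p_mu, p_prec):
    """One-sample ELBO sketch: E_q[log P(X|V,z)] - KL[Q(z|C,O) || P(z|C)],
    with both distributions taken to be diagonal Gaussians."""
    q = D.Normal(q_mu, q_prec.rsqrt())      # approximate posterior
    p = D.Normal(p_mu, p_prec.rsqrt())      # conditional prior from context
    z = q.rsample()                          # reparameterization trick
    return decoder_log_prob(z) - D.kl_divergence(q, p).sum()

# Usage with a dummy decoder log-likelihood (stand-in for log P(X|V,z)):
dummy = lambda z: -0.5 * (z ** 2).sum()
loss = -elbo(dummy, torch.zeros(32), torch.ones(32), torch.zeros(32), torch.ones(32))
```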
APoE Approximate Posterior. The approximate posterior is implemented as follows. Following Wu & Goodman (2018), we first rewrite the true posterior as
$$
P(\mathbf{z} \mid C, O) = \frac{P(O, C \mid \mathbf{z})\, P(\mathbf{z})}{P(O, C)} = \frac{P(\mathbf{z})}{P(C, O)} \prod_{m \in \mathcal{M}_S} P(C_m, O_m \mid \mathbf{z}) = \frac{P(\mathbf{z})}{P(C, O)} \prod_{m \in \mathcal{M}_S} \frac{P(\mathbf{z} \mid C_m, O_m)\, P(C_m, O_m)}{P(\mathbf{z})}.
$$
After ignoring the terms that are not a function of $\mathbf{z}$, we obtain $P(\mathbf{z} \mid C, O) \propto P(\mathbf{z}) \prod_{m \in \mathcal{M}_S} P(\mathbf{z} \mid C_m, O_m)/P(\mathbf{z})$. Replacing the numerator terms with an approximation $Q_\phi(\mathbf{z} \mid C_m, O_m)$, we can remove the priors in the denominator and obtain the following APoE approximate posterior:
$$
P(\mathbf{z} \mid C, O) \approx \prod_{m \in \mathcal{M}_S} Q_\phi(\mathbf{z} \mid C_m, O_m).
$$
Although the above product is intractable in general, a closed-form solution exists if each expert is a Gaussian (Wu & Goodman, 2018). The mean and covariance of the APoE are, respectively, $\mu = \big(\sum_m T_m\big)^{-1} \sum_m T_m \mu_m$ and $\Sigma = \big(\sum_m T_m\big)^{-1}$, where $\mu_m$ and $T_m$ are the mean and the inverse of the covariance of each expert. The posterior APoE is implemented by first encoding each $(C_m, O_m)$ into a modality-level representation and then feeding it, together with the modality-id $m$, into the amortized expert, which is a ConvDraw module in our implementation. The amortized expert outputs $\mu_m$ and $T_m$ for each modality while sharing the variational parameter $\phi$ across the modality experts. Fig. 2 compares the inference network architectures of CGQN, PoE, and APoE.
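The closed-form Gaussian fusion can be written compactly as follows (a small self-contained sketch of the precision-weighted combination from Wu & Goodman (2018); the tensor shapes and toy example values are ours):

```python
import torch

def product_of_gaussian_experts(mus, precisions):
    """Closed-form product of diagonal Gaussian experts:
    summed precisions and precision-weighted mean."""
    prec = torch.stack(precisions).sum(dim=0)                          # T = sum_m T_m
    mu = torch.stack([t * m for m, t in zip(mus, precisions)]).sum(dim=0) / prec
    return mu, 1.0 / prec                                              # mean, diagonal covariance

# Two experts: a confident image expert and an uncertain haptic expert.
mu, var = product_of_gaussian_experts(
    mus=[torch.tensor([0.0]), torch.tensor([2.0])],
    precisions=[torch.tensor([10.0]), torch.tensor([1.0])],
)
print(mu, var)   # mean pulled toward the confident expert; variance shrinks
```

As the example shows, the fused mean is pulled toward the more confident (higher-precision) expert and the fused variance is smaller than that of any individual expert.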
3 Related Works
Multimodal Generative Models. Multimodal data are associated with many interesting learning problems, e.g., cross-modal inference, zero-shot learning, or weakly-supervised learning. For these, latent variable models have provided effective solutions: from a model with a global latent variable shared among all modalities (Suzuki et al., 2016) to hierarchical latent structures (Hsu & Glass, 2018) and scalable inference networks with Product-of-Experts (PoE) (Hinton, 2002; Wu & Goodman, 2018; Kurle et al., 2018). In contrast to these works, the current study addresses two additional challenges. First, this work aims at achieving any-modal to any-modal conditional inference regardless of the modality configurations seen during training: it targets generalization under distribution shift at test time. In contrast, previous studies assumed full modality configurations in both training and test data. Second, the proposed model considers each source of information to be only partially observable, whereas prior work treated each modality-specific input as fully observable. As a result, the modality-agnostic metamodal representation is inferred from modality-specific representations, each of which is integrated from a set of partially observable inputs.
3D Representations and Rendering. Learning representations of 3D scenes or environments from partially observable inputs has been addressed by supervised learning (Choy et al., 2016; Wu et al., 2017; Shin et al., 2018; Mescheder et al., 2018), latent variable models (Eslami et al., 2018; Rosenbaum et al., 2018; Kumar et al., 2018), and generative adversarial networks (Wu et al., 2016; Rajeswar et al., 2019; Nguyen-Phuoc et al., 2019). The GAN-based approaches exploit domain-specific functions, e.g., 3D representations, 3D-to-2D projection, and 3D rotations; thus, they are hard to apply to non-visual modalities whose underlying transformations are unknown. On the other hand, neural latent variable models for random processes (Eslami et al., 2018; Rosenbaum et al., 2018; Kumar et al., 2018; Garnelo et al., 2018a, b; Le et al., 2018; Kim et al., 2019) have dealt with more general settings and studied order-invariant inference. However, these studies focus on the single-modality case, in contrast to our method, which addresses a new problem setting where qualitatively different information sources are available for learning the scene representations.
4 Experiment
The proposed model is evaluated with respect to the following criteria: (i) cross-modal density estimation in terms of log-likelihood, (ii) ability to perform cross-modal sample generation, (iii) quality of the learned representation, assessed via a downstream classification task, (iv) robustness to the missing-modality problem, and (v) space and computational cost.
To evaluate our model we have developed an environment, the Multisensory Embodied 3D-Scene Environment (MESE). MESE integrates MuJoCo (Todorov et al., 2012), MuJoCo HAPTIX (Kumar & Todorov, 2015), and the OpenAI gym (Brockman et al., 2016) for 3D scene understanding through multisensory interactions. In particular, the Johns Hopkins Modular Prosthetic Limb (MPL) (Johannes et al., 2011) from MuJoCo HAPTIX is used. The resulting MESE, equipped with vision and proprioceptive sensors, is particularly suitable for tasks related to human-like embodied multisensory learning. In our experiments, the visual input is an RGB image and the haptic input is a 132-dimensional vector consisting of the hand pose and touch senses. Our main task is similar to the Shepard–Metzler object experiments used in Eslami et al. (2018) but extends them with the MPL hand.
As a baseline model, we use a GQN variant (Kumar et al., 2018) (discussed in Section 2.3). In this model, following GQN, the representations from different modalities are summed and then given to a ConvDraw network. We also provide a comparison to the PoE version of the model in terms of computation speed and memory footprint. For more details on the experimental environments, implementations, and settings, refer to Appendix A.
Cross-Modal Density Estimation. Our first evaluation is cross-modal conditional density estimation, i.e., estimating the conditional log-likelihood of a target modality given context from the other modality. During training, we use both modalities for each sampled scene and use 0 to 15 randomly sampled context query–sense pairs for each modality. At test time, we provide a unimodal context from one modality and generate the other.
Fig. 3 shows results on 3 different experiments: (a) HAPTIC→GRAY, (b) HAPTIC→RGB, and (c) RGB→HAPTIC. Note that we include HAPTIC→GRAY, although GRAY images are not used during training, to analyze the effect of color in haptic-to-image generation. The APoE and the baseline are plotted in blue and orange, respectively. In all cases our model (blue) outperforms the baseline (orange). This gap is even larger when the model is provided a limited amount of context information, suggesting that the baseline requires more context to improve its representation. Specifically, in the fully cross-modal setting where the context does not include any target modality (the dotted lines), the gap is largest. We believe our model can better transfer modality-invariant representations from one modality to another. Also, when we provide additional context from the target modality (dashed, dash-dot, solid lines), our model still outperforms the baseline. This implies that our model can successfully incorporate information from different modalities without their interfering with each other. Furthermore, from Fig. 3(a) and (b), we observe that haptic information captures only shapes: the prediction in RGB has lower likelihood without any image in the context. However, for the GRAY image in (a), the likelihood approaches the upper bound.
Cross-Modal Generation. We now qualitatively evaluate the ability for cross-modal generation. Fig. 1 shows samples of our cross-modal generation for various query viewpoints. Here, we condition the model on 15 haptic context signals but provide only a single image. We note that the single image provides limited color information about the object (namely, that red and cyan are part of the object) and almost no information about the shape. We can see that the model is able to almost perfectly infer the shape of the object. However, it fails to predict the correct colors (Fig. 1(c)), which is expected given the limited visual information provided. Interestingly, the object part for which the context image provides color information has the correct colors, while other parts have random colors across samples, showing that the model captures the remaining uncertainty. Additional results provided in Appendix D suggest further that: (i) our model gradually aggregates evidence over many trials to improve predictions (Fig. S5), and (ii) our model successfully integrates distinctive multisensory information during inference (Fig. S6).
Classification. To further evaluate the quality of the modality-invariant scene representations, we test on a downstream classification task. We randomly sample 10 scenes and from each scene we prepare held-out query–sense pairs to use as input to the classifier. The models are then asked to classify which scene (1 out of 10) a given query–sense pair belongs to, using the log-likelihood-based classification rule described in Appendix C. To see how the provided multimodal context contributes to obtaining a useful representation for this task, we test the following three context configurations: (i) image–query pairs only, (ii) haptic–query pairs only, and (iii) all sensory contexts.
In Fig. 4, both models use contexts to classify scenes, and their performance improves as the number of contexts increases. APoE outperforms the baseline in classification accuracy, while both methods achieve a similar ELBO (see Fig. S4). This suggests that the representation of our model tends to be more discriminative than that of the baseline. In APoE, the results with an individual modality (image-only or haptic-only) are close to those with all modalities. The drop in performance with only haptic–query pairs is due to the fact that certain samples may have the same shape but different colors. On the other hand, the baseline performs worse when inferring the modality-invariant representation from a single sensory modality, especially for images. This demonstrates that the APoE model helps learn better representations for both modality-specific and modality-invariant tasks.
Missing-modality Problem. In practical scenarios, since it is difficult to assume that we always have access to all modalities, it is important for the model to learn when some modalities are missing. Here, we evaluate this robustness by providing unseen combinations of modalities at test time. This is done by limiting the set of modality combinations observed during training: for each scene, we provide only a subset of all modality combinations. At test time, the model is evaluated on every combination of all modalities, thus including settings not observed during training. As an example, for a total of 8 modalities², each scene in the training data may contain only one or two modalities. Fig. 5(a) and (b) show results with 8 modalities, while (c) and (d) show results with 14.
²E.g., the left and right halves of an image are treated as separate modalities.
Fig. 5(a) and (c) show results when a much more restricted number of modalities is available during training: 2 out of 8 and 4 out of 14 modalities, respectively. At test time, however, all combinations of modalities are used; we report performance both on the full set of modality configurations and on the limited configurations used during training. Fig. 5(b) and (d) show the opposite setting where, during training, a large number of modalities (e.g., 7–8 modalities) are always provided together for each scene. Thus, the model has never been trained on scenes with only one or two modalities, but we test on these configurations to assess its ability to perform generalized cross-modal generation. For more results, see Appendix E.
Overall, in all cases our model shows good test-time performance on the unseen context modality configurations, whereas the baseline model mostly either overfits severely (except in (c)) or converges slowly. This is because, in the baseline model, the summed representation of an unseen context configuration is itself likely to be unseen at test time, leading to overfitting. In contrast, our model, as a PoE, is robust to this problem since all experts must agree to produce a similar representation. The baseline in case (c) seems less prone to this problem but converged much more slowly; given this slow convergence, we believe it might still overfit in the end with a longer run.
Space and Time Complexity. The expert-amortizer of APoE significantly reduces the inherent space problem of PoE, although it still requires separate modality encoders. Specifically, in our experiments, for the 5-modality case PoE requires 53M parameters while APoE uses 29M; for 14 modalities, PoE uses 131M parameters while APoE uses only 51M. We also observe a reduction in computation time with APoE: for the 5-modality model, one iteration of PoE takes on average 790 ms while APoE takes 679 ms. This gap becomes more significant for 14 modalities, where PoE takes 2059 ms while APoE takes 1189 ms. This is partly due to the difference in the number of parameters. Moreover, unlike PoE, APoE can parallelize its expert computation via convolution. For more results, see Table 1 in the Appendix.
5 Conclusion
We propose the Generative Multisensory Network (GMN) for understanding 3D scenes via modality-invariant representation learning. In GMN, we introduce the Amortized Product-of-Experts (APoE) to deal with the missing-modality problem while resolving the space-complexity problem of the standard Product-of-Experts. In experiments on 3D scenes containing blocks of different shapes and a human-like hand, we show that GMN can generate any modality from any context configuration. We also show that the model with APoE learns better modality-agnostic representations, as well as modality-specific ones. To the best of our knowledge, this is the first exploration of multisensory representation learning with vision and haptics for generating 3D objects. Furthermore, we have developed a novel multisensory simulation environment, the Multisensory Embodied 3D-Scene Environment (MESE), which was critical to performing these experiments.
Acknowledgments
JL would like to thank ChinWei Huang, Shawn Tan, Tatjana Chavdarova, Arantxa Casanova, Ankesh Anand, and Evan Racah for helpful comments and advice. SA thanks Kakao Brain, the Center for Super Intelligence (CSI), and Element AI for their support. CP also thanks NSERC and PROMPT.
References
 Amos et al. (2018) Brandon Amos, Laurent Dinh, Serkan Cabi, Thomas Rothörl, Alistair Muldal, Tom Erez, Yuval Tassa, Nando de Freitas, and Misha Denil. Learning awareness models. In ICLR, 2018.
 Barsalou (2008) Lawrence W Barsalou. Grounded cognition. Annu. Rev. Psychol., 2008.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
 Burda et al. (2016) Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In ICLR, 2016.
 Chetlur et al. (2014) Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
 Choy et al. (2016) Christopher B Choy, Danfei Xu, Jun-Young Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016.
 Deneve & Pouget (2004) Sophie Deneve and Alexandre Pouget. Bayesian multisensory integration and crossmodal spatial links. Journal of PhysiologyParis, 98(13):249–258, 2004.
 Eslami et al. (2018) S. M. Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, Daan Wierstra, Koray Kavukcuoglu, and Demis Hassabis. Neural scene representation and rendering. Science, 2018.
 Garnelo et al. (2018a) Marta Garnelo, Dan Rosenbaum, Chris J. Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J. Rezende, and S. M. Ali Eslami. Conditional neural processes. arXiv preprint arXiv:1807.01613, 2018a.
 Garnelo et al. (2018b) Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018b.
 Gregor et al. (2016) Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. In NIPS, 2016.
 Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
 Hinton (2002) Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.
 Hsu & Glass (2018) Wei-Ning Hsu and James R. Glass. Disentangling by partitioning: A representation learning framework for multimodal sensory data. arXiv preprint arXiv:1805.11264, 2018.
 Johannes et al. (2011) Matthew S Johannes, John D Bigelow, James M Burck, Stuart D Harshbarger, Matthew V Kozlowski, and Thomas Van Doren. An overview of the developmental process for the modular prosthetic limb. Johns Hopkins APL Technical Digest, 2011.
 Kim et al. (2019) Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. In ICLR, 2019.
 Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Konečný et al. (2016) Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtarik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. In NIPS Workshop on Private MultiParty Machine Learning, 2016. URL https://arxiv.org/abs/1610.05492.
 Kumar et al. (2018) Ananya Kumar, S. M. Ali Eslami, Danilo J. Rezende, Marta Garnelo, Fabio Viola, Edward Lockhart, and Murray Shanahan. Consistent generative query networks. arXiv preprint arXiv:1807.02033, 2018.
 Kumar & Todorov (2015) Vikash Kumar and Emanuel Todorov. Mujoco HAPTIX: A virtual reality system for hand manipulation. In International Conference on Humanoid Robots, Humanoids, 2015.
 Kurle et al. (2018) Richard Kurle, Stephan Günnemann, and Patrick van der Smagt. Multi-source neural variational inference. arXiv preprint arXiv:1811.04451, 2018.
 Lake et al. (2015) Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015.
 Le et al. (2018) Tuan Anh Le, Hyunjik Kim, Marta Garnelo, Dan Rosenbaum, Jonathan Schwarz, and Yee Whye Teh. Empirical evaluation of neural process objectives. In NeurIPS Bayesian Workshop, 2018.
 Mescheder et al. (2018) Lars M. Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. arXiv preprint arXiv:1812.03828, 2018.
 Murray & Wallace (2011) Micah M Murray and Mark T Wallace. The neural bases of multisensory processes. CRC Press, 2011.
 Nguyen-Phuoc et al. (2019) Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised learning of 3D representations from natural images. arXiv preprint arXiv:1904.01326, 2019.
 Nickolls et al. (2008) John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with cuda. Queue, 6(2):40–53, March 2008. ISSN 15427730.
 Pascual-Leone & Hamilton (2001) Alvaro Pascual-Leone and Roy Hamilton. The metamodal organization of the brain. In Progress in Brain Research, volume 134, pp. 427–445. Elsevier, 2001.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPSW, 2017.
 Quiroga (2012) Rodrigo Quian Quiroga. Concept cells: the building blocks of declarative memory functions. Nature Reviews Neuroscience, 2012.
 Rajeswar et al. (2019) Sai Rajeswar, Fahim Mannan, Florian Golemo, David Vazquez, Derek Nowrouzezahrai, and Aaron Courville. Pix2scene: Learning implicit 3d representations from images. preprint https://openreview.net/forum?id=BJeem3C9F7, 2019.
 Ramachandran & Hirstein (1998) Vilayanur S Ramachandran and William Hirstein. The perception of phantom limbs: The D. O. Hebb lecture. Brain: A Journal of Neurology, 121(9):1603–1630, 1998.
 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Rohe & Noppeney (2016) Tim Rohe and Uta Noppeney. Distinct computational principles govern multisensory integration in primary sensory and association cortices. Current Biology, 26(4):509–514, 2016.
 Rosenbaum et al. (2018) Dan Rosenbaum, Frederic Besse, Fabio Viola, Danilo J. Rezende, and S. M. Ali Eslami. Learning models for visual 3d localization with implicit mapping. arXiv preprint arXiv:1807.03149, 2018.
 Saha & Dasgupta (2018) Olimpiya Saha and Prithviraj Dasgupta. A comprehensive survey of recent trends in cloud robotics architectures and applications. Robotics, 7(3):47, 2018.
 Shams & Seitz (2008) Ladan Shams and Aaron R Seitz. Benefits of multisensory learning. Trends in cognitive sciences, 12(11):411–417, 2008.
 Shin et al. (2018) Daeyun Shin, Charless C. Fowlkes, and Derek Hoiem. Pixels, voxels, and views: A study of shape representations for single view 3d object shape prediction. In CVPR, 2018.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
 Suzuki et al. (2016) Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891, 2016.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for modelbased control. In IROS, 2012.
 Wu et al. (2016) Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS, 2016.
 Wu et al. (2017) Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, Bill Freeman, and Josh Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In NIPS, 2017.
 Wu & Goodman (2018) Mike Wu and Noah Goodman. Multimodal generative models for scalable weaklysupervised learning. In NeurIPS, 2018.
 Yildirim (2014) Ilker Yildirim. From perception to conception: learning multisensory representations. University of Rochester, 2014.
Appendix A Experiments
We start by describing the Multisensory Embodied 3D-Scene Environment (MESE) and the simulated datasets used in our experiments. We then explain the training settings.
A.1 Multisensory Embodied 3D-Scene Environment (MESE)
Targeting a development environment for 3D scene understanding through interaction, we build a multisensory 3D scene environment equipped with visual and proprioceptive (haptic) sensors, called the Multisensory Embodied 3D-Scene Environment (MESE). MESE is similar to the Shepard–Metzler object experiments used in Eslami et al. (2018), but extends them with the MPL hand model of MuJoCo HAPTIX (Kumar & Todorov, 2015). The environment uses MuJoCo (Todorov et al., 2012) and the OpenAI gym (Brockman et al., 2016).
Scene. Adopted from Eslami et al. (2018), MESE generates a single Shepard–Metzler object with an arbitrary number of blocks per episode. Each block of the object is randomly colored in the HSV scheme. More precisely, hue and saturation are randomly selected within fixed ranges: hue is sampled from (0, 1) and saturation from (0, 0.75). Value (in HSV) is fixed to 1. The sampled HSV values are converted to RGB.
Image. An RGB camera is defined in the environment for visual input. The position of the camera and its facing direction are defined as actions for agents; we refer to the position and facing direction combined as a viewpoint. An RGB image is generated from a given viewpoint.
Haptic. For the proprioceptive (haptic) sense, the Johns Hopkins Modular Prosthetic Limb (MPL) (Johannes et al., 2011), part of MuJoCo HAPTIX, is used. The hand model generates a 132-dimensional observation consisting of its actuator positions, velocities, accelerations, and touch senses. For more details about the MPL hand, please refer to Appendix C of Amos et al. (2018). The MPL hand model has 13 degrees of freedom to control. MESE adds 5 degrees of freedom to control the position and facing direction of the hand's wrist, similar to the camera control.
A.2 Datasets
Given that each scene has a single object at the origin, images and haptic data are randomly generated. For an image, a camera viewpoint is sampled on a spherical surface with a fixed radius while the camera faces the object. We refer to the camera viewpoint as the image query.
For the haptic data in each scene, we first sample a wrist pose of the hand, similar to generating camera viewpoints. Given the sampled wrist pose, a fixed deterministic policy is executed: the policy starts from a stretched hand pose and gradually moves to a grabbing posture without any stochasticity. Note that a haptic datapoint is a function of the wrist pose and the object, given the aforementioned fixed policy; thus, the wrist's position and facing direction serve as the haptic query. Each dimension of the haptic data is rescaled to a fixed range.
For the two-modality environment (image and haptic), 1M scenes are collected as training data. For each scene, a Shepard–Metzler object with 5 parts is randomly sampled as described in Section A.1. The number of unique shapes is 728 for the 5-part object dataset. In each scene, 15 queries and their corresponding sensory outputs are randomly sampled for each sensory modality. For validation and test data, 20k and 100k scenes are sampled, respectively.
For environments with more than two modalities, we slice the dimensions of the image and haptic data. For example, to build the 5-modality environment, the image is split into its four quadrants, so each visual sense is one quadrant of the image; in addition to these four visual modalities, the haptic input is provided. Note that while we split each image into four, the corresponding experiment differs from image inpainting or denoising tasks. In those image tasks, the statistical regularities of images are heavily exploited, i.e., the statistics of local receptive fields are almost identical regardless of position, and many recent solutions rely on convolutional architectures as a practical way of sharing model parameters across arbitrary locations. As long as this inductive bias is not built into any model, the quadrants are distinct random variables, each with different statistical characteristics; thus, they can be treated as multiple modalities.
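A minimal sketch of this quadrant slicing is shown below (illustrative only; the quadrant names and the 64×64 resolution are our own assumptions):

```python
import numpy as np

def image_to_quadrant_modalities(image):
    """Split an RGB image into four quadrant 'modalities' (sketch of the
    5-modality environment: four image quadrants plus the haptic sense)."""
    h, w = image.shape[0] // 2, image.shape[1] // 2
    return {
        "img_top_left": image[:h, :w], "img_top_right": image[:h, w:],
        "img_bot_left": image[h:, :w], "img_bot_right": image[h:, w:],
    }

quadrants = image_to_quadrant_modalities(np.zeros((64, 64, 3)))
```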
For the 8-modality environment, the image is cropped and resized to reduce memory overhead. The image is then split into left and right halves for each RGB channel, yielding six visual senses. The haptic dimensions are also divided into two parts: one corresponds to the thumb, index, and middle fingers; the other corresponds to the ring and little fingers, as well as the palm.
For the 14-modality environment, the image is processed as in the 8-modality case but sliced further, yielding twelve visual senses. Together with the haptic data divided into the same two parts, we obtain an environment with 14 modalities.
A.3 Training
For training, the Adam optimizer is used (Kingma & Ba, 2014). β annealing³ is employed: β is set to 0.11 for the first epoch and kept at 1 for the rest of training. The learning rate is set to 0.0001. For stable training, gradients are clipped. Training is run for 10 epochs. The minibatch size is set to 14 scenes for the two-modality environment and 24 scenes for the 5-, 8-, and 14-modality environments.
³Here, β annealing refers to annealing the weight on the KL term of the ELBO, as done in Higgins et al. (2017).
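These settings roughly correspond to a training loop of the following shape (a skeleton sketch only: the model, data loader, and gradient-clipping threshold are placeholders, since the clipping value is not stated here):

```python
import torch

model = torch.nn.Linear(10, 10)                   # stand-in for GMN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def kl_weight(epoch):
    # Beta annealing on the KL term of the ELBO.
    return 0.11 if epoch == 0 else 1.0

for epoch in range(10):
    for nll, kl in []:                            # placeholder for minibatch losses
        loss = nll + kl_weight(epoch) * kl        # negative ELBO with annealed KL
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # assumed threshold
        optimizer.step()
```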
Appendix B Network Architectures
Overall. We adopt the CGQN network architecture from Kumar et al. (2018) for the proposed model as well as the baseline. This architecture can be thought of as a modified version of the ConvDraw encoder–decoder in which, unlike the original (Gregor et al., 2016), the posteriors do not have feedback routes carrying the predicted inputs and the residuals between the targets and the predictions. As a result, the same input is repeatedly provided at every step of the ConvLSTM iterations instead (see Fig. S3(a)).
As the baseline, we use the CGQN network, whose generation process is depicted in Fig. S2(a). Each instance of the m-th modality's query–sense pairs is fed to the m-th representation network. All resulting representations are summed to obtain a single representation, from which the metamodal scene representation is inferred using the CGQN decoder (or encoder during inference). Conditioned on this representation and a query, a sensory datapoint is generated using the renderer for the m-th modality.
For APoE, the multiple experts are modeled as a single network, called the expert-amortizer, in which a binary mask identifying the modality is used while inferring the latent representation (the modality-id in the APoE prior is provided as a binary mask). The expert-amortizer is built upon further modifications of the modified ConvDraw, as shown in Fig. S3(b). For efficient computation, the expert-amortizers are implemented such that they perform convolution over the modality representations in parallel.
See Fig. S2(b) for APoE's generation process. As in the baseline, each instance of the m-th modality's query–sense pairs is fed to the m-th representation network, and the outputs are summed to obtain the modality-specific representation. However, the metamodal scene representation is inferred via a product of experts using the expert-amortizer network.
For PoE, each expert is modeled as a separate ConvDraw encoder–decoder with its corresponding modality encoder; the rest of the implementation is identical to APoE.
Representation Network. To estimate the modality-specific representation for each instance of a query–sense pair, the tower representation network proposed in Eslami et al. (2018) is used. For camera-viewpoint–image pairs, convolutional layers are used in the tower representation network; for a haptic observation and its corresponding query, an MLP is applied instead. The same representation network architectures are used for the baseline, PoE, and APoE.
Renderer. The renderer network is the part of the decoder that predicts each sensory modality. Each renderer takes a query and the modality-agnostic latent representation as inputs and outputs the sensory data conditioned on them. For rendering images, a ConvLSTM with convolutional layers is used, as in CGQN. Similarly, a ConvLSTM is used for the proprioceptive (haptic) sense, but an MLP is employed in place of the convolutional layers.
Appendix C Classification
For classification, we adopt the method of Lake et al. (2015). Suppose we have $K$ context sets $C^{(k)}$, $k = 1, \dots, K$, each of which is a set of multisensory query–sense pairs from one scene. Given an observation set obtained from one of the scenes (more precisely, objects), we can predict from which scene the new observation set comes. The predicted label is obtained by
$$
\hat{k} = \arg\max_k \log P_\theta(X \mid V, C^{(k)}),
$$
where each log-likelihood is estimated using the importance-weighted estimator of Burda et al. (2016). This method does not require any additional training. To approximate each log-likelihood, 50 latent samples are used.
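In code, this classification rule amounts to scoring the observation against each candidate context and taking the arg-max (a sketch only; `log_likelihood_fn` is assumed to wrap the trained model's importance-weighted log-likelihood estimator and is not defined here):

```python
import torch

def classify_scene(log_likelihood_fn, observation, contexts, num_samples=50):
    """Pick the scene whose context best explains the observation:
    k_hat = argmax_k log P(X | V, C^(k)), with each term approximated by an
    importance-weighted estimate over `num_samples` latent samples."""
    scores = [log_likelihood_fn(observation, c, num_samples) for c in contexts]
    return int(torch.tensor(scores).argmax())
```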
For the held-out dataset, 1000 additional Shepard–Metzler objects with 4 or 6 parts are generated; none of these objects appears in the training dataset. $K$ is set to 10. For all models, three inference scenarios are considered: classification is performed using (i) only image–query pairs from each scene, (ii) haptic-only contexts, and (iii) both sensory contexts.
The results are shown in Fig. S4. To verify that both models are well trained and have converged on the training dataset, the learning curves of the baseline and APoE models used for classification are also included.
Appendix D Cross-modal Generation
D.1 Reducing Uncertainty with Aggregation of Evidence
In this task, we examine the uncertainty of the modality-agnostic representation with respect to the number of contexts. Similar to Fig. 1, we provide a single image context but condition the trained model on different numbers of haptic contexts. More precisely, the image context is chosen such that the model cannot recover the entire scene from the image alone.
The generated image samples are shown in Fig. S5(a). As the number of haptic contexts increases, more accurate visual observations are predicted. We can also observe that the generated images in each column improve over the previous column, corresponding to the additional haptic information provided. Again, we observe that the part of the object for which the context image provides color information has consistent colors, while other parts have random colors.
Fig. S5(b) shows the generated haptic samples for the same query, together with the 95% confidence interval computed from 20 samples. Similar to the visual prediction, the haptic prediction improves as the number of haptic contexts increases. In addition, the uncertainty of the prediction decreases as more contexts are aggregated.

D.2 Any-to-any Cross-modal Generation
Additional cross-modal generation experiments are performed for the multi-modality environments in order to explore multisensory integration under arbitrary context conditions. Given any context condition, a trained model is asked to generate outputs for all modalities (for a given set of queries), which are then combined for display. For instance, a model trained in the 14-modality environment generates outputs in all 14 modalities; the visual outputs are combined and displayed as shown in Fig. S6(d). The haptic outputs are omitted to conserve space.
Three different context conditions are applied for each environment, ranging from haptic-only contexts to contexts that include progressively more of the visual modalities. Each context modality is provided with 5 query–sense pairs.
The results are shown in Fig. S6(b)–(d), and the ground-truth images are given in Fig. S6(a). The provided context senses are illustrated in the first and second rows of each experiment. In general, haptic-related contexts are sufficient for the learned models to infer the shapes. With additional visual cues, the models start to correctly predict colors. For example, in the middle column of Fig. S6(c) and (d), red-mixed colors are successfully inferred from a single color-channel context; however, the model still fails to predict all color patterns, as the other color information is missing. As more color information is given, our models successfully predict all color patterns, as shown in the right column of Fig. S6(c) and (d).
Appendix E Missing-modality Problem
In addition to the experiments described in Fig. 5, more results are provided in Fig. S7. Note that the training loss is evaluated as a moving average over minibatches, while the validation loss is estimated on the whole batch at the end of each epoch. This explains why the validation loss is sometimes lower than the training loss in the figures.
In general, all models tend to underfit when they have never seen the entire set of modalities during training. On the other hand, models exposed to many modalities tend to yield a tighter negative ELBO.
We observe a notable difference between the baseline and APoE in the settings where individual modalities were withheld during training. Combined with the classification results in Fig. S4, we interpret this as evidence that the PoE structure helps train the individual experts. Inference in a PoE has been understood as an agreement among all experts (Hinton, 2002); therefore, each expert becomes capable of performing inference independently as well as expressing its own uncertainty. In contrast, the simple sum operation of CGQN (the baseline) likely ends up relying on dominating signals and ignoring the rest, which drives it to overfit to the training distribution.
Appendix F Computational Time
Table 1: Number of parameters and computation time per iteration for 2, 5, 8, and 14 modalities.

Model      # of parameters (2 / 5 / 8 / 14 modalities)    time per iter in ms (2 / 5 / 8 / 14 modalities)
baseline   53M / 28M / 48M / 51M                           346 / 397 / 481 / 866
APoE       53M / 29M / 48M / 51M                           587 / 679 / 992 / 1189
PoE        58M / 53M / 92M / 131M                          486 / 790 / 1459 / 2059
Table 1 shows the number of parameters and the computational time cost for all experiments. Each experiment is run on a single NVIDIA Tesla P100 GPU and four cores of an Intel Xeon E5-2650 2.20GHz CPU. PyTorch (Paszke et al., 2017), CUDA 9.0 (Nickolls et al., 2008), and cuDNN 7 (Chetlur et al., 2014) are used for the implementations. All models share the same representation and renderer network architectures, and the same number of steps and hidden sizes are used in the encoder and decoder architectures. For a fair comparison, the minibatch size is set to 1 when measuring the costs.
In PoE, each expert contains a large network such as ConvDraw, so the space cost of the inference networks grows with the number of modalities. In APoE, the inference networks are integrated into a single expert-amortizer serving all modalities, so this space cost becomes essentially constant in the number of modalities (apart from the light-weight modality encoders). As a result, the APoE model's parameter count in our experiments is almost the same as the baseline's, while it still provides the probabilistic information integration of PoE.