Mix and match networks: encoder-decoder alignment for zero-pair image translation
We address the problem of image translation between domains or modalities for which no direct paired data is available (i.e. zero-pair translation). We propose mix and match networks, based on multiple encoders and decoders aligned in such a way that other encoder-decoder pairs can be composed at test time to perform unseen image translation tasks between domains or modalities for which explicit paired samples were not seen during training. We study the impact of autoencoders, side information and losses in improving the alignment and transferability of trained pairwise translation models to unseen translations. We show our approach is scalable and can perform colorization and style transfer between unseen combinations of domains. We evaluate our system in a challenging cross-modal setting where semantic segmentation is estimated from depth images, without explicit access to any depth-semantic segmentation training pairs. Our model outperforms baselines based on pix2pix and CycleGAN models.
Image-to-image translations (or simply image translations) are an integral part of many computer vision systems. They include transformations between different modalities, such as from RGB to depth, or domains, such as luminance to color images or horses to zebras, or editing operations such as artistic style changes. These mappings can also include 2D label representations such as semantic segmentations or surface normals. Deep networks have shown excellent results in learning models to perform image translations between different domains and modalities [1, 10, 21]. These systems are typically trained with pairs of matching images between domains, e.g. an RGB image and its corresponding depth image.
Image translation methods, which transfer images from one domain to another, are often based on encoder-decoder frameworks [1, 10, 21, 34]. In these approaches, an encoder network maps the input image from domain A to a continuous vector in a latent space. From this latent representation the decoder generates an image in domain B. The latent representation is typically much smaller than the original image size, thereby forcing the network to learn to efficiently compress the information from domain A which is relevant for domain B into the latent representation. Encoder-decoder networks are trained end-to-end by providing the network with matching pairs from both domains or modalities. An example could be learning a mapping from RGB to depth. Other applications include semantic segmentation and image restoration.
In this paper we introduce zero-pair image translation: a new setting for testing image translations which involves evaluating on unseen translations, i.e. no matching image or dataset pairs are available during training (see Figure 1a). Note that this setting is different from unpaired image translation [34, 14, 20], which is evaluated on the same paired domains seen during training.
We also propose mix and match networks, an approach that addresses zero-pair image translation by seeking alignment between encoders and decoders via their latent spaces. An unseen translation between two domains is performed by simply concatenating the input domain encoder and the output domain decoder (see Figure 1b). We study several techniques that can improve this alignment, including the usage of autoencoders, latent space consistency losses and the usage of pooling indices as side information to guide the reconstruction of spatial structure. We evaluate this approach in a challenging cross-modal task, where we perform zero-pair depth to semantic segmentation translation, using only RGB to depth and RGB to semantic segmentation pairs during training.
Finally, we show that aligned encoder-decoder networks also have advantages in domains with unpaired data. In this case, we show that mix and match networks scale better with the number of domains, since they are not required to learn all pairwise image translation networks (i.e. scales linearly instead of quadratically). The code is available at http://github.com/yaxingwang/Mix-and-match-networks
2 Related Work
Recently, generic encoder-decoder architectures have achieved impressive results in a wide range of transformations between images. Isola et al. [10] trained from pairs of input and output images to learn a variety of image translations (e.g. color, style), using an adversarial loss. These models require paired training data to be available (i.e. paired image translation). Various works extended this idea to the case where no explicit input-output image pairs are available (unpaired image translation), using the idea of cyclic consistency [34, 14]. Liu et al. [20] show that unsupervised mappings can be learned by imposing a joint latent space between the encoder and the decoder. In this work we consider the case where paired data is available between some domains or modalities and not available between others (i.e. zero-pair), and how this knowledge can be transferred to those zero-pair cases. In concurrent work, Choi et al. [4] also address scaling to multiple domains (always in the RGB modality) by using a single encoder-decoder model. In contrast, our approach uses multiple cross-aligned encoders and decoders. Our cross-modal setting also requires deeper structural changes and modality-specific encoder-decoders.
Encoder-decoder networks can be extended into multi-way encoder-decoder networks by adding encoders and/or decoders for multiple domains together. Recently, joint encoder-decoder architectures have been used in multi-task settings, where the network is trained to perform multiple tasks (e.g. depth estimation, semantic segmentation, surface normals) [5, 13], and multimodal settings, where the input data can come from different modalities or even combine several of them.
Training a multimodal encoder-decoder network was recently studied in . They use a joint latent representation space for the various modalities. In our work we consider the alignment and transferability of pairwise image translations to unseen translations, rather than joint encoder-decoder architectures. Another multimodal encoder-decoder network was studied in . They show that multimodal autoencoders can address the depth estimation and semantic segmentation tasks simultaneously, even in the absence of some of the input modalities. All these works do not consider the zero-pair image translation problem addressed in this paper.
In conventional supervised image recognition, the objective is to predict the class label that is provided during training [18, 8]. However, this poses limitations in scalability to new classes, since new training data and annotations are required. In zero-shot learning, the objective is to predict an unknown class for which there is no image available, but a description of the class (i.e. class prototype). This description can be a set of attributes (e.g. has wings, blue, four legs, indoor) [18, 11], concept ontologies [6, 27] or textual descriptions. In general, an intermediate semantic space is leveraged as a bridge between the visual features from seen classes and class description from unseen ones. In contrast to zero-shot recognition, we focus on unseen translations (unseen input-output pairs rather than simply unseen class labels).
Zero-pair language translation
Evaluating models on unseen language pairs has been studied recently in machine translation [12, 3, 33, 7]. Johnson et al. [12] proposed a neural language model that can translate between multiple languages, even between pairs of languages for which no explicit paired sentences were provided. (Johnson et al. refer to this as zero-shot translation. In this paper we refer to this setting as zero-pair, to emphasize that what is unseen is paired data and to avoid ambiguity with traditional zero-shot recognition, which typically refers to unseen samples.) In their method, the encoder, decoder and attention are shared. In our method we focus on images, a radically different type of data with two-dimensional structure, in contrast to the sequential structure of language.
3 Encoder-decoder alignment
3.1 Multi-domain image translation
We consider the problem of image translation from domain $\mathcal{X}^{(i)}$ to domain $\mathcal{X}^{(j)}$ as $T_{ij}: x^{(i)} \in \mathcal{X}^{(i)} \mapsto x^{(j)} \in \mathcal{X}^{(j)}$. In our case it is implemented as an encoder-decoder chain $x^{(j)} = T_{ij}(x^{(i)}) = g_j(f_i(x^{(i)}))$ with encoder $f_i$ and decoder $g_j$ (see Figure 1). The domains connected during training are all trained jointly, and in both directions. It is important to note that for each domain one encoder and one decoder are trained. By training all these encoders and decoders jointly the latent representation is encouraged to align. As a consequence of the alignment of the latent space we can do zero-pair translation at testing time between the domains for which no training pairs were available. The main aim of this article is to analyze to what extent this alignment allows for zero-pair image translation.
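The composition of encoders and decoders through a shared latent space can be illustrated with toy stand-ins for the networks. This is only a sketch under strong simplifying assumptions, not the implementation used in the paper: the `encoders` and `decoders` dictionaries, the `translate` helper and the scalar "images" are all hypothetical, and the alignment is built in by construction rather than learned.

```python
# Toy illustration of mix-and-match composition (all names are hypothetical).
# Each "domain" stores the same underlying scene value under a different scaling.
# Encoders undo the scaling (shared latent space), decoders re-apply it.
SCALES = {"rgb": 1.0, "depth": 10.0, "segm": 100.0}

encoders = {name: (lambda x, s=scale: x / s) for name, scale in SCALES.items()}
decoders = {name: (lambda z, s=scale: z * s) for name, scale in SCALES.items()}


def translate(x, src, dst):
    """Chain the encoder of the input domain with the decoder of the output domain."""
    z = encoders[src](x)        # shared latent representation
    return decoders[dst](z)


# Pairs that would be seen in training: rgb<->depth and rgb<->segm.
# The depth->segm chain is only composed at test time.
depth_image = 20.0
segm_prediction = translate(depth_image, "depth", "segm")
rgb_reconstruction = translate(5.0, "rgb", "rgb")   # autoencoder path
print(segm_prediction, rgb_reconstruction)
```

In a real system the alignment is not guaranteed by construction, which is why the techniques of the next subsection are needed.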
3.2 Aligning for zero-pair translation
Zero-pair translation in images is especially challenging due to the inherent complexity of images, especially in multimodal settings. Ideally, a good latent representation that also works in unseen translations should be not only well-aligned but also unbiased to any particular domain. In addition, the encoder-decoder system should be able to preserve the spatial structure, even in unseen transformations.
Autoencoders One way to improve alignment is by jointly training domain-specific autoencoders with the image translation networks. By sharing the weights between the auto-encoders and the image translation encoder-decoders the latent space is forced to align.
Latent space consistency The latent space can be enforced to be invariant across multiple domains. Taigman et al. [30] use the L2 distance between a latent representation and its reconstruction after another decoding and encoding cycle. When paired samples are available, we propose using cross-domain latent space consistency in order to enforce the latent representations of the two paired samples to be aligned.
Preserving spatial structure using side information In general, image translation tasks result in output images with similar spatial structure as the input ones, such as scene layouts, shapes and contours that are preserved across modalities. In fact, this spatial structure available in the input image is critical to simplify the problem and achieve good results, and successful image translation methods often use multi-scale intermediate representations from the encoder as side information to guide the decoder in the upsampling process. Skip connections are widely used for this purpose. However, conditional decoders using skip connections expect specific information from a particular domain-specific encoder that would be unlikely to work in unseen encoder-decoder pairs. Motivated by efficiency, pooling indices  were recently proposed as a compact descriptor to guide the decoder in lightweight models. We show here that pooling indices constitute robust and relatively encoder-independent side information, suitable for improving decoding even in unseen translations.
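To make the role of pooling indices concrete, the following sketch implements 2x2 max pooling that records the position of each maximum, and the matching unpooling step that places decoder values at those positions. It is a pure-Python illustration on nested lists; the function names and the toy values are assumptions for this example, and a real decoder would interleave unpooling with learned convolutions.

```python
def max_pool_2x2(image):
    """2x2 max pooling with stride 2; returns pooled values and argmax positions."""
    pooled, indices = [], []
    for r in range(0, len(image), 2):
        pooled_row, index_row = [], []
        for c in range(0, len(image[0]), 2):
            window = [(image[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(values, indices, height, width):
    """Place each value at the position recorded by the encoder; zeros elsewhere."""
    output = [[0.0] * width for _ in range(height)]
    for value_row, index_row in zip(values, indices):
        for value, (r, c) in zip(value_row, index_row):
            output[r][c] = value
    return output


encoder_input = [[1.0, 2.0, 0.0, 0.0],
                 [3.0, 4.0, 0.0, 5.0],
                 [0.0, 0.0, 6.0, 0.0],
                 [7.0, 0.0, 0.0, 0.0]]
pooled, indices = max_pool_2x2(encoder_input)

# A decoder (possibly of another modality) produces its own low-resolution
# values and reuses the encoder's indices to decide where to place them.
decoder_values = [[10.0, 20.0], [30.0, 40.0]]
upsampled = max_unpool_2x2(decoder_values, indices, 4, 4)
```

Note that the decoder only receives positions, not feature values, from the encoder. This is one way to read the claim that pooling indices are less encoder-specific than skip connections.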
Adding noise to latent space We found that adding some noise at the output of each encoder also helps to train the network and improves the results during test. This seems to help in obtaining more invariance in the common latent representation and better alignment across modalities.
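A minimal sketch of this noise injection is given below. The standard deviation of 0.5 is the value reported in the training details of this paper; the helper name `add_latent_noise`, the flat-list latent and the choice to disable the noise outside training are assumptions made for illustration.

```python
import random


def add_latent_noise(latent, std=0.5, training=True, rng=random):
    """Add zero-mean Gaussian noise to each latent entry (assumed training-only)."""
    if not training:
        return list(latent)
    return [value + rng.gauss(0.0, std) for value in latent]


rng = random.Random(0)            # fixed seed so the sketch is reproducible
latent = [0.3, -1.2, 0.0, 2.5]    # hypothetical encoder output
noisy = add_latent_noise(latent, rng=rng)
clean = add_latent_noise(latent, training=False)
```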
3.3 Scalable image translation
One of the advantages of our mix and match networks is that the system can infer many pairwise domain-to-domain translations when the number of domains is high, without explicitly training them. Other pairwise methods where encoders and decoders are not cross-aligned, such as CycleGAN, would require training a number of encoders and decoders that grows quadratically with the number of domains. For mix and match networks each encoder and decoder should be involved in at least one translation pair during training in order to be aligned with the others, thereby reducing complexity from quadratic to linear with the number of domains (i.e. encoders/decoders).
4 Zero-pair cross-modal image translation
We propose a challenging cross-modal setting to evaluate zero-pair image translation involving three modalities: RGB, depth and semantic segmentation. (Here the term modality plays the same role as domain in the previous section. For simplicity, we refer to semantic segmentation maps and depth as modalities rather than tasks.) It is important to observe that a setting involving heterogeneous modalities (in terms of complexity, number and meaning of different channels, etc.) is likely to require modality-specific architectures and losses.
4.1 Problem definition
We consider the problem of jointly learning RGB-to-segmentation translation and RGB-to-depth translation, and evaluating on an unseen transformation, depth-to-segmentation. The first translation is learned from a semantic segmentation dataset with pairs of RGB images and segmentation maps, and the second from a disjoint RGB-D dataset with pairs of RGB and depth images. Therefore no depth image and segmentation map pairs are available to the system. However, note that the RGB images from both datasets could be combined if necessary. The system is evaluated on a third dataset with paired depth images and segmentation maps.
4.2 Mix and match networks architecture
The overview of the framework is shown in Fig. 2. As basic building blocks we use three modality-specific encoders (for RGB, depth and semantic segmentation, respectively) and the corresponding three modality-specific decoders, all of which operate on a latent representation in a shared space. The required translations are implemented by chaining the encoder of the input modality with the decoder of the output modality.
Encoder and decoder weights are shared across the different translations involving the same modalities (same color in Fig. 2). To enforce better alignment between encoders and decoders of the same modality, we also include self-translations using the corresponding three autoencoders, each formed by the encoder and the decoder of the same modality.
We base our encoders and decoders on the SegNet architecture [1]. The encoder of SegNet is itself based on the 13 convolutional layers of the VGG-16 architecture [29]. The decoder mirrors the encoder architecture with 13 deconvolutional layers. All encoders and decoders are randomly initialized except for the RGB encoder, which is pretrained on ImageNet.
As in SegNet, pooling indices at each downsampling layer of the encoder are provided to the corresponding upsampling layer of the (seen or unseen) decoder. (The RGB decoder does not use pooling indices, since in our experiments we observed undesired grid-like artifacts in the RGB output.) These pooling indices seem to be relatively similar across the three modalities and effective at transferring spatial structure information, which helps to obtain better depth and segmentation boundaries at higher resolutions. Thus, they provide relatively modality-independent side information.
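As a sanity check on the encoder layout, the sketch below walks through the 13 convolutional layers of VGG-16 and its five 2x2 max-pooling stages and tracks the feature-map shape. The layer widths follow the standard VGG-16 configuration; the 256x256 input resolution is taken from the appendix tables, and the helper name `encoder_shapes` is hypothetical.

```python
# Standard VGG-16 convolutional layout: channel widths, with "pool" marking
# a 2x2 max pooling of stride 2 (which also emits pooling indices in SegNet).
VGG16_CONV_LAYOUT = [64, 64, "pool", 128, 128, "pool", 256, 256, 256, "pool",
                     512, 512, 512, "pool", 512, 512, 512, "pool"]


def encoder_shapes(height, width, in_channels):
    """Return the (height, width, channels) shape after every layer."""
    shapes = []
    channels = in_channels
    for item in VGG16_CONV_LAYOUT:
        if item == "pool":
            height, width = height // 2, width // 2   # halves the resolution
        else:
            channels = item                            # 3x3 conv, stride 1
        shapes.append((height, width, channels))
    return shapes


shapes = encoder_shapes(256, 256, 3)    # RGB input, as in the appendix tables
num_convs = sum(1 for item in VGG16_CONV_LAYOUT if item != "pool")
print(num_convs, shapes[-1])
```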
4.3 Loss functions
As we saw before, a correct cross-alignment between encoders and decoders that have not been connected during training is critical for zero-pair translation. The final loss combines a number of modality-specific losses for both cross-domain translation and self-translation (i.e. autoencoders) and alignment constraints in the latent space
We use a combination of L2 distance and adversarial loss . L2 distance is used to compare the estimated and the ground truth RGB image after translation from a corresponding depth or segmentation image. It is also used in the RGB autoencoder
where is the resulting distribution of the combined images generated by , and . Note that the RGB autoencoder and the discriminator are both trained with the combined RGB data .
For depth we use the Berhu loss  in both RGB-to-depth translation and in the depth autoencoder
where is the average Berhu loss.
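For reference, a minimal sketch of an averaged berHu (reverse Huber) loss follows. It uses the common definition in which residuals below a threshold c are penalized linearly and larger ones quadratically. Setting c to 0.2 times the largest absolute residual is an assumption borrowed from common practice in depth estimation, since the threshold is not stated in this text; the function name and the toy values are hypothetical.

```python
def berhu_loss(predicted, target, threshold_fraction=0.2):
    """Average berHu loss over flat lists of predicted and target depths."""
    residuals = [abs(p - t) for p, t in zip(predicted, target)]
    c = threshold_fraction * max(residuals)   # assumed threshold rule
    if c == 0.0:
        return 0.0                            # perfect prediction
    total = 0.0
    for r in residuals:
        if r <= c:
            total += r                        # L1 regime for small errors
        else:
            total += (r * r + c * c) / (2.0 * c)   # L2 regime for large errors
    return total / len(residuals)


loss = berhu_loss([1.0, 2.0, 5.0], [1.0, 1.5, 0.0])
```

Both branches agree at a residual equal to c, so the loss is continuous at the threshold.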
For segmentation we use the average cross-entropy loss in both RGB-to-segmentation translation and the segmentation autoencoder
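The average cross-entropy over pixels can be sketched as follows, assuming the decoder outputs a normalized probability vector per pixel. The function name and the flat list-of-pixels layout are illustrative choices, not part of the described system.

```python
import math


def average_cross_entropy(probabilities, labels):
    """Mean negative log-probability of the ground-truth class over all pixels.

    probabilities: one list of class probabilities per pixel.
    labels: one ground-truth class index per pixel.
    """
    losses = [-math.log(pixel_probs[label])
              for pixel_probs, label in zip(probabilities, labels)]
    return sum(losses) / len(losses)


ce = average_cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```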
Latent space consistency
We enforce latent representations to remain close independently of the encoder that generated them. In our case we have two consistency losses
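A possible form of such a consistency term is the mean squared distance between the latent codes that two encoders produce for a paired sample. The sketch below assumes that the two losses correspond to the two paired settings available in training (RGB with segmentation, and RGB with depth). Both this pairing and the helper name are assumptions, since the exact formulas are not reproduced in this text.

```python
def latent_consistency(latent_a, latent_b):
    """Mean squared L2 distance between two latent vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(latent_a, latent_b)) / len(latent_a)


# Hypothetical latent codes for one RGB/segmentation pair and one RGB/depth pair.
z_rgb_1, z_segm = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
z_rgb_2, z_depth = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]

loss_rgb_segm = latent_consistency(z_rgb_1, z_segm)
loss_rgb_depth = latent_consistency(z_rgb_2, z_depth)
total_latent_loss = loss_rgb_segm + loss_rgb_depth
```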
5 Experimental Results
To the best of our knowledge there is no existing work which reports results for the setting of zero-pair image translation. In particular, we evaluate the proposed mix and match networks on zero-pair translation for semantic segmentation from depth images (and vice versa), and we show results for semantic segmentation from multimodal data. Finally, we illustrate the possibility to perform zero-pair translations for unpaired datasets, and the advantage of mix and match networks in terms of scalability.
5.1 Datasets and experimental settings
SceneNetRGBD The SceneNetRGBD dataset [24] consists of 16865 synthesized train videos and 1000 test videos. Each of them includes 300 matching triplets (RGB, depth and segmentation map), with a size of 320x240 pixels. In our experiments, we use two subsets as our datasets:
51K dataset: the train set is selected from the first 50 frames from each of the first 1000 videos in the train set. The test set is collected by selecting the 60th frame from the same 1000 videos. This dataset was used to evaluate some of the architecture design choices.
170K dataset: We collected a larger dataset which consists of 150K triplets for the train set, 10K triplets for the validation set and 10K triplets for the test set. For the train set, we select 10 triplets from each of the first 15,000 training videos, sampled from the first frame to the last frame every 30 frames. The validation set is taken from the remaining videos of the train set of SceneNetRGBD, and the test set is taken from the test dataset.
Each train set is divided into two equal non-overlapping splits from different videos. Although the collected splits contain triplets, we only use pairs to train our model.
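The sampling and splitting just described can be sketched as below. The frame count per video and the 30-frame step come from the text; the helper names, the use of integer video identifiers and the decision to assign the first half of the videos to one split are assumptions for illustration.

```python
FRAMES_PER_VIDEO = 300   # each SceneNetRGBD video has 300 triplets


def sample_frames(step=30):
    """Frame indices taken from the first to the last frame every `step` frames."""
    return list(range(0, FRAMES_PER_VIDEO, step))


def split_videos(video_ids):
    """Divide videos into two equal, non-overlapping splits."""
    half = len(video_ids) // 2
    return video_ids[:half], video_ids[half:]


frames = sample_frames()
split_a, split_b = split_videos(list(range(1000)))   # hypothetical video ids

# Although each sample is a triplet, only one pairing is used per split,
# e.g. (RGB, depth) from split_a and (RGB, segmentation) from split_b.
pairs_a = [(video, frame, ("rgb", "depth")) for video in split_a for frame in frames]
pairs_b = [(video, frame, ("rgb", "segm")) for video in split_b for frame in frames]
```

Because the two splits come from different videos, no depth and segmentation pair of the same frame is ever presented to the model.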
Following common practice in these tasks, for segmentation we compute the intersection over union (IoU) and report the per-class average (mIoU) and the global score, which gives the percentage of correctly classified pixels. For depth we also include a quantitative evaluation, following the standard error metrics for depth estimation.
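Both segmentation scores can be computed from flat lists of predicted and ground-truth labels, as in the sketch below. Skipping classes that appear in neither the prediction nor the ground truth when averaging is an assumption of this example, as are the function name and the toy labels.

```python
def segmentation_scores(predicted, target, num_classes):
    """Return (mIoU, global accuracy) for flat lists of per-pixel class labels."""
    correct = sum(1 for p, t in zip(predicted, target) if p == t)
    global_accuracy = correct / len(target)

    ious = []
    for k in range(num_classes):
        intersection = sum(1 for p, t in zip(predicted, target) if p == k and t == k)
        union = sum(1 for p, t in zip(predicted, target) if p == k or t == k)
        if union > 0:                 # ignore classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious), global_accuracy


miou, accuracy = segmentation_scores([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], 3)
```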
Network training We use Adam  with batch size of 6, using a learning rate of 0.0002. We set , , , , . For the first 200,000 iterations we train all networks. For the following 200,000 iterations we use , , and freeze the RGB encoder. We found the network converges faster using a large initial . We add Gaussian noise to the latent space with zero mean and a standard deviation of 0.5.
5.2 Ablation study
In a first experiment we use the 51K dataset to study the impact of several design elements on the overall performance of the system.
Side information We first evaluate the usage of side information from the encoder to guide the upsampling process in the decoder. We consider three variants: no side information, skip connections and pooling indices. The results in Table 1 show that skip connections obtain worse results than no side information at all. This is because skip connections are not domain-invariant: at test time, when we combine an encoder and a decoder that were not trained together, these connections provide the decoder with an input different from the one seen during training, resulting in a drop in performance. Fig. 3 illustrates the differences between these three variants. Without side information the network is able to reconstruct a coarse segmentation, but without further guidance it is not able to refine it properly. Skip connections provide features that could guide the decoder but instead confuse it, since in the zero-pair case the decoder has not seen the features of that particular encoder. Pooling indices are more invariant as side information and obtain the best results.
RGB pretraining We also compare training the RGB encoder from scratch and initializing with pretrained weights from ImageNet. Table 1 shows an additional gain of around 5% in mIoU when using the pretrained weights.
Given these results we perform all the remaining experiments initializing the RGB encoder with pretrained weights and use pooling indices as side information.
Latent space consistency, noise and autoencoders We evaluate these three factors, with Table 2 showing that latent space consistency and the usage of autoencoders lead to significant performance gains; for both, the performance (in mIoU) is more than doubled. Adding noise to the output of the encoder results in a small performance gain.
5.3 Comparison with other methods
We compare the results of our mix and match networks for depth to segmentation to the following baselines:
CycleGAN [34] learns a mapping from depth to semantic segmentation without explicit pairs. In contrast to ours, this method only leverages depth and semantic segmentation, ignoring the available RGB data and the corresponding pairs.
2pix2pix [10] learns from paired data two encoder-decoder pairs (depth to RGB and RGB to segmentation). The architecture uses skip connections and the corresponding modality-specific losses. We use the exact code released by its authors. In contrast to ours, it requires explicit decoding to RGB, which may degrade the quality of the prediction.
M&MNet with an intermediate RGB image (depth to RGB to segmentation) is similar to 2pix2pix, but uses the same architecture as our M&MNet. We train a translation model from depth to RGB and from RGB to segmentation, and obtain the depth-to-segmentation transformation by concatenating them. Note that it requires using an intermediate RGB image.
Table 3 compares the three methods on the 170K dataset. CycleGAN is not able to learn a good mapping from depth to semantic segmentation, showing the difficulty of unpaired translation to solve this complex cross-modal task. 2pix2pix manages to improve the results by resorting to the anchor domain RGB, although still not satisfactory since the first translation network drops information not relevant for the RGB task but necessary for reconstructing depth (as in the "Chinese whispers" or telephone game).
Mix and match networks evaluated on the chained translation (depth to RGB to segmentation) achieve a similar result to CycleGAN, but significantly worse than 2pix2pix. However, when we run our architecture with skip connections we obtain results similar to 2pix2pix. Note that because in this setting the encoders and decoders are paired in the same way in both training and testing, skip connections function well.
The direct combination (depth to segmentation) outperforms all baselines significantly. The results more than double in terms of mIoU. Figure 4 illustrates the comparison between our approach and the baselines; our method is the only one that manages to identify the main semantic classes and their general contours in the image. In conclusion, the results show that mix and match networks enable effective zero-pair translation.
In Table 4 we show the results when we test in the opposite direction, from semantic segmentation to depth. The conclusions are similar to those of the previous experiment. Again our method outperforms both baseline methods on all five evaluation metrics. Fig. 5 illustrates this case, showing how pooling indices are also key to obtaining good depth images, compared with no side information at all.
5.4 Multimodal translation
Next we consider the case of multimodal translation from pairs (RGB, depth) to semantic segmentation. As depicted in Figure 2, multiple modalities can be combined at the input of the semantic segmentation decoder, since the latent spaces are aligned. To combine the two modalities we perform a weighted average of the RGB and depth latent vectors (the weight ranges from 0, only RGB, to 1, only depth, depending on the case). We set the weight to 0.2 and use the pooling indices from the RGB encoder (instead of those from the depth encoder, see supplementary material for more details). The results in Table 3 and Table 4 and the example in Figure 5 show that this multimodal combination further improves the performance of zero-pair translation.
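The weighted average of latent vectors can be sketched as follows. The default weight of 0.2 is the value used in this section; the convention that the weight multiplies the depth latent (so that 0 keeps only RGB and 1 keeps only depth), the function name and the toy vectors are assumptions of this example.

```python
def fuse_latents(z_rgb, z_depth, weight=0.2):
    """Weighted average of two aligned latent vectors.

    weight = 0 keeps only the RGB latent, weight = 1 keeps only the depth latent.
    """
    return [(1.0 - weight) * r + weight * d for r, d in zip(z_rgb, z_depth)]


z_rgb = [1.0, 0.0, -2.0]      # hypothetical RGB encoder output
z_depth = [2.0, 1.0, 2.0]     # hypothetical depth encoder output
fused = fuse_latents(z_rgb, z_depth)
```

Such an average is only meaningful because both encoders are trained to map into the same latent space.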
5.5 Scalable unpaired image translation
As explained in Section 3.3, mix and match networks scale linearly with the number of domains, whereas existing unpaired image translation methods scale quadratically. As examples of translations between many domains, we show results for object recoloring and style transfer, using mix and match networks based on multiple CycleGANs [34] combined with autoencoders. For the former we use the colored objects dataset with eleven distinct colors ($N = 11$) and around 1000 images per color. Covering all possible image-to-image recoloring combinations with CycleGANs requires training a number of encoders (and decoders) that grows quadratically with $N$. In contrast, mix and match networks only require training eleven encoders and eleven decoders, while still successfully addressing the recoloring task (see Figure 5(a)). Similarly, scalable style transfer can be addressed using mix and match networks (see Figure 5(b) for an example).
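The difference in scaling can be made concrete with a small count. The sketch assumes that a pairwise approach trains one dedicated encoder (and one decoder) for every ordered pair of domains, whereas mix and match networks train one encoder and one decoder per domain. The first counting rule is an assumption of this example, and the function names are hypothetical.

```python
def pairwise_encoder_count(num_domains):
    """Encoders (equivalently decoders) if every ordered pair has its own network."""
    return num_domains * (num_domains - 1)


def mix_and_match_encoder_count(num_domains):
    """Encoders (equivalently decoders) with a single one per domain."""
    return num_domains


for n in (3, 11, 50):
    print(n, pairwise_encoder_count(n), mix_and_match_encoder_count(n))
```

Under this assumption, the eleven-color recoloring task would need 110 pairwise encoders, compared with eleven for mix and match networks.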
In this paper we introduce the problem of zero-pair image translation, where knowledge learned in paired translation models can be effectively transferred and leveraged to perform new unseen translations. The image-to-image scenario poses several challenges to the alignment of encoders and decoders in a way that guarantees cross-domain transferability and without too much dependence on the domain or the modality. We studied this scenario in zero-pair cross-modal and multimodal settings. Notably, we found that side information in the form of pooling indices is robust to modality changes and very helpful to guide the reconstruction of spatial structure. Other helpful techniques are cross-modal consistency losses and adding noise to the latent representation.
Acknowledgements Herranz acknowledges the European Union's H2020 research under Marie Sklodowska-Curie grant No. 6655919. We acknowledge the project TIN2016-79717-R, the CHISTERA project M2CR (PCIN-2015-251) of the Spanish Government and the CERCA Programme of Generalitat de Catalunya. Yaxing Wang acknowledges the Chinese Scholarship Council (CSC) grant No. 201507040048. We also acknowledge the generous GPU support from Nvidia.
[1] V. Badrinarayanan, A. Handa, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293, 2015.
[2] C. Cadena, A. R. Dick, and I. D. Reid. Multi-modal auto-encoders as joint estimators for robotics scene understanding. In Robotics: Science and Systems, 2016.
[3] Y. Chen, Y. Liu, Y. Cheng, and V. O. Li. A teacher-student framework for zero-resource neural machine translation. arXiv preprint arXiv:1705.00753, 2017.
[4] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[5] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
[6] R. Fergus, H. Bernal, Y. Weiss, and A. Torralba. Semantic label sharing for learning with many categories. Computer Vision–ECCV 2010, pages 762–775, 2010.
[7] O. Firat, K. Cho, and Y. Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv preprint arXiv:1601.01073, 2016.
[8] Y. Fu, T. Xiang, Y.-G. Jiang, X. Xue, L. Sigal, and S. Gong. Recent advances in zero-shot recognition. arXiv preprint arXiv:1710.04837, 2017.
[9] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[10] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[11] D. Jayaraman and K. Grauman. Zero-shot recognition with unreliable attributes. In Advances in Neural Information Processing Systems, pages 3464–3472, 2014.
[12] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al. Google’s multilingual neural machine translation system: enabling zero-shot translation. arXiv preprint arXiv:1611.04558, 2016.
[13] A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115, 2017.
[14] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
[15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] R. Kuga, A. Kanezaki, M. Samejima, Y. Sugano, and Y. Matsushita. Multi-task learning using multi-modal encoder-decoder networks with shared skip connections. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017.
[17] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.
[18] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014.
[19] F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2024–2039, 2016.
[20] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.
[21] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[22] X. Mao, Q. Li, H. Xie, R. Y. Lau, and Z. Wang. Multi-class generative adversarial networks with the l2 loss function. arXiv preprint arXiv:1611.04076, 2016.
[23] X. Mao, C. Shen, and Y.-B. Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems, pages 2802–2810, 2016.
[24] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison. Scenenet rgb-d: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation? 2017.
[25] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 689–696, 2011.
[26] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 49–58, 2016.
[27] M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1641–1648. IEEE, 2011.
[28] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[30] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
[31] L. Yu, L. Zhang, J. van de Weijer, F. S. Khan, Y. Cheng, and C. A. Parraga. Beyond eleven color names for image understanding. Machine Vision and Applications, 29(2):361–373, 2018.
[32] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
[33] H. Zheng, Y. Cheng, and Y. Liu. Maximum expected likelihood estimation for zero-resource neural machine translation. IJCAI, 2017.
[34] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
Appendix A Appendix: Network architecture
Table 7 shows the architecture (convolutional and pooling layers) of the encoders used in the cross-modal experiment. Tables 8 and 5 show the corresponding decoders. Table 6 shows the discriminator used for RGB. Every convolutional layer of the encoders, decoders and the discriminator is followed by a batch normalization layer and a ReLU layer (LeakyReLU for the discriminator). The only exception is the RGB encoder, which is initialized with weights from the VGG16 model and does not use batch normalization.
|layer||Input Output||Kernel, stride|
|conv1||[6,8,8,512]||[3, 3], 1|
|conv2||[6,16,16,512]||[3, 3], 1|
|conv3||[6,32,32,256]||[3, 3], 1|
|conv4||[6,64,64,128]||[3, 3], 1|
|conv5||[6,128,128,64]||[3, 3], 1|
|layer||Input Output||Kernel, stride|
|deconv1||[5, 5], 2|
|deconv2||[5, 5], 2|
Appendix B Appendix: Multimodal fusion
Figure 7 shows the performance for different values of the fusion weight for multimodal semantic segmentation. It also compares the performance when the semantic segmentation decoder uses the pooling indices from the depth encoder instead of the ones from the RGB encoder.
|Layer||Input Output||Kernel, stride|
|conv1 (RGB)||[6,256,256,3] [6,256,256,64]||[3,3], 1|
|conv1 (Depth)||[3,3], 1|
|conv1 (Segm.)||[6,256,256,14] [6,256,256,64]||[3,3], 1|
|conv2||[6,256,256,64] [6,256,256,64]||[3,3], 1|
|pool2 (max)||[6,256,256,64] [6,128,128,64]+indices2||[2,2], 2|
|conv3||[6,128,128,64] [6,128,128,128]||[3,3], 1|
|conv4||[6,128,128,128] [6,128,128,128]||[3,3], 1|
|pool4 (max)||[6,128,128,128] [6,64,64,128]+indices4||[2,2], 2|
|conv5||[6,64,64,128] [6,64,64,256]||[3,3], 1|
|conv6||[6,64,64,256] [6,64,64,256]||[3,3], 1|
|conv7||[6,64,64,256] [6,64,64,256]||[3,3], 1|
|pool7 (max)||[6,64,64,256] [6,32,32,256]+indices7||[2,2], 2|
|conv8||[6,32,32,256] [6,32,32,512]||[3,3], 1|
|conv9||[6,32,32,512] [6,32,32,512]||[3,3], 1|
|conv10||[6,32,32,512] [6,32,32,512]||[3,3], 1|
|pool10 (max)||[6,32,32,512] [6,16,16,512]+indices10||[2,2], 2|
|conv11||[6,16,16,512] [6,16,16,512]||[3,3], 1|
|conv12||[6,16,16,512] [6,16,16,512]||[3,3], 1|
|conv13||[6,16,16,512] [6,16,16,512]||[3,3], 1|
|pool13 (max)||[6,16,16,512] [6,8,8,512]+indices13||[2,2], 2|
|layer||Input Output||Kernel, stride|
|unpool1||indices13 + [6,8,8,512]||[2, 2], 2|
|unpool4||indices10 + [6,16,16,512]||[2, 2], 2|
|unpool7||indices7 + [6,32,32,256]||[2, 2], 2|
|unpool10||indices4 + [6,64,64,128]||[2, 2], 2|
|unpool12||indices2 + [6,128,128,64]||[2, 2], 2|
|conv13 (Depth)||[6,256,256,64]||[3,3], 1|
|conv13 (Segm.)||[6,256,256,64]||[3,3], 1|