Less is More: Unified Model for Unsupervised Multi-Domain Image-to-Image Translation


Xiao Liu, Xiamen University, xiaoliu95@outlook.com
Shengchuan Zhang, Xiamen University, zsc_2016@xmu.edu.cn
Hong Liu, Xiamen University, lynnliu.xmu@gmail.com
Xin Liu, Xiamen University, xinliu@stu.xmu.edu.cn
Rongrong Ji*, Xiamen University, rrji@xmu.edu.cn
Abstract

In this paper, we aim to solve the multi-domain image-to-image translation problem with a single GAN-based model in an unsupervised manner. In the field of image-to-image translation, most previous works adopt a generative adversarial network that contains three parts, i.e., an encoder, a decoder and a discriminator. These three parts are trained jointly, and the encoder and the decoder together serve as the translator. However, the discriminator, which occupies a large share of the parameters, is abandoned after the training process, which wastes computation and memory. To handle this problem, we integrate the discriminator and the encoder of the traditional framework into a single network, where the decoder in our framework translates the information encoded by the discriminator into the target image. As a result, our framework contains only two parts, i.e., a decoder and a discriminator, which effectively reduces the number of parameters of the network and enables more efficient training. Then, we expand the traditional binary discriminator to a multi-class discriminator, which allows a single model to solve the general multi-domain image-to-image translation problem. At last, we propose a label encoder that transforms the label vector into a high-dimension representation automatically rather than designing a one-hot vector manually. We performed extensive experiments on many image-to-image translation tasks including style transfer, season transfer, face hallucination, etc. A unified model was trained to translate images sampled from 14 considerably different domains, and comparisons to several recently-proposed approaches demonstrate the superiority and novelty of our framework.

 

Preprint. Work in progress.

1 Introduction

The automatic image-to-image translation problem is defined as translating an input image into a corresponding output image given sufficient training data (isola2017image, ). A great deal of research in this field has been conducted since the introduction of generative adversarial networks (GAN) (goodfellow2014generative, ). Isola et al. (isola2017image, ) proposed a general-purpose image-to-image translation model in a supervised manner by using conditional adversarial networks (CGAN) (mirza2014conditional, ). CycleGAN (zhu2017unpaired, ), DualGAN (yi2017dualgan, ) and DiscoGAN (kim2017learning, ) translate images in an unsupervised manner by training two GANs. However, the aforementioned approaches focus on two-domain image-to-image translation, where one model only translates one domain to the other. When there are multiple domains, these approaches are limited and inconvenient, since every pair of image domains requires an independently built model.

As a scalable approach, StarGAN (choi2017stargan, ) uses a single model containing one encoder, one decoder and one discriminator to translate multi-domain facial images in an unsupervised manner. In principle, the classification loss and the adversarial loss in StarGAN force the generated images to fall inside the target domain. StarGAN performs well on the task of face attribute modification, where all the domains are only slight shifts of one another in the high-dimension representation. However, this method shows poor performance on more general translations such as style transfer in our experiments. The problem of general multi-domain image-to-image translation remains unsolved by a single and simple model.

On the other hand, most of the GAN-based image-to-image translation models (isola2017image, ; zhu2017unpaired, ; yi2017dualgan, ; kim2017learning, ; choi2017stargan, ; liu2017unsupervised, ) adopt the Encoder-Decoder-Discriminator (E-D-D) architecture (Fig.1(a)), where the discriminator has only one function, namely distinguishing whether its input is real or fake during training. In the testing step, the discriminator is abandoned, which means that an encoder and a decoder are obtained by training three networks. Moreover, the discriminator usually occupies most of the parameters of the E-D-D network. For example, StarGAN (for 128×128 image translation) has 53.2 million parameters, of which the discriminator takes up about 44 million.

Figure 1: Comparison between (a) the E-D-D architecture and (b) our proposed D-D architecture.

To address the problems mentioned above, we first propose a new Decoder-Discriminator (D-D) architecture (Fig.1(b)). As the encoder in the traditional framework has a similar architecture to the discriminator, in our framework the encoder is removed and the discriminator also serves as the encoder, allowing the decoder to translate the information encoded by the discriminator. We then propose the multi-class adversarial discriminator to solve general multi-domain translation problems. To be specific, the real images are classified into multiple domains $X_1, X_2, \dots, X_N$, where $N$ is the number of domains. In contrast to the discriminator in the vanilla GAN, which outputs a single number to determine if the input is real, our proposed multi-class discriminator outputs an $N$-dimensional vector $D(x) = [D_1(x), \dots, D_N(x)]$. $D_i(x)$ is the $i$-th element of $D(x)$, which is defined to tell whether the input is a real image of domain $X_i$. The multi-class adversarial discriminator can therefore be considered as $N$ binary discriminators sharing most of their weights, where the $i$-th discriminator is responsible for telling whether the input is a real image of $X_i$.

Finally, we propose a small encoder network to encode the label vector into a high-dimension label vector. As we suppose the label information has the same importance as the encoded image information, the label vector has to be a high-dimension vector/matrix like the encoded image vector/matrix. Unlike CGAN and StarGAN, which design the high-dimension label vector manually, our solution trains a two-layer auto-encoder to encode the $N$-dimensional label vector, where $N$ is the number of domains.

Figure 2: Our proposed framework. $E$ is the label encoder, and its output is concatenated with the feature maps of the discriminator. The output of the discriminator is a vector rather than a single number. The decoder translates the feature maps of the discriminator and the encoded label information into the output image.

To verify our proposed approaches, we performed extensive experiments on many datasets. Our framework achieves qualitatively convincing results, and comparisons to several recently-proposed approaches illustrate the superiority of our method. Note that all the results in this paper are generated by a single model trained on multiple datasets.

Overall, our framework, combining the D-D architecture, the multi-class discriminator and the label encoder, is illustrated in Fig.2. Our contributions are summarized as follows:

  • We replace the encoder with the discriminator to encode the input image, which is a novel way of extending the discriminator of a GAN to perform encoding. To the best of our knowledge, this is the first such design in the GAN literature, and it performs very well in our experiments.

  • We propose the multi-class discriminator that allows our framework to solve the general multi-domain translation problem.

  • We encode the low-dimension domain label vector into a high-dimension label vector automatically with a small label encoder network, which eases the workload by avoiding manually designed high-dimension vectors.

2 Related Work

Generative Adversarial Networks (goodfellow2014generative, ; zhao2016energy, ) have been widely used in many computer vision tasks, including image generation (zhao2016energy, ; arjovsky2017wasserstein, ; huang2017stacked, ; radford2015unsupervised, ), super-resolution (ledig2016photo, ) and image-to-image translation (isola2017image, ; zhu2017unpaired, ; yi2017dualgan, ; kim2017learning, ; choi2017stargan, ; liu2017unsupervised, ). The discriminator of a GAN learns to distinguish whether the input samples are real or fake, which we call a binary discriminator in this paper. The generator is trained to translate its inputs into fake images that fool the discriminator. As many computer vision tasks can be considered translations, training a generator as a translator on prepared data by optimizing the GAN loss is a general solution to these problems. In our framework, we use the adversarial loss to train the generator to produce target-domain images that are as realistic as possible.

Conditional GANs (CGAN) were proposed to generate samples given conditional information. As a conditional version of GAN, CGAN simply feeds the data and its corresponding condition information to both the generator and the discriminator. The discriminator in CGAN is responsible for distinguishing whether the sample is paired with the condition information. The generator learns to generate corresponding samples when given certain conditions. MAD-GAN (ghosh2017multi, ) proposes that the discriminator output a vector rather than a number, in order to distinguish whether the sample is real and to identify the generator that the fake sample comes from. MAD-GAN can also be regarded as a conditional version of GAN, which uses the condition to train the discriminator to identify the generator rather than feeding the condition information and the data to the discriminator directly. The discriminator of MAD-GAN shares some ideas with our proposed multi-class discriminator. The difference is that the discriminator in our framework outputs a vector that distinguishes whether the input is real and identifies the domain of the input.

Adversarial Auto-Encoder (AAE) (makhzani2015adversarial, ) uses adversarial training to match the distribution of the encoded code with an arbitrary prior distribution in the latent space. By training the generator to map the prior distribution to the data distribution, AAE can be applied to many tasks such as dimensionality reduction, clustering and data visualization. The architecture of our framework may seem similar to AAE, but the training process and the principle are totally different. The discriminator in our framework also serves as the encoder, with its main role being to match the generated data distribution with the target data distribution.

Image-to-Image Translation is a crucial sub-field of computer vision, since many computer vision tasks can be cast as image-to-image translation problems, such as super-resolution (ledig2016photo, ) (low-resolution to high-resolution), style transfer (johnson2016perceptual, ) (photo to artistic style), face hallucination (wang2014comprehensive, ) (face photo/sketch to sketch/photo) and so on. pix2pix (isola2017image, ) introduced a conditional adversarial network-based framework as a general-purpose supervised solution to the image-to-image translation problem. As pix2pix needs paired data, researchers proposed unsupervised image-to-image translation frameworks (zhu2017unpaired, ; liu2017unsupervised, ) that are trained with unpaired data. CycleGAN, DualGAN and DiscoGAN train two GANs to learn the translation mappings between two domains by utilizing the cycle consistency loss (zhu2017unpaired, ). StarGAN made an impressive step in the field of multi-domain translation by using a single model to learn all the translation mappings among multiple face attributes.

3 The Proposed Methods

Our goal is to train one decoder and one discriminator to implicitly learn the mapping functions among all the image domains. Before training the discriminator and the decoder, we train the label encoder to transform the low-dimension target label into a high-dimension representation, which is described in Sec.3.1. Then, we suppose a set of images sampled from $N$ domains $X_1, X_2, \dots, X_N$, where $N$ is the number of domains, and we denote the data distribution of domain $X_i$ as $p_{\mathrm{data}}(x_i)$. Fig.3 shows the training process of the translation $X_i \rightarrow X_j$. In every iteration, the discriminator is first trained to encode the input and to distinguish real $x_j$ from fake $x_j$. Then, the decoder is trained to fool the discriminator by feeding it the encoded information of the input image and the target label information. The objective function for training the discriminator and the decoder contains two terms: the multi-class adversarial loss, which forces the generated samples to lie in the target domain, and the reconstruction loss, which preserves the content and structure information of the input images in the generated images.

Figure 3: The training process of $X_i \rightarrow X_j$. $E$ is the label encoder network described in Sec.3.1, and the distance node computes the distance between its two inputs. (a) The discriminator $D$ learns to distinguish real $x_j$ from fake $x_j$. The input of the decoder is the feature maps of $x_i$ and the encoded label vector of $X_j$. The $j$-th element of the adversarial vector is responsible for distinguishing real $x_j$ from fake $x_j$. We only optimize $D$ in this step. (b) The generator $G$ learns to translate the encoded information of $x_i$ into a fake $x_j$ given the label vector of $X_j$. $G$ tries to fool $D$ so that the fake $x_j$ is indistinguishable from the real $x_j$. Then $G$ learns to translate the encoded information of the fake $x_j$ back into a fake $x_i$ given the encoded label vector of $X_i$. The distance between the real $x_i$ and the fake $x_i$ is the reconstruction loss. We only optimize $G$ in this step.

3.1 Label Encoder Network

CGAN uses a one-hot vector to represent the condition information, where the vector is designed manually. StarGAN utilizes the mask vector method to combine the label information of multiple datasets. Both of them use binary codes to encode the condition information, which becomes insufficient when the number of domains increases significantly. In our framework, before training the decoder and the discriminator, we train a two-layer auto-encoder to obtain the label encoder network, which encodes the low-dimension label vector into a high-dimension vector. The details of the label encoder network are given in the Appendix (Sec.A.3).

As we assume that the target label information has the same importance as the input image, the high-dimension label vector has the same dimension as the encoded image vector. Let $N$ be the number of domains and let $l_j = [l_j^1, l_j^2, \dots, l_j^N]$ be an $N$-dimensional vector:

$l_j^k = \begin{cases} 1, & k = j \\ 0, & k \neq j \end{cases}$   (1)

where $k \in \{1, 2, \dots, N\}$. Eq.(1) gives the target label vector of the $j$-th domain. The label encoder network is denoted as $E$:

$\hat{l}_j = E(l_j),$   (2)

where $\hat{l}_j$ has the same dimension as the encoded image vector. We concatenate the encoded target label vector and the encoded image vector as the input of the first layer of the decoder. The label encoder network and the decoder together are considered as one network, which is the generator in our framework.
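As a concrete illustration, the following PyTorch sketch builds the two-layer label auto-encoder with the layer shapes given in Table 5 of the appendix and encodes a one-hot target label; the variable names and the surrounding code are our own illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the two-layer label auto-encoder (shapes per Table 5 in
# the appendix). Everything except the layer shapes is an illustrative assumption.
import torch
import torch.nn as nn

N = 14  # number of domains, as in the paper's unified model

label_encoder = nn.Sequential(            # (B, N, 1, 1) -> (B, 512, 2, 2)
    nn.ConvTranspose2d(N, 512, kernel_size=4, stride=2, padding=1),
    nn.BatchNorm2d(512),
    nn.ReLU(inplace=True),
)
label_decoder = nn.Sequential(            # (B, 512, 2, 2) -> (B, N, 1, 1)
    nn.Conv2d(512, N, kernel_size=4, stride=2, padding=1),
    nn.BatchNorm2d(N),
    nn.LeakyReLU(0.2, inplace=True),
)

# One-hot target label l_j from Eq. (1), reshaped to (B, N, 1, 1).
j = 3
l_j = torch.zeros(1, N, 1, 1)
l_j[:, j] = 1.0
encoded_label = label_encoder(l_j)        # same shape as the encoded image, (1, 512, 2, 2)
```

Only the encoder part is reused later; the decoder part exists solely to train it, as described in Sec.A.3.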

3.2 Combining the Encoder and the Discriminator

In the vanilla GAN framework, the discriminator plays the role of a 'teacher' that distinguishes whether the input is real. Most image-to-image translation models use the adversarial loss to generate more realistic images, following CGAN and DCGAN (radford2015unsupervised, ), where the generator and the discriminator are deep convolutional neural networks. In our view, the discriminator in most models performs work similar to the encoder, and the two networks usually have similar architectures, namely deep convolutional neural networks. Inspired by this, in our framework we remove the encoder and let the discriminator take over its role. The feature maps of the discriminator are regarded as the encoded information of the input image, which becomes the input of the decoder. We define the decoder together with the label encoder as the generator $G$, and denote the discriminator as $D$. We then denote the feature maps of the discriminator as $D_e(x_i)$ when feeding an image $x_i$ sampled from the distribution $p_{\mathrm{data}}(x_i)$. $G(x_i, l_j)$ is the translated image obtained by feeding $D_e(x_i)$ to the decoder under the condition of the target label $l_j$.
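To make the mechanism concrete, here is a minimal PyTorch sketch (ours, not the authors' code) of a discriminator that doubles as the encoder: it returns both its 2×2×512 feature maps, which feed the decoder, and the $N$-dimensional adversarial vector. The layer shapes follow Table 2 in the appendix; everything else is an illustrative assumption.

```python
# Hedged sketch of the D-D forward pass: the discriminator exposes its feature
# maps (the "encoder" output) as well as the N-dimensional adversarial vector.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, num_domains: int):
        super().__init__()
        # Down-sampling stack per Table 2: 3x128x128 -> 512x2x2.
        layers, in_ch = [nn.Conv2d(3, 64, 7, 1, 3), nn.LeakyReLU(0.2)], 64
        for out_ch in (64, 128, 256, 512, 512, 512):
            layers += [nn.Conv2d(in_ch, out_ch, 4, 2, 1),
                       nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2)]
            in_ch = out_ch
        self.encoder = nn.Sequential(*layers)
        self.adv_head = nn.Conv2d(512, num_domains, 4, 2, 1)   # 2x2 -> 1x1

    def forward(self, x):
        feats = self.encoder(x)                 # D_e(x): (B, 512, 2, 2), fed to the decoder
        adv = self.adv_head(feats).flatten(1)   # D(x):   (B, N) adversarial vector
        return feats, adv

# Decoder input: concat(D_e(x), E(l_j)) -> (B, 1024, 2, 2), matching Table 3.
```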

3.3 Multi-Class Adversarial Loss

We use the adversarial loss to train the model to generate images indistinguishable from real target-domain images. In terms of multi-domain translation, the original GAN loss is not suitable because the discriminator only classifies inputs as real or fake. To address this problem, we propose the multi-class adversarial loss:

$\mathcal{L}_{adv}^{j} = \mathbb{E}_{x_j \sim p_{\mathrm{data}}(x_j)}\big[\log D_j(x_j)\big] + \mathbb{E}_{x_i \sim p_{\mathrm{data}}(x_i)}\big[\log\big(1 - D_j(G(x_i, l_j))\big)\big],$   (3)

where $D_j$ is the $j$-th element of the adversarial vector shown in Fig.3, which is responsible for distinguishing whether the input image is a real $x_j$. The discriminator is trained by maximizing this objective function and the generator is trained by minimizing it. It is important that the discriminator is fixed when training the generator.
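For concreteness, the following sketch computes the $j$-th component of the multi-class adversarial loss in a numerically stable binary-cross-entropy-with-logits form; treating the adversarial vector as logits and the helper names are our assumptions, not the authors' implementation.

```python
# Sketch of the multi-class adversarial loss of Eq. (3), BCE-with-logits form.
# Only the j-th element of the adversarial vector is used for X_i -> X_j.
import torch
import torch.nn.functional as F

def d_adv_loss(adv_real_j: torch.Tensor, adv_fake_j: torch.Tensor) -> torch.Tensor:
    """Discriminator side: push real x_j towards 1 and fake x_j towards 0."""
    real = F.binary_cross_entropy_with_logits(adv_real_j, torch.ones_like(adv_real_j))
    fake = F.binary_cross_entropy_with_logits(adv_fake_j, torch.zeros_like(adv_fake_j))
    return real + fake

def g_adv_loss(adv_fake_j: torch.Tensor) -> torch.Tensor:
    """Generator side: fool D_j into labelling the fake x_j as real."""
    return F.binary_cross_entropy_with_logits(adv_fake_j, torch.ones_like(adv_fake_j))

# Usage (shapes only): adv = D(x)[1] has shape (B, N); select column j, e.g.
# loss_d = d_adv_loss(D(real_xj)[1][:, j], D(fake_xj.detach())[1][:, j])
```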

3.4 Reconstruction Loss

Minimizing the adversarial loss enables the generator to output realistic target-domain images. However, the content and structure information of the input image is not necessarily preserved in the output. To address this problem, the reconstruction loss is applied to the generator:

$\mathcal{L}_{rec} = \mathbb{E}_{x_i \sim p_{\mathrm{data}}(x_i)}\big[\, d\big(x_i,\; G(G(x_i, l_j), l_i)\big) \,\big],$   (4)

where $G(x_i, l_j)$ is the translated image, whose encoded information $G$ translates back to the $X_i$-domain image $G(G(x_i, l_j), l_i)$. Then the distance $d(\cdot,\cdot)$ between the initial input image and the reconstructed image is defined as the reconstruction loss.
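A possible implementation of Eq.(4) is sketched below; the use of the $L_1$ distance for $d(\cdot,\cdot)$ and the $G(\text{image}, \text{label})$ calling convention are assumptions on our part.

```python
# Sketch of the reconstruction (cycle-back) loss of Eq. (4): translate x_i to
# domain X_j, translate the result back to X_i, and compare with the input.
import torch.nn.functional as F

def reconstruction_loss(G, x_i, l_i, l_j):
    fake_xj = G(x_i, l_j)          # X_i -> X_j
    rec_xi = G(fake_xj, l_i)       # X_j -> X_i (cycle back)
    return F.l1_loss(rec_xi, x_i)  # L1 distance assumed for d(.,.)
```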

3.5 Full Objective

The full objective functions of $D$ and $G$ are written as:

$\mathcal{L}_{D} = -\mathcal{L}_{adv}^{j},$   (5)

$\mathcal{L}_{G} = \mathcal{L}_{adv}^{j} + \lambda\,\mathcal{L}_{rec},$   (6)

where $j$ denotes that the generator translates the input to the $j$-th domain, and $\lambda$ is the hyper-parameter that controls the influence of the reconstruction loss on the full objective. $\lambda$ is set to 100 in all our experiments.
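Putting Eqs.(5) and (6) together, one training iteration for the translation $X_i \rightarrow X_j$ could look like the sketch below. It reuses the loss helpers sketched above, and the optimizer handling and the tuple returned by $D$ follow our earlier sketches rather than the authors' code.

```python
# Hedged sketch of one training iteration, combining Eqs. (5) and (6).
def train_step(D, G, d_opt, g_opt, x_i, x_j, l_i, l_j, j, lambda_rec=100.0):
    # (a) Update the discriminator only (Eq. 5); G's output is detached.
    fake_xj = G(x_i, l_j).detach()
    loss_d = d_adv_loss(D(x_j)[1][:, j], D(fake_xj)[1][:, j])
    d_opt.zero_grad(); loss_d.backward(); d_opt.step()

    # (b) Update the generator only (Eq. 6), with D's parameters not stepped.
    fake_xj = G(x_i, l_j)
    loss_g = g_adv_loss(D(fake_xj)[1][:, j]) \
             + lambda_rec * reconstruction_loss(G, x_i, l_i, l_j)
    g_opt.zero_grad(); loss_g.backward(); g_opt.step()
    return loss_d.item(), loss_g.item()
```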

4 Implementation

Network Architecture. We adopt U-Net (ronneberger2015u, ) as the basic architecture. For 128×128 images, the input is first down-sampled by 6 stride-2 convolutions. The resulting feature maps and the encoded target label vector form the input of the decoder, which consists of 6 fractionally-strided convolutions with stride 1/2. In addition, one stride-2 convolution layer maps the encoded vector of the input image to the adversarial vector.
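The doubled input channel counts in Table 3 of the appendix (1024 = 512 + 512, 512 = 256 + 256, ...) suggest U-Net style skip concatenations between the down-sampling feature maps and the decoder. The sketch below illustrates that wiring under this assumption: the deepest decoder input is the concatenation of the discriminator features and the encoded label, and the first four up-sampling outputs are concatenated with the mirror-resolution feature maps.

```python
# Hedged sketch of the U-Net style decoder forward pass; the exact skip wiring
# is inferred from the channel counts in Table 3 and is an assumption.
import torch

def decoder_forward(up_blocks, feats, skips, encoded_label):
    """
    feats:         (B, 512, 2, 2) deepest discriminator feature map, D_e(x)
    skips:         [(B,512,4,4), (B,512,8,8), (B,256,16,16), (B,128,32,32)]
                   intermediate discriminator feature maps, deepest first
    encoded_label: (B, 512, 2, 2) output of the label encoder E
    up_blocks:     the 7 deconvolution blocks of Table 3, deepest first
    """
    x = torch.cat([feats, encoded_label], dim=1)    # (B, 1024, 2, 2)
    for block, skip in zip(up_blocks[:4], skips):
        x = torch.cat([block(x), skip], dim=1)      # mirror-resolution skip
    for block in up_blocks[4:]:
        x = block(x)                                # last three blocks, no skips
    return x                                        # (B, 3, 128, 128)
```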

Training Details. We performed experiments on 8 datasets comprising 14 considerably different domains. The training images include: 1096 aerial maps and 1096 aerial photographs from Google Maps (isola2017image, ); 2975 cityscape semantic label images and 2975 cityscape photos from the Cityscapes training set (cordts2016cityscapes, ); 400 architecture labels and 400 architecture photos from (tylevcek2013spatial, ); 1273 summer photos and 854 winter photos from (zhu2017unpaired, ); 401 paintings of Van Gogh, 1074 paintings of Monet, 584 paintings of Cezanne and 6853 photographs from (zhu2017unpaired, ); and 88 face photos and 88 face sketches from the CUHK Students dataset (wang2009face, ).

We randomly chose the pairs of domains used to train the model in every epoch. The model was trained for a number of epochs that scales with $N$, the number of domains. We use the Adam solver (kingma2014adam, ) with fixed momentum parameters $\beta_1$ and $\beta_2$, and the batch size is 4 in all our experiments. The learning rate is 0.0002 in the first half of the training procedure and then decays linearly to 0 over the second half. Please see the Appendix (Sec.A) for more details of the network architectures and training.
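The optimizer setup can be sketched as follows; the Adam momentum values and the total epoch count are placeholders (the exact values are not restated here), and G and D refer to the modules sketched in Sec.3.

```python
# Hedged sketch: Adam with lr = 2e-4 held for the first half of training and
# decayed linearly to 0 over the second half, as described above.
import torch

def build_optimizers(G, D, total_epochs):
    g_opt = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))  # betas assumed
    d_opt = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))  # betas assumed

    def lr_lambda(epoch):
        half = total_epochs // 2
        return 1.0 if epoch < half else max(0.0, 1.0 - (epoch - half) / float(half))

    g_sched = torch.optim.lr_scheduler.LambdaLR(g_opt, lr_lambda)
    d_sched = torch.optim.lr_scheduler.LambdaLR(d_opt, lr_lambda)
    return g_opt, d_opt, g_sched, d_sched
```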

5 Experiments

In this section, we compare our method with baseline models including pix2pix, CycleGAN and StarGAN on multiple translation tasks by conducting a user study. The results of our model are qualitatively and quantitatively close to CycleGAN and much better than StarGAN, which demonstrates that our proposed framework can serve as a general-purpose method for multi-domain image-to-image translation. Then, we demonstrate the superiority of our method by analyzing the results generated by our model and comparing the number of parameters with the baseline models. Qualitative and quantitative comparisons are given in Sec.5.2.

5.1 Baseline Models

We compare our results with several related baseline models. pix2pix is a general-purpose two-domain image-to-image translation model trained in a supervised manner, and CycleGAN is a representative unsupervised model. We compare our model to pix2pix and CycleGAN to demonstrate that our method is able to translate images between considerably different domains. We also compare our model to StarGAN to illustrate that our framework is a general-purpose method in the field of multi-domain image-to-image translation.

pix2pix has a similar architecture to our framework except for the encoder. pix2pix optimizes the reconstruction loss and the adversarial loss to train the model. As pix2pix is trained on paired data, we regard its results as an upper bound on quality; comparing our results to pix2pix illustrates how close our unsupervised model can come to that upper bound.

CycleGAN learns two generators, $G: X \rightarrow Y$ and $F: Y \rightarrow X$, to find the mappings between two domains $X$ and $Y$. CycleGAN applies the cycle consistency losses $F(G(x)) \approx x$ and $G(F(y)) \approx y$ to force the generated images to keep the content and structure information of the inputs. Meanwhile, the adversarial losses for both GANs are used to train the generators to generate images that are as realistic as possible.

StarGAN utilizes one discriminator, one encoder and one decoder to learn the mappings among multiple domains. By minimizing the classification loss and the adversarial loss, the outputs of the generator are constrained to the target domains. The cycle consistency loss in StarGAN has a form similar to that of CycleGAN.

5.2 Experimental Results

We first train pix2pix and CycleGAN multiple times, once per dataset. Then, we train our proposed model and StarGAN as unified models, where a single model handles all tasks. StarGAN was trained on 6 domains and our model was trained on 14 domains. All models are trained to translate 128×128 images.

5.2.1 Qualitative Evaluation

In Fig.4, we compare our results with pix2pix, CycleGAN and StarGAN. Qualitatively, the results of our model have only a small gap to the supervised method, pix2pix. The results of our method have visual quality similar to CycleGAN, yet we only need a single model to handle all of these tasks. As the figure shows, facade labels ↔ photos, cityscape labels ↔ photos and aerial maps ↔ photos are considerably different. StarGAN fails to translate facade photos → labels and cityscape labels → photos and shows poor performance on the other tasks. Compared to StarGAN, our unified model successfully handles all the tasks.

Figure 4: Comparisons between our model and the baseline models. From top to bottom: facade labels ↔ photos, cityscape labels ↔ photos and aerial maps ↔ photos.

As we randomly crop the training data to 128×128 to train the model, our model is able to generate high-resolution images (512×512). We run our model convolutionally on the 512×512 aerial maps at test time. Fig.5 shows the high-resolution images generated by our model.

Figure 5: The high-resolution images generated by our model.

We train a single model to learn the mappings among 14 domains. Fig.6 and Fig.7 show the results of season transfer, face hallucination and style transfer. Note that the 8 datasets cover 14 domains because Van Gogh ↔ Photos, Monet ↔ Photos and Cezanne ↔ Photos use the same set of photographs as training data. Moreover, although there are only 88 pairs of training images in the CUHK Students dataset, our model successfully handles this task without any task-specific setting.

Figure 6: Results of Season Transfer and Face Hallucination
Figure 7: Results of Style Transfer

5.2.2 Quantitative Evaluation

For quantitative evaluation, we performed a user study to assess the quality of the translation between facade labels and photos. In the survey, the users were instructed to choose the best generated image for each input image based on perceptual realism and the consistency of content and structure. In each question, we gave a pair of images of the translation task as an example to the user. Each question had three options, which were the images generated by CycleGAN, StarGAN and our model, and we asked the user to pick the best translated image for the given input. Moreover, there were a few logical questions for validating human effort.

Method | Label → Photo | Photo → Label | Average | Parameters
CycleGAN | 33.67% | 51.53% | 42.60% | 52.6M × 16
StarGAN | 9.18% | 2.55% | 5.87% | 53.2M × 1
Ours | 57.15% | 45.92% | 51.53% | 34.1M × 1
Table 1: Quantitative evaluation. The first three columns report the percentage of user votes; the last column reports the number of parameters and the number of trained models.

As Table 1 shows, both our model and CycleGAN obtained around half of the votes for the best generated images while few users voted for StarGAN, which indicates that the quality of the images generated by our model is much better than StarGAN and close to CycleGAN. On the other hand, as we adopt our proposed D-D architecture, the number of parameters of our model is much smaller than that of CycleGAN and StarGAN. Our model has only 34.1M parameters, which is 64.1% of StarGAN and about 4% of the 16 CycleGAN models combined.

6 Conclusion and Future Work

In this paper, we proposed the Decoder-Discriminator architecture, a novel architecture for image-to-image translation. To solve the multi-domain image-to-image translation problem, we proposed the multi-class discriminator, which is capable of matching the generated distributions to the target distributions. In addition, the label encoder was proposed to encode the label vector automatically by training a two-layer auto-encoder network. Overall, a unified model successfully completes multiple translations including style transfer, season transfer, face hallucination, etc.

In principle, our proposed framework is capable of handling many more tasks with a unified model. Training a more general model and improving the quality of the generated images will be our future work.

Acknowledgments: We thank Miss Qisi Zhang for helpful discussions and her work on proof-reading. This work is supported by the National Key R&D Program (No.2017YFC0113000, and No.2016YFB1001503), Nature Science Foundation of China (No.U1705262, No.61772443, and No.61572410), Post Doctoral Innovative Talent Support Program under Grant BX201600094, China Post-Doctoral Science Foundation under Grant 2017M612134, Scientific Research Project of National Language Committee of China (Grant No. YB135-49), and Nature Science Foundation of Fujian Province, China (No. 2017J01125 and No. 2018J01106).

References

  • (1) P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in CVPR, 2017 IEEE Conference on, 2017.
  • (2) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, pp. 2672–2680, 2014.
  • (3) M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
  • (4) J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, 2017 IEEE International Conference on, 2017.
  • (5) Z. Yi, H. Zhang, P. Tan, and M. Gong, “Dualgan: Unsupervised dual learning for image-to-image translation,” arXiv preprint, 2017.
  • (6) T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” arXiv preprint arXiv:1703.05192, 2017.
  • (7) Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” arXiv preprint arXiv:1711.09020, 2017.
  • (8) M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image translation networks,” in Advances in Neural Information Processing Systems, pp. 700–708, 2017.
  • (9) J. Zhao, M. Mathieu, and Y. LeCun, “Energy-based generative adversarial network,” arXiv preprint arXiv:1609.03126, 2016.
  • (10) M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
  • (11) X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie, “Stacked generative adversarial networks,” in CVPR, vol. 2, p. 4, 2017.
  • (12) A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
  • (13) C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., “Photo-realistic single image super-resolution using a generative adversarial network,” arXiv preprint, 2016.
  • (14) A. Ghosh, V. Kulharia, V. Namboodiri, P. H. Torr, and P. K. Dokania, “Multi-agent diverse generative adversarial networks,” arXiv preprint arXiv:1704.02906, 2017.
  • (15) A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
  • (16) J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV, pp. 694–711, Springer, 2016.
  • (17) N. Wang, D. Tao, X. Gao, X. Li, and J. Li, “A comprehensive survey to face hallucination,” International journal of computer vision, vol. 106, no. 1, pp. 9–30, 2014.
  • (18) O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, pp. 234–241, Springer, 2015.
  • (19) M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on CVPR, pp. 3213–3223, 2016.
  • (20) R. Tyleček and R. Šára, “Spatial pattern templates for recognition of objects with regular structure,” in GCPR, pp. 364–374, Springer, 2013.
  • (21) X. Wang and X. Tang, “Face photo-sketch synthesis and recognition,” PAMI, vol. 31, no. 11, pp. 1955–1967, 2009.
  • (22) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • (23) S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • (24) A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML, p. 3, 2013.
  • (25) V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th ICML, pp. 807–814, 2010.

Appendix A Appendix

A.1 Network Architecture

The network architectures for 128×128-image translation are listed below in Table 2 and Table 3. We use batch normalization (ioffe2015batch, ) in all layers of the discriminator except the first layer and the output layer. Leaky ReLU (maas2013rectifier, ) with a negative slope of 0.2 is used in all layers of the discriminator except the output layer. For the decoder, we use batch normalization and ReLU (nair2010rectified, ) in all layers except the last layer, where we use the Tanh function.

Part | Input Shape -> Output Shape | Layer Information
Down-Sampling | (3,128,128) -> (64,128,128) | CONV-(O:64, K:7x7, S:1, P:3), Leaky ReLU
| (64,128,128) -> (64,64,64) | CONV-(O:64, K:4x4, S:2, P:1), BN, Leaky ReLU
| (64,64,64) -> (128,32,32) | CONV-(O:128, K:4x4, S:2, P:1), BN, Leaky ReLU
| (128,32,32) -> (256,16,16) | CONV-(O:256, K:4x4, S:2, P:1), BN, Leaky ReLU
| (256,16,16) -> (512,8,8) | CONV-(O:512, K:4x4, S:2, P:1), BN, Leaky ReLU
| (512,8,8) -> (512,4,4) | CONV-(O:512, K:4x4, S:2, P:1), BN, Leaky ReLU
| (512,4,4) -> (512,2,2) | CONV-(O:512, K:4x4, S:2, P:1), BN, Leaky ReLU
Output Layer | (512,2,2) -> (N,1,1) | CONV-(O:N, K:4x4, S:2, P:1)
Table 2: The architecture of the discriminator
Part | Input Shape -> Output Shape | Layer Information
Up-Sampling | (1024,2,2) -> (512,4,4) | DECONV-(O:512, K:4x4, S:2, P:1), BN, ReLU
| (1024,4,4) -> (512,8,8) | DECONV-(O:512, K:4x4, S:2, P:1), BN, ReLU
| (1024,8,8) -> (256,16,16) | DECONV-(O:256, K:4x4, S:2, P:1), BN, ReLU
| (512,16,16) -> (128,32,32) | DECONV-(O:128, K:4x4, S:2, P:1), BN, ReLU
| (256,32,32) -> (64,64,64) | DECONV-(O:64, K:4x4, S:2, P:1), BN, ReLU
| (64,64,64) -> (64,128,128) | DECONV-(O:64, K:4x4, S:2, P:1), BN, ReLU
| (64,128,128) -> (3,128,128) | DECONV-(O:3, K:7x7, S:1, P:3), Tanh
Table 3: The architecture of the decoder

Notation in the tables: N: the number of domains; O: the number of output channels; K: the kernel size; S: the stride size; P: the padding size; BN: batch normalization.
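For reference, the seven up-sampling blocks of Table 3 can be transcribed directly into PyTorch as below, in the order used by the decoder sketch in Sec.4; only the container and helper names are our own additions.

```python
# Up-sampling blocks per Table 3; a sketch, transcribed from the table.
import torch.nn as nn

def deconv_block(in_ch, out_ch, k=4, s=2, p=1, last=False):
    layers = [nn.ConvTranspose2d(in_ch, out_ch, k, s, p)]
    layers += [nn.Tanh()] if last else [nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

up_blocks = nn.ModuleList([
    deconv_block(1024, 512),                        # (1024,2,2)  -> (512,4,4)
    deconv_block(1024, 512),                        # (1024,4,4)  -> (512,8,8)
    deconv_block(1024, 256),                        # (1024,8,8)  -> (256,16,16)
    deconv_block(512, 128),                         # (512,16,16) -> (128,32,32)
    deconv_block(256, 64),                          # (256,32,32) -> (64,64,64)
    deconv_block(64, 64),                           # (64,64,64)  -> (64,128,128)
    deconv_block(64, 3, k=7, s=1, p=3, last=True),  # (64,128,128)-> (3,128,128)
])
```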

A.2 Training Details

We randomly crop all images to 128×128 for 128×128 image-to-image translation. Then we randomly mirror and jitter the images in the pre-processing step. All networks are trained from scratch. Weights are initialized from a Gaussian distribution with mean 0 and standard deviation 0.02 (see the sketch after Table 4). The datasets used are listed in Table 4.

Name | Size | Reference
Cityscapes labels ↔ Photos | 2975 | (cordts2016cityscapes, )
Maps ↔ Photos | 1096 | (isola2017image, )
Facades labels ↔ Photos | 400 | (tylevcek2013spatial, )
Summer ↔ Winter | Summer: 1273, Winter: 854 | (zhu2017unpaired, )
Style transfer | Monet: 1074, Cezanne: 584, Van Gogh: 401, Photo: 6853 | (zhu2017unpaired, )
CUHK students | 88 | (wang2009face, )
Table 4: Datasets
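The pre-processing and weight initialization described above can be sketched as follows; the resize size used to implement the jitter and the normalization to [-1, 1] (matching the Tanh output) are assumptions.

```python
# Sketch of pre-processing (random 128x128 crop, mirroring, jitter) and
# Gaussian(0, 0.02) weight initialization, as described in Sec. A.2.
import torch.nn as nn
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(143),                 # assumed jitter: resize then crop
    transforms.RandomCrop(128),
    transforms.RandomHorizontalFlip(),      # random mirroring
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # assumed [-1, 1] range
])

def init_weights(m):
    """Initialize conv weights from N(0, 0.02), as stated in the text."""
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# Usage: model.apply(init_weights)
```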

A.3 Label Encoder Network

Before training the discriminator and the decoder, we first train the two-layer auto-encoder to obtain the label encoder network. The loss function of the two-layer auto-encoder is:

$\mathcal{L}_{ae} = \sum_{i=1}^{N} d\big(l_i, \hat{l}_i\big),$   (7)

where $l_i$ is the label vector of the $i$-th domain, $\hat{l}_i$ is the output of the two-layer auto-encoder, and $d(\cdot,\cdot)$ is the reconstruction distance.

The architecture of the two-layer auto-encoder is listed in Table 5. Batch normalization is used in both layers. ReLU is used in the label encoder, and Leaky ReLU with a negative slope of 0.2 is used in the label decoder.

Part | Input Shape -> Output Shape | Layer Information
Label Encoder | (N,1,1) -> (512,2,2) | DECONV-(O:512, K:4x4, S:2, P:1), BN, ReLU
Label Decoder | (512,2,2) -> (N,1,1) | CONV-(O:N, K:4x4, S:2, P:1), BN, Leaky ReLU
Table 5: The architecture of the two-layer auto-encoder

We train the auto-encoder for a fixed number of epochs, which takes about 10 minutes in our experiments. The learning rate is 0.01 in the first half of the training procedure and then decays linearly to 0 over the second half. When training the decoder and the discriminator, we use the label encoder as a pre-trained and fixed network.
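A minimal pre-training loop for the label auto-encoder, following the learning-rate schedule above, might look like this; the default epoch count, the choice of optimizer and the mean-squared-error form of the distance in Eq.(7) are illustrative assumptions.

```python
# Hedged sketch: pre-train the two-layer label auto-encoder, then freeze it.
import torch
import torch.nn.functional as F

def pretrain_label_encoder(label_encoder, label_decoder, num_domains, epochs=100):
    params = list(label_encoder.parameters()) + list(label_decoder.parameters())
    opt = torch.optim.SGD(params, lr=0.01)              # optimizer choice assumed
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda e: 1.0 if e < epochs // 2
        else max(0.0, 1.0 - (e - epochs // 2) / (epochs // 2)))
    labels = torch.eye(num_domains).view(num_domains, num_domains, 1, 1)  # all one-hot labels
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.mse_loss(label_decoder(label_encoder(labels)), labels)   # Eq. (7), MSE assumed
        loss.backward()
        opt.step()
        sched.step()
    for p in params:                                     # fixed afterwards, as described
        p.requires_grad_(False)
```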
