Automatically Searching for U-Net Image Translator Architecture


Image translators have been successfully applied to many important low-level image processing tasks. However, the classical network architecture of image translators, such as U-Net, is borrowed from other vision tasks like biomedical image segmentation. This straightforward adaptation may not be optimal and can cause redundancy in the network structure. In this paper, we propose an automatic architecture search method for image translators. Using an evolutionary algorithm, we find a more efficient network architecture that costs fewer computational resources and achieves better performance than the original one. Extensive qualitative and quantitative experiments demonstrate the effectiveness of the proposed method. Moreover, we transplant the searched architecture to other datasets that are not involved in the search procedure. The efficiency of the searched architecture on these datasets further demonstrates the generalization ability of the method.

1 Introduction

Image-to-image translation is a fundamental research field in computer vision. Tasks such as domain translation [20], image super-resolution [8], image colorization [6], and image denoising [1] are widely used in the image processing pipelines of common applications. To translate an image from one domain to another, [7] predicts a per-pixel color histogram as an intermediate output to help the translation procedure. [19] poses the problem as a classification task and uses class rebalancing at training time to increase the realism of the results. The pix2pix method [4] first leveraged generative adversarial networks in a conditional setting to solve the image-to-image translation problem. [2] suggests that adversarial training might be unstable and prone to failure for high-resolution image generation tasks. Instead, they adopt a modified perceptual loss [5] to synthesize images, which are high-resolution but often lack fine details and realistic textures. [17] presented a novel adversarial loss as well as a coarse-to-fine generator and a multi-scale discriminator to address the high-resolution image-to-image translation problem. These methods have made great progress and achieve good performance on image-to-image translation tasks. However, the network architectures have become more complicated and consume more computation.

In addition, most of them study loss functions, weight normalization, and training techniques, and never deal with the architecture of the translator explicitly. The design of a deep neural network can have an enormous influence on its performance. By introducing residual blocks, ResNet [3] achieved much higher accuracy than VGGNet [16] on image classification while requiring fewer parameters. [12] introduced deep convolutional generative adversarial networks, which can generate better images. Although manually designed network architectures can achieve good performance in various tasks, recent progress has shown that architectures built by automatic search can outperform hand-crafted structures. [21] exploited reinforcement learning to generate architectures that achieve better performance than human-invented networks with fewer parameters. [14] employed evolutionary algorithms to search the architecture of neural networks. [9] proposed a more efficient method for learning the structure of CNNs using a sequential model-based optimization strategy. [10] introduced a continuous relaxation of the architecture representation to allow efficient search using gradient descent.

Figure 1: The framework of the proposed method for automatically searching the architecture for the image-to-image translation task. We search the channel numbers of each convolutional and deconvolutional layer, and the connections between the mirrored layers. A solid line represents the existence of a skip connection, while a dotted line represents its absence.

There are major differences between the visual recognition task and the image translation task. First, a generative model produced by an adversarial training process has to contend with a discriminator model, whereas traditional visual recognition only requires a discriminative model. Second, the architecture of an image generator is different from that of an image classifier. Some generators take U-Net [15] as the basic architecture, which has a symmetrical structure and skip connections between the convolutional layers and deconvolutional layers. The search space of U-Net is fundamentally different from that of ordinary deep neural networks for classification. It is therefore necessary to tailor a new neural architecture search method for the generator in the image translation task.

In this paper, we propose a novel framework to automatically search the architecture of an image translator using an evolutionary algorithm. We use U-Net as the backbone and explore the channel numbers of each convolutional and deconvolutional layer as well as the essentiality of every skip connection in U-Net. We carefully design the genetic algorithm so that it applies to the architecture search of U-Net. The network search is conducted on the cityscapes dataset. Then, the searched architecture is transplanted to other benchmark datasets for image-to-image translation. Extensive experiments demonstrate the effectiveness of the proposed method.

2 Preliminaries

Here we first briefly introduce the image-to-image translation task with generative adversarial networks and the genetic algorithm for automatically searching the network architecture.

2.1 Image-to-Image Translation

Image-to-image translation can be treated as a per-pixel classification or regression problem [6]. Recently, [4] used generative adversarial networks to solve this problem. In an image-to-image translation task, such as semantic map to street view on the cityscapes dataset, the objective function within the framework of GANs can be written as

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{y \sim p_{data}(y)}[\log D(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D(G(x)))], \tag{1}$$

where $X$ represents the source domain (e.g. semantic map), and $Y$ represents the target domain (e.g. street view). Denote the training samples in domain $X$ as $x$, and the corresponding images in domain $Y$ as $y$. The generator $G$ transforms images from the source domain to the target domain (i.e. $X \rightarrow Y$), while the discriminator $D$ distinguishes which domain its input comes from. The task of domain transfer using generator $G$ and discriminator $D$ is formulated as a minimax problem: the generator $G$ is optimized to generate images that fool the discriminator $D$, whereas the discriminator $D$ is optimized to distinguish which domain an image comes from.

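In the task of paired image-to-image translation [4], besides the regular adversarial loss for the generator, the conventional $L_1$ loss between the fake images from the generator and the real target images is often adopted, i.e.,

$$\mathcal{L}_{L_1}(G) = \mathbb{E}_{x, y}\left[\|y - G(x)\|_1\right]. \tag{2}$$

Thus, the objective function for an image-to-image translator [4] can be formulated as

$$G^* = \arg\min_G \max_D \mathcal{L}_{GAN}(G, D) + \lambda \mathcal{L}_{L_1}(G), \tag{3}$$

where $\mathcal{L}_{GAN}$ is the adversarial objective and $\lambda$ weights the $L_1$ term. With the help of the minimax design of the GAN and the $L_1$ loss, the generator can achieve satisfactory performance on the image-to-image translation task. However, the generator architecture U-Net [15] is directly borrowed from biomedical image segmentation and may not be the optimal choice for this supervised task.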
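As a rough illustration of how an adversarial term and an $L_1$ term combine on the generator side, the sketch below computes a per-batch generator loss in plain Python. The non-saturating form $-\log D(G(x))$, the function names, and the default weight `lam=100.0` are assumptions of this sketch, not settings stated in this paper.

```python
import math


def l1_loss(fake, real):
    """Mean absolute difference between two equally sized flat pixel lists."""
    assert len(fake) == len(real) and fake
    return sum(abs(f - r) for f, r in zip(fake, real)) / len(fake)


def generator_loss(d_scores_on_fake, fake, real, lam=100.0):
    """Adversarial term plus lam * L1 term for one batch.

    d_scores_on_fake: discriminator outputs in (0, 1) for generated images.
    The adversarial term uses the non-saturating form -log D(G(x)),
    which is an assumption of this sketch.
    """
    eps = 1e-12  # guards against log(0)
    adv = -sum(math.log(s + eps) for s in d_scores_on_fake) / len(d_scores_on_fake)
    return adv + lam * l1_loss(fake, real)
```

A generator that fools the discriminator (scores near 1) and matches the target pixels drives both terms toward zero.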

2.2 GA for Network Architecture Search

Neural architecture search has attracted much attention in the field of deep learning. Two popular ways to conduct the search are evolutionary algorithms and reinforcement learning. In this paper, we adopt an evolutionary algorithm to explore the architecture space, as suggested in [13].

In genetic algorithm (GA) based network architecture search, we maintain a population of individuals, each of which represents a certain network architecture. The individuals of the current generation are regarded as parents, which breed the next generation through three kinds of operations (selection, crossover, and mutation), with the expectation that the offspring perform better than the parents. After a sufficient number of generations, the remaining offspring are better suited to the designed task. Each individual is assigned a probability through a roulette algorithm according to its fitness:

$$P(I_i) = \frac{f(I_i)}{\sum_{j=1}^{K} f(I_j)}, \tag{4}$$

where $K$ is the number of individuals in the population, $I_i$ is the $i$-th individual encoding a specific neural architecture, and $f(I_i)$ is the fitness of $I_i$. The three operations are then executed with probabilities $p_s$, $p_c$, and $p_m$, respectively, where $p_s + p_c + p_m = 1$.

Selection: With probability $p_s$, selection is conducted. An individual selected according to Fcn. 4 is directly copied as an offspring. Individuals with higher fitness have a greater chance of being preserved.

Crossover: With probability $p_c$, crossover is conducted. Two individuals from the parent generation are selected according to Fcn. 4. A random piece of the two parents is swapped to generate two new offspring. This operation integrates excellent genetic fragments of the parents.

Mutation: With probability $p_m$, mutation is conducted. One individual from the parent generation is selected according to Fcn. 4. Mutation randomly changes a random piece of the parent individual to produce an offspring. The common mutation operation for binary encoding is the XOR operation. This operation increases the diversity of the population.

By iteratively applying these three genetic operations, the initial population is updated until the maximum number of iterations is reached. The individual with the best fitness then gives a new network architecture. The key points in applying the genetic algorithm to network architecture search are the design of the representation code for each individual and of the fitness, which evaluates the performance of each individual on a specific task.
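The three operations described above can be sketched on binary codes as follows. This is a minimal illustration, assuming list-encoded individuals of length at least two, single-point crossover, and a single-bit flip for mutation; the helper names are hypothetical and not taken from the paper.

```python
import random


def roulette_probabilities(fitnesses):
    """Selection probability of each individual, proportional to fitness."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]


def roulette_pick(population, fitnesses, rng):
    """Draw one individual with probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]


def crossover(a, b, rng):
    """Single-point crossover: swap the tails of two equal-length codes."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate_binary(code, rng):
    """Flip one randomly chosen bit (XOR with 1)."""
    i = rng.randrange(len(code))
    return code[:i] + [code[i] ^ 1] + code[i + 1:]


def next_individuals(population, fitnesses, p_s, p_c, p_m, rng):
    """Produce one or two offspring by selection, crossover, or mutation."""
    assert abs(p_s + p_c + p_m - 1.0) < 1e-9
    r = rng.random()
    if r < p_s:
        return [list(roulette_pick(population, fitnesses, rng))]
    if r < p_s + p_c:
        a = roulette_pick(population, fitnesses, rng)
        b = roulette_pick(population, fitnesses, rng)
        return list(crossover(a, b, rng))
    return [mutate_binary(roulette_pick(population, fitnesses, rng), rng)]
```

Passing an explicit `random.Random` instance keeps a search run reproducible, which helps when comparing fitness settings.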

3 Method

In this section, we introduce the details of the proposed method, including the search space, the representation of an individual architecture, and how the evolutionary algorithm is applied during the search process. We use U-Net as the backbone throughout.

3.1 Search Space of Architecture

U-Net [15] was first proposed for biomedical image segmentation and is now widely used in image-to-image translation. U-Net is similar to an encoder-decoder network composed of sequential convolutional and deconvolutional layers. Through the network, feature maps are down-sampled while the number of channels grows, and are then up-sampled to the original size in the reverse manner [4]. Unlike the conventional encoder-decoder structure, U-Net has skip connections between mirrored layers in the encoder and decoder stacks. In this work, we follow the encoder-decoder fashion, but explore the channel numbers of each convolutional and deconvolutional layer as well as the essentiality of each skip connection. In addition, the filter size and stride of all convolutional layers are kept fixed, to extract visual features with considerable receptive fields and to keep the encoder-decoder architecture consistent [15]; they are not searched. The search space of these configurations is therefore defined by two sets.

$\mathcal{C}$ is the set of available channel numbers, which includes all the optional output channel numbers of each convolutional layer. Due to the symmetry of U-Net, we only have to operate on the first half of the network to determine the whole one. $\mathcal{S} = \{0, 1\}$ is the set of available choices for each skip connection: 1 keeps a skip connection, while 0 removes it. In the U-Net generator used in [4], the mirrored convolutional and deconvolutional layers are paired through skip connections. With $l_c$ searched channel positions and $l_s$ skip connections, the search space contains $|\mathcal{C}|^{l_c} \cdot 2^{l_s}$ configurations, which cannot be efficiently explored by conventional methods.
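To give a feeling for how quickly this space grows, the following sketch counts the configurations for a given number of channel choices, searched layers, and skip connections. The concrete numbers in the example call are hypothetical and are not the settings used in the paper.

```python
def search_space_size(num_channel_choices, num_channel_slots, num_skip_connections):
    """Number of (channel code, connection code) pairs.

    Each channel slot picks one of num_channel_choices values and each
    skip connection is either kept (1) or removed (0).
    """
    return num_channel_choices ** num_channel_slots * 2 ** num_skip_connections


# Hypothetical setting: 6 candidate widths, 8 searched layers, 7 skip connections.
print(search_space_size(6, 8, 7))
```

Even for such a small setting the count runs into the hundreds of millions, and each candidate would need its own training run to be evaluated exhaustively.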

3.2 Representation of Search Architecture

We use two fixed-length codes to represent each variant of U-Net. $c$ is the code for output channel numbers, which defines the output channel numbers of the first half of the convolutional layers. $s$ is the code for skip connections, which determines whether each skip connection is kept or removed. $l_c$ is the length of $c$, and $l_s$ is the length of $s$. To produce an initial individual, a bootstrap sampling from the set of channel choices $\mathcal{C}$ is repeated $l_c$ times to generate a channel number code $c$, and a bootstrap sampling from the set of connection choices $\mathcal{S}$ is repeated $l_s$ times to generate a connection code $s$. Every pair $(c, s)$ represents a particular individual in the search space. For the original U-Net, for example, every entry of $s$ is 1, since all skip connections are kept. The decoding process of the proposed method for constructing a new network is shown in Fig. 2. The first step is to determine the pair of codes $(c, s)$, either from an initial individual or from the genetic operations of selection, crossover, and mutation. Then, a new network with a specific architecture is established according to the channel number code $c$. Finally, the numbers of input and output channels in each layer are calculated based on the symmetry of U-Net and on the connection code $s$, which indicates the remaining skip connections, to form the resulting generator network.
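The decoding step can be sketched as a small function that turns a channel code and a connection code into per-layer input and output channel counts. This is an illustration under assumptions: the decoder mirrors the encoder widths, a kept skip connection concatenates the mirrored encoder features, the bottleneck layer has no skip, and the function name and the 3-channel input and output are hypothetical.

```python
def decode_unet(channel_code, skip_code, in_channels=3, out_channels=3):
    """Return (encoder, decoder) lists of (input_channels, output_channels).

    channel_code[i] is the output width of encoder layer i.
    skip_code[i] == 1 keeps the skip connection from encoder layer i to its
    mirrored decoder layer; the innermost (bottleneck) layer has no skip.
    """
    n = len(channel_code)
    assert len(skip_code) == n - 1

    encoder = []
    prev = in_channels
    for width in channel_code:
        encoder.append((prev, width))
        prev = width

    decoder = []
    for k in range(n):
        if k == 0:
            dec_in = channel_code[-1]  # bottleneck output
        else:
            mirrored = n - 1 - k  # encoder layer at the same resolution
            dec_in = channel_code[mirrored]
            if skip_code[mirrored]:
                dec_in += channel_code[mirrored]  # concatenated skip features
        dec_out = channel_code[n - 2 - k] if k < n - 1 else out_channels
        decoder.append((dec_in, dec_out))
    return encoder, decoder
```

For instance, `decode_unet([64, 128, 256], [1, 0])` keeps only the outermost skip connection, so only the last decoder layer sees a doubled input width.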

Figure 2: The decoding procedure of the proposed method for reconstructing a U-Net using an individual. The channel number of each layer and the connections will be recognized from the given individual. Then, the number of convolution filters in each layer can be set accordingly.

3.3 Evolutionary Strategy

Usually, the generator and the discriminator in a conventional adversarial network are optimized alternately, and the discriminator is discarded after the training procedure. Therefore, we propose to use the genetic algorithm to update the population in order to obtain better architectures of the generator network, i.e., $G$. The objective function for the generator is formulated as


To evaluate each individual architecture in every generation of the population, we have to consider both model performance and computational efficiency. After a few mini-epochs of training, we run a validation procedure on each individual architecture. The individual fitness is calculated as


where $\mathrm{FLOPs}(\cdot)$ is the number of FLOPs required by the given architecture to process each input image, and


where $\mathcal{L}_G$ represents the cumulative loss of the generator during the validation procedure, which reflects the image translation quality of the architecture. $\mathcal{L}_G$ is composed of a generator loss and the $L_1$ loss weighted by a hyperparameter $\lambda$. The hyperparameter $\gamma$ balances computation cost and model performance: a larger $\gamma$ means that the evaluation of individuals focuses more on reducing the computation cost. Note that both the FLOPs and the generator loss are negatively correlated with the fitness of a network architecture, so a reciprocal is applied in the fitness function.
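One fitness of the shape described above can be sketched as follows. The additive combination of loss and weighted FLOPs inside the reciprocal is an assumption made for illustration; the text only states that both quantities lower the fitness, that a balancing hyperparameter is involved, and that a reciprocal is used. The function and argument names are hypothetical.

```python
def fitness(generator_loss, flops, gamma):
    """Reciprocal fitness: lower loss and fewer FLOPs give a higher value.

    The additive form loss + gamma * flops is an assumption of this sketch.
    """
    denominator = generator_loss + gamma * flops
    assert denominator > 0, "loss and weighted FLOPs must not both be zero"
    return 1.0 / denominator
```

With `gamma = 0` only the validation loss matters, so a large, accurate model wins; as `gamma` grows, a smaller model with a slightly higher loss can obtain the higher fitness.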

3.4 Search Procedure

We adopt the classic genetic algorithm to conduct the network architecture search, as described in Alg. 1. First, we initialize the first generation of the population by random sampling from the set of channel choices $\mathcal{C}$ and the set of connection choices $\mathcal{S}$. For computation speed, we select subsets of the dataset for fast training and validation, and denote them as the mini-train set and the mini-validation set. Then, every individual in the population is trained on the mini-train set and evaluated on the mini-validation set to calculate its fitness using Fcn. 7. From the fitness values we determine the probability of each individual being selected. After that, we use the three genetic operations (selection, crossover, and mutation) to generate offspring in each iteration.

By iteratively updating the individuals in the population with the genetic operations $T$ times, a new generator architecture is discovered that, with high probability, performs better than the original architecture on the image-to-image translation task. For easier implementation, we apply the genetic algorithm separately to the channel number code $c$ and the connection code $s$. For selection and crossover, the regular strategy of the genetic algorithm is adopted. Mutation requires a special strategy because the channel number code is not binary, so the XOR operation cannot be used directly. For the channel number code $c$, we design a dedicated mutation operation: for an element to be mutated, we remove its current channel number from the set $\mathcal{C}$ and randomly select one of the remaining numbers as the new channel number.
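The mutation of the non-binary channel code described above can be sketched in a few lines. The function name and the candidate widths used in the usage note are hypothetical.

```python
import random


def mutate_channel_code(code, choices, rng):
    """Replace one randomly chosen entry with a different value from choices."""
    i = rng.randrange(len(code))
    # Exclude the current width so the mutation always changes the layer.
    alternatives = [c for c in choices if c != code[i]]
    mutated = list(code)
    mutated[i] = rng.choice(alternatives)
    return mutated
```

For example, `mutate_channel_code([64, 64, 128], [16, 32, 64, 128], random.Random(1))` returns a code that differs from the input in exactly one position.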

Input:  Training set from two domains $X$ and $Y$ including paired images, the number of individuals $K$, the maximum iteration number $T$ in the genetic algorithm, $p_s$, $p_c$, $p_m$, and training parameters, etc.
1:  Randomly initialize the population $P_0$ with $K$ individuals according to the search space;
2:  for $t = 1$ to $T$ do
3:     Train each individual in $P_{t-1}$ on the mini-train set;
4:     Test each trained individual on the mini-validation set;
5:     Calculate the fitness of each individual in $P_{t-1}$ (Fcn. 7);
6:     Obtain the selection probabilities (Fcn. 4);
7:     for $k = 1$ to $K$ do
8:        Preserve the best individual in $P_{t-1}$ into $P_t$;
9:        Generate a random value $r \in [0, 1]$;
10:        Conduct selection, crossover, and mutation for generating new individuals according to $r$;
11:     end for
12:  end for
13:  Update the fitness of the individuals in $P_T$;
14:  Establish a new generator network from the best individual in $P_T$;
Output:  The new generator with the searched architecture, fine-tuned on the entire training set.
Algorithm 1: Evolutionary search for the generator network.
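The overall loop of Alg. 1 can be sketched in Python as below. This is a simplified illustration, not the authors' implementation: the expensive mini-train and mini-validation steps are replaced by an arbitrary `evaluate` callable, the best individual is copied once per generation, crossover keeps only one of the two possible children, and every name is hypothetical. Both codes are assumed to have length at least two.

```python
import random


def evolve(init_population, evaluate, choices, generations, p_s, p_c, p_m, rng):
    """Run a small genetic search and return the fittest (channels, skips) pair.

    evaluate(individual) must return a positive fitness; in the paper this
    would involve mini-train and mini-validation, here it is any callable.
    """
    assert abs(p_s + p_c + p_m - 1.0) < 1e-9
    population = [(list(c), list(s)) for c, s in init_population]
    size = len(population)
    for _ in range(generations):
        fits = [evaluate(ind) for ind in population]
        best = population[fits.index(max(fits))]
        offspring = [best]  # elitism: keep the best individual
        while len(offspring) < size:
            r = rng.random()
            a, b = rng.choices(population, weights=fits, k=2)
            if r < p_s:  # selection: copy a parent
                child = (list(a[0]), list(a[1]))
            elif r < p_s + p_c:  # crossover, applied to each code separately
                pc = rng.randrange(1, len(a[0]))
                ps = rng.randrange(1, len(a[1]))
                child = (a[0][:pc] + b[0][pc:], a[1][:ps] + b[1][ps:])
            else:  # mutation: new width for one layer, flip one skip bit
                ch, sk = list(a[0]), list(a[1])
                i = rng.randrange(len(ch))
                ch[i] = rng.choice([c for c in choices if c != ch[i]])
                j = rng.randrange(len(sk))
                sk[j] ^= 1
                child = (ch, sk)
            offspring.append(child)
        population = offspring
    fits = [evaluate(ind) for ind in population]
    return population[fits.index(max(fits))]
```

Because the best individual is always carried over, the best fitness in the population never decreases from one generation to the next.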
Figure 3: Images generated by generators searched with the proposed method under different values of the hyperparameter $\gamma$, which balances the computation FLOPs and the accumulated generator loss. A lower $\gamma$ yields a model with larger FLOPs. The columns include the input images and the original results.

4 Experiments

We evaluate the effectiveness of the proposed method in searching for the optimal architecture of an image generator on the image-to-image translation task. Extensive experiments are conducted on several paired image-to-image translation datasets, including cityscapes, facades, and maps. For a fair comparison, we use the same training strategy and the same discriminator structure as pix2pix [4].

For the parameter settings, the search is governed by a maximum number of generations and a population size; the probabilities of selection, crossover, and mutation in each generation are set as suggested in [18]. The lengths of the channel number code and of the skip connection code follow from the input and output image sizes. First, we conduct architecture search experiments on the cityscapes dataset with different hyperparameters. Then, we transfer the searched architecture to other image-to-image translation datasets, such as maps and facades, which are not involved in the search procedure, to demonstrate the generalization ability of the searched architecture.

Figure 4: Some translation results on the maps dataset and the facades dataset, respectively. For each dataset, the columns show the input images, the original results, and the results of the proposed method. The network architecture is directly borrowed from the search result on the cityscapes dataset. The size of the searched model is about 22MB compared to 208MB for the original model, and the FLOPs of the searched network are significantly fewer than those of the original one.

4.1 Search on Cityscapes Dataset

We conduct network architecture search experiments on the cityscapes dataset, translating semantic maps to street views. We use the described genetic algorithm to explore the defined search space. In the experiments, we adopt different values of the hyperparameter $\gamma$ to balance model performance and computation cost. A smaller $\gamma$ puts less weight on the network FLOPs, resulting in more computation and better image translation quality.

Method         Memory   FLOPs    Mean Pixel Acc.  Mean Class Acc.  Mean Class IoU
Original [4]   208 MB   18147M   0.723            0.244            0.186
Searched       MB       8422M    0.717            0.241            0.184
Searched       MB       15363M   0.738            0.248            0.190
Searched       MB       47431M   0.744            0.250            0.197
Table 1: FCN scores of different generators calculated on the cityscapes dataset.
Figure 5: Comparison of the original and searched network architectures. The top shows the architecture of the original U-Net. The bottom shows the architecture of a searched U-Net. The two numbers in each box are the input and output channel numbers of each convolutional or deconvolutional layer, respectively. The searched U-Net has sparse skip connections compared to the original one, and its channel numbers do not increase as the network goes deeper.

To further evaluate the effectiveness of the search method, we conduct quantitative experiments on the searched architectures. We adopt the "FCN-score" proposed in [4] to evaluate each architecture numerically: a pre-trained FCN-8s [11] network is used to conduct the segmentation task, and three measurements (mean pixel accuracy, mean class accuracy, and mean class IoU) are calculated. The results are shown in Tab. 1. The FLOPs of the model decrease as $\gamma$ increases, but mean pixel accuracy, mean class accuracy, and mean class IoU decrease as well. For a suitable $\gamma$, the searched architecture achieves higher FCN scores than the original model, while its memory size is only 22MB, almost 1/10 of the original model. With a larger $\gamma$, we obtain a model that costs less than half the FLOPs of the original model (8422M versus 18147M) but achieves slightly lower scores. In this way, we can obtain various generator architectures suited to different applications. In addition, we can conclude that model performance is more closely related to computation FLOPs than to the memory size of the model. Interestingly, all of the searched models have fewer parameters than the original model, which means that the searched architectures require less memory while achieving comparable or even better performance than the original one. The parameter redundancy of the original U-Net is severe in the image-to-image translation task. In Fig. 3, we show some results of image translation from semantic maps to street views. With a lower $\gamma$, the model has more FLOPs, which leads to more computation cost but better image translation quality.
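The three measurements can all be derived from a confusion matrix between the ground-truth labels and the labels predicted by the segmentation network. The sketch below shows the standard definitions; it is not the evaluation script used for Tab. 1, and details such as how ignored labels or absent classes are handled are assumptions.

```python
def segmentation_scores(confusion):
    """Per-pixel accuracy, mean per-class accuracy and mean class IoU.

    confusion[i][j] counts pixels of true class i predicted as class j.
    Classes that never occur in the ground truth are skipped in the means.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    class_acc, class_iou = [], []
    for i in range(n):
        truth = sum(confusion[i])
        predicted = sum(confusion[r][i] for r in range(n))
        if truth == 0:
            continue
        class_acc.append(confusion[i][i] / truth)
        class_iou.append(confusion[i][i] / (truth + predicted - confusion[i][i]))
    pixel_acc = correct / total
    return pixel_acc, sum(class_acc) / len(class_acc), sum(class_iou) / len(class_iou)
```

Pixel accuracy is dominated by frequent classes such as road and sky, which is one reason it can sit far above the class-averaged scores, as it does in Tab. 1.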

4.2 Architecture Transfer to Other Datasets

After the network search experiments on the cityscapes dataset, we obtain new generator architectures. To demonstrate the effectiveness of the searched architecture, we use it to train on other paired image-to-image tasks, such as maps to satellite images and color-coded labels to real facades. For a fair comparison, we use a searched architecture with fewer FLOPs than the original U-Net. Fig. 4 shows some results of the image transfer tasks on maps and facades, respectively. Visually, the newly searched network outperforms the original network architecture with fewer FLOPs and far fewer parameters. On the maps dataset, the searched network generates satellite images that are more consistent with the input maps. On the facades dataset, the searched network generates images with more realistic windows and doors. It is worth noting that neither the maps dataset nor the facades dataset is involved in the architecture search. This further demonstrates the generalization ability of the searched architecture and the effectiveness of the search method.

4.3 Discussion of the Searched Architecture

Through the previous architecture search experiments, we obtain new generator architectures. As shown in Fig. 5, we compare a searched architecture with the original network architecture. The two architectures have comparable FLOPs, with the new one having slightly fewer. In the searched architecture, we find that it is not necessary to increase the number of channels as the network goes deeper. A few wide layers at specific positions are enough for good performance. Similar phenomena can also be observed in the other two architectures in Tab. 1.

In addition, not every connection is essential: the first and third skip connections are preserved while the other skip connections are removed. This shows that skip connections involving layers with larger feature maps are more essential in the image-to-image translation task. It also suggests that long-distance connections are more important than close connections, since a far skip connection mixes less correlated information, whereas a close skip connection consumes more computation while carrying less important information. Thus, most of the connections are removed.

As for the channel numbers of the convolutional and deconvolutional layers, the original U-Net has an encoder-decoder structure that gradually gets wider as convolutional layers are stacked. It reaches its widest channel numbers as the feature map approaches the bottleneck. Then, the channel numbers of the subsequent deconvolutional layers decrease accordingly. In the newly searched architecture, this principle seems to be broken. Shallow layers tend to have more channels, while layers near the bottleneck tend to have fewer channels, so more of the computation cost lies in the shallow layers. In the image-to-image translation task, more carefully designed shallow layers help to restore the details of the image.
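Why the shallow layers carry most of the computation can be seen from the usual multiply-accumulate count of a convolution, which scales with the spatial size of the output feature map. The sketch below uses that count as a proxy for FLOPs; the default kernel size of 4 and the layer sizes in the usage note are hypothetical values chosen for illustration.

```python
def conv_flops(in_channels, out_channels, out_height, out_width, kernel=4):
    """Multiply-accumulate count of one convolution, ignoring bias."""
    return kernel * kernel * in_channels * out_channels * out_height * out_width
```

For example, `conv_flops(64, 128, 64, 64)` is 32 times `conv_flops(512, 512, 2, 2)`: a narrow layer on a 64×64 feature map costs far more than a wide layer on a 2×2 map near the bottleneck.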

5 Conclusion

In this paper, we propose a novel automatic network search method for paired image-to-image translators. Using an evolutionary algorithm, we search the channel numbers of each convolutional and deconvolutional layer and the presence of each skip connection for a better generator architecture. We carefully design the genetic algorithm for the architecture search of U-Net. Extensive experiments with different hyperparameters are conducted to balance model performance and computation cost. Through experiments on the cityscapes dataset, we obtain a high-performance model with fewer FLOPs than the original U-Net backbone. The FCN scores of the new architecture on cityscapes segmentation exceed those of the original U-Net. Furthermore, we transfer this architecture to other paired image-to-image translation tasks, including maps and facades, where the new architecture outperforms the original one. Finally, we compare the new architecture with the original one, which may offer guidance for the design of network architectures.


  1. J. Chen, J. Chen, H. Chao and M. Yang (2018) Image blind denoising with generative adversarial network based noise modeling. In CVPR, pp. 3155–3164.
  2. Q. Chen and V. Koltun (2017) Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1511–1520.
  3. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  4. P. Isola, J. Zhu, T. Zhou and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. arXiv preprint.
  5. J. Johnson, A. Alahi and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711.
  6. G. Larsson, M. Maire and G. Shakhnarovich (2016) Learning representations for automatic colorization. In European Conference on Computer Vision, pp. 577–593.
  7. G. Larsson, M. Maire and G. Shakhnarovich (2016) Learning representations for automatic colorization. In European Conference on Computer Vision, pp. 577–593.
  8. C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz and Z. Wang (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pp. 4.
  9. C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang and K. Murphy (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34.
  10. H. Liu, K. Simonyan and Y. Yang (2018) DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055.
  11. J. Long, E. Shelhamer and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR.
  12. A. Radford, L. Metz and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  13. E. Real, A. Aggarwal, Y. Huang and Q. V. Le (2018) Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548.
  14. E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le and A. Kurakin (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2902–2911.
  15. O. Ronneberger, P. Fischer and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
  16. K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  17. T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807.
  18. Y. Wang, C. Xu, J. Qiu, C. Xu and D. Tao (2018) Towards evolutional compression. In SIGKDD.
  19. R. Zhang, P. Isola and A. A. Efros (2016) Colorful image colorization. In European Conference on Computer Vision, pp. 649–666.
  20. J. Zhu, T. Park, P. Isola and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint.
  21. B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.