# Unaligned Image-to-Sequence Transformation with Loop Consistency

###### Abstract

We tackle the problem of modeling sequential visual phenomena. Given examples of a phenomenon that can be divided into discrete time steps, we aim to take an input from any such time step and realize this input at all other time steps in the sequence. Furthermore, we aim to do this without ground-truth aligned sequences, avoiding the difficulty of gathering aligned data. This generalizes the unpaired image-to-image problem from generating pairs to generating sequences. We extend cycle consistency to loop consistency and alleviate difficulties associated with learning in the resulting long chains of computation. We show competitive results compared to existing image-to-image techniques when modeling several different data sets, including the Earth’s seasons and the aging of human faces.

## 1 Introduction

Image-to-image translation has gained tremendous attention in recent years. A pioneering work (isola2017image) shows that it is possible to realize a real image from one domain as a highly realistic and semantically meaningful image in another when paired data between the domains are available. CycleGAN (zhu2017unpaired) extended the image-to-image translation framework to the unpaired setting by relying on the ability to build a strong prior in each domain with generative adversarial networks (GANs) (goodfellow2014generative) and by enforcing consistency on the cyclic transformation from and to a domain. Similar methods (kim2017learning; liu2017unsupervised) were developed at roughly the same time. Since its introduction, CycleGAN has become a widely adopted technique with applications even beyond computer vision (fu2018style). However, CycleGAN-family models are still limited in that they only handle translation (directly) between two domains. Modeling more than two domains requires a separate instantiation of CycleGAN for every pair of domains, so the number of models grows quadratically with the number of domains. A major recent work, StarGAN (choi2018stargan), addresses this by facilitating a fully connected domain-translation graph, allowing transformation between two arbitrary domains with a single model. This flexibility, however, appears restricted to domains corresponding to specific attribute changes such as emotions and appearance.

Within nature, a multitude of settings exist where neither a set of pairs nor a fully connected graph is the most natural representation of how one might proceed from one domain to another. In particular, many natural processes are sequential, and the translation process should reflect this. A common phenomenon modeled as an image-to-image task is the visual change of natural scenes between two seasons (zhu2017unpaired), e.g., Winter and Summer. This neglects the fact that nature first proceeds to Spring after Winter and to Fall after Summer, so the pairing induces a very discontinuous reflection of the underlying process. Instead, we hope that by modeling a higher-resolution discretization of this process, the model can more realistically approach the true process while reducing the necessary model complexity.

It is difficult to obtain paired data for many image-to-image problems. Aligned sequential data are even more difficult to come by. It is more plausible to gather a large number of examples from each step (domain) in a sequence without correspondences between the content of the examples. Therefore, we consider a setting similar to unpaired image-to-image transformation where we only have access to unaligned examples from each time step of the sequence being modeled. Given an example from an arbitrary point in the sequence, we then generate an aligned sequence over all other time steps, expecting a faithful realization of the image at each step. The key condition required is that after generating an entire loop (returning from the last domain to the input domain), one should expect to return to the original input. This is quite a weak condition and promotes model flexibility. We denote this extension of the cycle consistency of (zhu2017unpaired) as loop consistency and therefore name our approach Loop-Consistent Generative Adversarial Networks (LoopGAN). This is a departure from many image-to-image approaches that have very short (usually length 2) paths of computation defining what it means to have gone “there and back”, e.g. the ability to enforce reconstruction or consistency. Since we do not have aligned sequences, the lengths of these paths for LoopGAN are as large as the number of domains being modeled and require different approaches to make learning feasible. These problems are not entirely different from those that often arise in recurrent neural networks, and our model can be viewed as a memory-less recurrent structure applied to images.

We apply our method to the sequential phenomena of human aging (zhifei2017cvpr) and the seasons of the Alps (anoosheh2018combogan), with extensive comparisons against baseline methods for image-to-image translation. We also present additional results on gradually changing the azimuth angle of chairs and on gradual changes of face attributes to showcase the flexibility of our model. We show favorable results against the baseline methods in spite of allowing them substantially larger model complexity.

## 2 Related Work

##### Generative Adversarial Networks

Generative adversarial networks (GANs) (goodfellow2014generative) implicitly model a distribution through two components. A generator $G$ transforms a sample from a simple prior noise distribution into a sample from the learned distribution over observable data. A second component known as the discriminator $D$, usually a classifier, attempts to distinguish the generations of $G$ from samples of the data distribution. This forms a minimax game in which $G$ and $D$ adapt to one another until some equilibrium is reached.

##### Unpaired Image-to-Image Transformation

As an extension to the image-to-image translation framework (pix2pix) (isola2017image), (zhu2017unpaired) proposed CycleGAN, which has a similar architecture to that of (isola2017image) but is able to learn a transformation between two domains without paired training data. To achieve this, CycleGAN simultaneously trains two generators, one for each direction between the two domains. Besides the GAN loss enforced by domain-wise discriminators, the authors add a cycle-consistency loss that forces the two generators to be inverses of each other. Similar to pix2pix, this model aims at learning a transformation between two domains and cannot be directly applied in a multi-domain setting that involves more than two domains. Concurrent to CycleGAN, (liu2017unsupervised) proposed a method named UNIT that implicitly achieves alignment between two domains using a VAE-like structure where both domains share a common latent space. Furthermore, StarGAN (choi2018stargan) proposed an image-to-image translation model for multiple domains. A single network takes inputs defining the source image and the desired domain transformation; however, it has mainly been shown to be successful for domains consisting of facial attributes and expressions.

##### Multi-Modal Transformation

The problem of learning non-deterministic, multi-modal transformations between two image domains has seen progress in recent years (huang2018multimodal; liu2018unified). The common approach that achieves good performance is to embed the images of both domains into a shared latent space. At test time, an input image in the source domain is first embedded into the shared latent space and then decoded into the target domain conditioned on a random noise vector. These models avoid the restriction to a one-to-one deterministic mapping and are able to learn different transformations of the same input image. However, they are developed exclusively for two-domain transformation and cannot be directly applied to problems with more than two domains.

##### Style Transfer

A specific task in image-to-image transformation called style transfer is broadly defined as the task of transforming a photo into an artistic style while preserving its content (gatys2015neural; johnson2016perceptual). Common approaches use a pre-trained CNN as a feature extractor and optimize the output image to match low-level features with those of the style image and high-level features with those of the content image (gatys2015neural; johnson2016perceptual). A network architecture innovation made popular by this field, known as AdaIN (huang2017arbitrary; dumoulin2017learned), combines instance normalization with learned affine parameters. It needs just a small set of parameters, compared to the main network weights, to achieve different style transfers within the same network. It also shows great potential in improving image quality for image generation (karras2018style) and image-to-image transformation (huang2018multimodal).

##### Face Aging

Generating a series of faces at different ages given a single face image has been widely studied in computer vision. State-of-the-art methods (zhifei2017cvpr; palsson2018generative) combine a pre-trained age estimator with a GAN to learn to transform the given image to different ages such that the results are both age-accurate and preserve the original facial structure. They rely heavily on a domain-specific age estimator and thus have limited application to the more general sequential image generation tasks that we tackle here.

##### Video Prediction

Video prediction attempts to predict some number of future frames of a video based on a set of input frames (xingjian2015convolutional; vondrick2016generating). Full videos with annotated input frames and target frames are often required for training these models. A combination of RNN and CNN models has seen success in this task (srivastava2015unsupervised; xingjian2015convolutional). Predictive vision techniques (vondrick2016generating; vondrick2017generating; wang2019eidetic) that use a CNN or RNN to generate future videos also require aligned video clips in training. A recent work (gupta2018social) added a GAN as an extra layer of supervision for learning human trajectories. At a high level, video prediction can be seen as a supervised setting of our unsupervised task. Moreover, video prediction mostly aims at predicting the movement of objects rather than the transformation of a still object or scene, which is the focus of our task.

## 3 Method

We formulate our method and objectives. Consider a setting of $n$ domains $X_1, \ldots, X_n$, where $i < j$ implies that $X_i$ occurs temporally before $X_j$. This defines a sequence of domains. To make this independent of the starting domain, we additionally expect that we can translate from $X_n$ to $X_1$, something known a priori when the sequence represents a periodic phenomenon. We define a single generator $G(x, i)$, where $i \in \{1, \ldots, n\}$ and $x \in X_i$. Then, a translation of an input $x_a \in X_a$ between two domains $X_a$ and $X_b$ is given by repeated applications of $G$ in the form $G^{k}(x_a, a)$, where $k$ is the number of steps from $X_a$ to $X_b$ along the sequence and the second argument is incremented after each application of $G$, wrapping around from $n$ to $1$. By applying $G$ to an input $n$ times, we form a direct loop of translations where the source and target domains are equal. While we use a single generator, we make use of $n$ discriminators $D_1, \ldots, D_n$, where $D_i$ is tasked with discriminating between real examples of $X_i$ and translations from any source domain to $X_i$. Since we are given only samples from each domain, we refer to each domain $X_i$ as consisting of $N_i$ examples $x_i$ drawn from the data distribution $p_{\text{data}}(x_i)$.
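The indexing convention of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `next_domain`, `steps_between`, `translate`, and `full_loop` are hypothetical helper names, domains are 1-indexed, and the generator is passed in as an arbitrary callable `G(x, i)` that maps an example of domain `i` to the following domain.

```python
def next_domain(i, n):
    """Return the domain that follows domain i (1-indexed), wrapping n -> 1."""
    return i % n + 1


def steps_between(a, b, n):
    """Number of generator applications needed to go from domain a to domain b."""
    return (b - a) % n


def translate(x, a, b, n, G):
    """Apply G repeatedly to carry x from domain a to domain b.

    G is any callable G(x, i) translating x from domain i to the next domain.
    """
    i = a
    for _ in range(steps_between(a, b, n)):
        x = G(x, i)
        i = next_domain(i, n)
    return x


def full_loop(x, a, n, G):
    """Apply G exactly n times so that source and target domains coincide."""
    i = a
    for _ in range(n):
        x = G(x, i)
        i = next_domain(i, n)
    return x


# Toy generator (hypothetical): it only records which domains were visited.
toy_G = lambda path, i: path + [i % 4 + 1]
print(translate([3], 3, 2, 4, toy_G))  # visits domains 3 -> 4 -> 1 -> 2
```

With four domains, going from domain 3 to domain 2 takes three applications because the sequence has to pass through the wrap-around from 4 to 1.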

### 3.1 Adversarial Loss

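Suppose $x_j \sim p_{\text{data}}(x_j)$ is an example from a source domain $X_j$. Then we expect that for all other domains $X_i$, $i \neq j$, its translation $G_{j \to i}(x_j)$ should be indistinguishable under $D_i$ from (true) examples drawn from $X_i$. Additionally, each $D_i$ should aim to minimize the ability of $G$ to generate examples that it cannot identify as fake. This forms the adversarial objective for a specific domain $X_i$ as:

$$\mathcal{L}_{\text{adv}}(G, D_i) = \mathbb{E}_{x_i \sim p_{\text{data}}(x_i)}\big[\log D_i(x_i)\big] + \sum_{j \neq i} \mathbb{E}_{x_j \sim p_{\text{data}}(x_j)}\big[\log\big(1 - D_i(G_{j \to i}(x_j))\big)\big]$$

where $G_{j \to i}(x_j)$ denotes iteratively applying $G$ until $x_j$ is transformed into domain $X_i$, i.e. $(i - j) \bmod n$ times. Taking this over all domains, we get an overall adversarial objective as:

$$\mathcal{L}_{\text{adv}}(G, D) = \mathbb{E}_{i \sim p(i)}\big[\mathcal{L}_{\text{adv}}(G, D_i)\big]$$

where $D = \{D_1, \ldots, D_n\}$ and $p(i)$ is a prior on the set of domains, e.g. uniform.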
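To make the structure of the per-domain adversarial objective concrete, the following is a minimal sketch of a Monte Carlo estimate, assuming the standard GAN log-loss. The function name `adversarial_objective` and its arguments are hypothetical, the discriminator is any callable returning a probability of being real, and the translations into the target domain are assumed to have been generated beforehand.

```python
import math


def adversarial_objective(i, real_batches, fake_batches, D):
    """Monte Carlo estimate of the adversarial objective for target domain i.

    real_batches[i] -- list of real examples from domain i
    fake_batches[j] -- list of translations into domain i from source j
    D               -- callable D(x) returning P(x is real), strictly in (0, 1)
    """
    real = real_batches[i]
    value = sum(math.log(D(x)) for x in real) / len(real)
    for j, fakes in fake_batches.items():
        if j == i:
            continue  # only the other domains act as sources
        value += sum(math.log(1.0 - D(x)) for x in fakes) / len(fakes)
    return value


# Toy usage (assumed data): each "example" is already the score D assigns to it.
identity_D = lambda score: score
value = adversarial_objective(
    1,
    real_batches={1: [0.9, 0.8]},
    fake_batches={2: [0.1], 3: [0.2]},
    D=identity_D,
)
print(value)
```

The discriminator for domain `i` would try to increase this value while the generator tries to decrease it.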

### 3.2 Loop Consistency Loss

Within (zhu2017unpaired), an adversarial loss was supplemented with a cycle consistency loss that ensured that applying the generator from domain $A$ to domain $B$, followed by applying a separate generator from $B$ to $A$, acts like an identity function. However, LoopGAN has only a single generator and supports an arbitrary number of domains. Instead, we build a loop of computations by applying the generator $G$ to a source image $n$ times (equal to the number of domains being modeled). This constitutes loop consistency and allows us to reduce the set of possible transformations learned to those that adhere to the consistency condition. Loop consistency takes the form of an $\ell_1$ reconstruction objective for a domain $X_i$ as:

$$\mathcal{L}_{\text{loop}}(G, X_i) = \mathbb{E}_{x_i \sim p_{\text{data}}(x_i)}\big[\lVert G^{n}(x_i, i) - x_i \rVert_1\big]$$

where $G^{n}(x_i, i)$ denotes $n$ successive applications of $G$ starting from domain $X_i$.
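The loop-consistency idea can be illustrated with a small pure-Python sketch. It is not the authors' implementation: images are flat lists of numbers, `G(x, i)` is any callable that moves `x` from domain `i` to the next domain, and the mean absolute difference is used as a normalized stand-in for the L1 norm (an assumption).

```python
def l1_distance(x, y):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def loop_consistency_loss(batch, i, n, G):
    """Average L1 distance between each input and its image after n steps of G."""
    total = 0.0
    for x in batch:
        y, domain = x, i
        for _ in range(n):
            y = G(y, domain)
            domain = domain % n + 1  # 1-indexed wrap-around
        total += l1_distance(y, x)
    return total / len(batch)


# Toy generators (assumed): per-domain brightness shifts.
closed_shifts = {1: 1.0, 2: 2.0, 3: -3.0}   # shifts cancel over a full loop
closed_G = lambda x, d: [v + closed_shifts[d] for v in x]
drifting_G = lambda x, d: [v + 1.0 for v in x]  # never returns to the input

batch = [[0.0, 0.5], [1.0, 2.0]]
print(loop_consistency_loss(batch, 2, 3, closed_G))
print(loop_consistency_loss(batch, 2, 3, drifting_G))
```

A generator whose per-step changes cancel over a full loop incurs no penalty regardless of the starting domain, while one that drifts is penalized in proportion to the accumulated change.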

### 3.3 Full Objective

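The combined loss of LoopGAN over both adversarial and loop-consistency losses can be written as:

$$\mathcal{L}(G, D) = \mathcal{L}_{\text{adv}}(G, D) + \lambda\, \mathbb{E}_{i \sim p(i)}\big[\mathcal{L}_{\text{loop}}(G, X_i)\big] \tag{1}$$

where $D = \{D_1, \ldots, D_n\}$ is the set of discriminators and $\lambda$ weighs the trade-off between adversarial and loop consistency losses.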

An example instantiation of our framework for one loop in a four-domain problem is shown in Figure 1.

## 4 Implementation

### 4.1 Network Architecture

We adopt the network architecture for style transfer proposed in (johnson2016perceptual) as our generator. This architecture has three main components: a down-sampling module $\mathrm{Enc}$, a sequence of residual blocks $T$, and an up-sampling module $\mathrm{Dec}$. The generator is therefore the composition $G(x, i) = \mathrm{Dec}(T(\mathrm{Enc}(x), i))$, where the dependence of $T$ on $i$ only relates to the step-specific AdaIN parameters (huang2017arbitrary) while all other parameters are independent of $i$. Following the notation of (johnson2016perceptual; zhu2017unpaired), let c7-k denote a $7 \times 7$ Conv-ReLU layer with k filters and stride 1, dk a $3 \times 3$ Conv-ReLU layer with k filters and stride 2, Rk a residual block with two $3 \times 3$ Conv-AdaIN-ReLU layers with k filters each, and uk a $3 \times 3$ fractional-strided-Conv-LayerNorm-ReLU layer with k filters and stride $\frac{1}{2}$. The layer compositions of the modules are: down-sampling: c7-32, d64, d128; residual blocks: R128 $\times$ 6; up-sampling: u128, u64, c7-3. We use the same PatchGAN discriminator architecture as (zhu2017unpaired): c4-64, c4-128, c4-256, c4-1, where c4-k denotes a $4 \times 4$ Conv-InstanceNorm-LeakyReLU(0.2) layer with k filters and stride 2.
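As a sanity check on the layer listing, the sketch below traces the feature-map shape through the generator. Several things here are assumptions rather than facts from the text: padding is taken to make each layer scale the spatial size by exactly the inverse of its stride, the `uk` layers are taken to have stride 1/2 (the usual meaning of fractional-strided convolution), and the 128 x 128 input size is purely illustrative. `GENERATOR_LAYERS` and `trace_shapes` are hypothetical names.

```python
from fractions import Fraction

# (name, output channels, stride); strides below 1 denote up-sampling.
GENERATOR_LAYERS = (
    [("c7-32", 32, Fraction(1)), ("d64", 64, Fraction(2)), ("d128", 128, Fraction(2))]
    + [("R128", 128, Fraction(1))] * 6
    + [("u128", 128, Fraction(1, 2)), ("u64", 64, Fraction(1, 2)), ("c7-3", 3, Fraction(1))]
)


def trace_shapes(height, width, layers=GENERATOR_LAYERS):
    """Return (name, channels, height, width) after every layer.

    Assumes each layer scales the spatial size by exactly 1/stride and that
    the input height and width are divisible by 4.
    """
    shapes = []
    h, w = Fraction(height), Fraction(width)
    for name, channels, stride in layers:
        h, w = h / stride, w / stride
        shapes.append((name, channels, int(h), int(w)))
    return shapes


for row in trace_shapes(128, 128):
    print(row)
```

Under these assumptions the residual blocks operate at one quarter of the input resolution with 128 channels, and the final `c7-3` layer restores a three-channel image at the input resolution.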

### 4.2 Recurrent Transformation

Suppose we wish to translate some $x_i \in X_i$ to another domain $X_j$. A naive approach would formulate this as repeated application of $G$, $k$ times, where $k = (j - i) \bmod n$ is the number of steps from $X_i$ to $X_j$. However, referencing our definition of $G$, we can unroll this to find that we must apply the down-sampling module $\mathrm{Enc}$ and the up-sampling module $\mathrm{Dec}$ $k$ times each throughout the computation. Yet $\mathrm{Enc}$ and $\mathrm{Dec}$ are only responsible for bringing an observation into and out of the space of the residual blocks $T$. This is not only a waste of computation when we only require an output at $X_j$, but it also has serious implications for the ability of gradients to propagate through the computation. Therefore, we implement the translation as a single application of $\mathrm{Enc}$, $k$ applications of $T$, and a single application of $\mathrm{Dec}$. $T$ is applied recurrently and the entire generator is of the form:

$$G_{i \to j}(x_i) = \mathrm{Dec}\big(T^{k}(\mathrm{Enc}(x_i))\big)$$

We show in our ablation studies that this re-formulation is critical to the learning process and the resulting quality of the transformations learned. Additionally, $T$ is given a set of separate, learnable normalization (AdaIN (huang2017arbitrary)) parameters that it selects based on the current step, with all other parameters of $T$ being stationary across time steps. The overall architecture is shown in Figure 2.
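The difference between the naive and the recurrent formulation can be seen by counting module invocations. In the sketch below, `enc`, `transform`, and `dec` are placeholder callables standing in for the down-sampling module, the residual blocks, and the up-sampling module; the step-specific AdaIN parameters are omitted for brevity, and all function names are hypothetical.

```python
def naive_translate(x, k, enc, transform, dec):
    """Apply the full generator k times: enc and dec run at every step."""
    for _ in range(k):
        x = dec(transform(enc(x)))
    return x


def recurrent_translate(x, k, enc, transform, dec):
    """Encode once, apply the residual transform k times, decode once."""
    h = enc(x)
    for _ in range(k):
        h = transform(h)
    return dec(h)


def count_calls(translate, k):
    """Count how often each module is invoked for a k-step translation."""
    calls = {"enc": 0, "transform": 0, "dec": 0}

    def counted(name):
        def module(value):
            calls[name] += 1
            return value  # identity stand-in for a real network
        return module

    translate(0, k, counted("enc"), counted("transform"), counted("dec"))
    return calls


print(count_calls(naive_translate, 3))
print(count_calls(recurrent_translate, 3))
```

For a three-step translation the naive form runs the encoder and decoder three times each, whereas the recurrent form runs each of them once, so the path that gradients must traverse is correspondingly shorter.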

### 4.3 Training

For all datasets, the loop-consistency loss coefficient $\lambda$ is set to 10. We use the Adam optimizer (kingma2014adam) with an initial learning rate of 0.0002, , and . We train on the face aging dataset and the Alps seasons dataset for 50 and 70 epochs, respectively, at the initial learning rate, and then linearly decay the learning rate to 0 over 10 epochs for both datasets.
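The learning-rate schedule described above can be sketched as a small function. This assumes that the 10 decay epochs follow the constant phase (50 or 70 epochs) rather than being included in it, and that the decay reaches exactly zero at the end of the last decay epoch; `learning_rate` is a hypothetical helper and epochs are counted from 0.

```python
def learning_rate(epoch, constant_epochs, decay_epochs=10, base_lr=0.0002):
    """Learning rate for a 0-indexed epoch.

    The rate stays at base_lr for constant_epochs epochs and then falls
    linearly, reaching 0 once decay_epochs further epochs have elapsed.
    """
    if epoch < constant_epochs:
        return base_lr
    progress = (epoch - constant_epochs) / decay_epochs
    return base_lr * max(0.0, 1.0 - progress)


# Face aging setting: 50 constant epochs followed by 10 decay epochs.
for epoch in (0, 49, 50, 55, 59, 60):
    print(epoch, learning_rate(epoch, constant_epochs=50))
```

Passing `constant_epochs=70` gives the corresponding schedule for the Alps seasons dataset.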

## 5 Experiments

We apply LoopGAN to two very different sequential image generation tasks: face aging and changing seasons of scenery pictures. Baselines are built with two bi-domain models, CycleGAN (zhu2017unpaired) and UNIT (liu2017unsupervised), and with a general-purpose multi-domain model, StarGAN (choi2018stargan). We are interested in the sequential transformation capabilities of separately trained bi-domain models compared to LoopGAN. Therefore, for each of the two bi-domain models, we train a separate model between every pair of sequential domains, i.e. $X_i$ and $X_{i+1}$, and additionally train a model between every pair of (not necessarily sequential) domains $X_i$ and $X_j$ ($i \neq j$). The first approach allows us to build a baseline for sequential generation by chaining the (separately learned) models in the necessary order. For instance, if we have four domains A, B, C, D, then we can train four separate CycleGAN (or UNIT) models, $A \leftrightarrow B$, $B \leftrightarrow C$, $C \leftrightarrow D$, and $D \leftrightarrow A$, and compose them to replicate the desired sequential transformation. Additionally, we can train direct versions, e.g. $A \leftrightarrow C$, of CycleGAN (or UNIT) for a more complete comparison against LoopGAN. We refer to composed versions of separately trained models as Chained-CycleGAN and Chained-UNIT, depending on the base translation method used. Since StarGAN (choi2018stargan) inherently allows transformation between any two domains, we can apply it in a chained or direct manner without training any additional models.
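The two baseline configurations differ in which domain pairs receive their own bi-domain model. The following sketch enumerates them for the four-domain example; the helper names are hypothetical, and the chained configuration is assumed to include the pair that closes the loop (last domain back to first), in line with the four models mentioned above.

```python
from itertools import combinations


def chained_pairs(domains):
    """Consecutive pairs along the loop, including last -> first."""
    count = len(domains)
    return [(domains[k], domains[(k + 1) % count]) for k in range(count)]


def direct_pairs(domains):
    """Every unordered pair of distinct domains."""
    return list(combinations(domains, 2))


def chain_path(source, target, domains):
    """Sequence of pairwise models to compose for a chained translation."""
    path, k = [], domains.index(source)
    while domains[k] != target:
        nxt = (k + 1) % len(domains)
        path.append((domains[k], domains[nxt]))
        k = nxt
    return path


domains = ["A", "B", "C", "D"]
print(chained_pairs(domains))         # four sequential models
print(len(direct_pairs(domains)))     # six models, one per pair
print(chain_path("C", "A", domains))  # models composed to go from C to A
```

For `n` domains the chained baseline needs `n` models while the direct baseline needs `n * (n - 1) / 2`, which is the quadratic growth that a single shared generator avoids.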

### 5.1 Face Aging

We adopt the UTKFace dataset (zhifei2017cvpr) for modeling the face aging task. It consists of over 20,000 face-only images of different ages. We divide the dataset into four groups in order of increasing age according to the ground-truth age given in the dataset: A contains ages 10-20, B ages 25-35, C ages 40-50, and D ages 50-100. The groups contain 1531, 5000, 2245, and 4957 images, respectively, and a 95/5 train/test split is made. The results of LoopGAN generation are shown on the left side of Figure 3.
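The grouping above can be expressed as a small lookup. This is an illustrative sketch only: the ranges are treated as inclusive, an age of exactly 50 is assigned to the first matching group (the stated ranges for C and D both include it, and the text does not say how this is resolved), and ages that fall between groups are left unassigned. `AGE_GROUPS`, `GROUP_SIZES`, and `age_group` are hypothetical names.

```python
AGE_GROUPS = {"A": (10, 20), "B": (25, 35), "C": (40, 50), "D": (50, 100)}
GROUP_SIZES = {"A": 1531, "B": 5000, "C": 2245, "D": 4957}


def age_group(age, groups=AGE_GROUPS):
    """Return the first group whose inclusive range contains age, else None."""
    for name, (low, high) in groups.items():
        if low <= age <= high:
            return name
    return None


print([age_group(age) for age in (15, 22, 30, 45, 50, 75)])
print(sum(GROUP_SIZES.values()))  # images used across the four groups
```

The four groups together use fewer images than the full dataset, which is consistent with the gaps between the age ranges.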

LoopGAN shows advantages over the baseline models in two respects. First, the overall facial structure is preserved, which we believe is due to the enforced loop consistency loss. Second, LoopGAN makes more apparent age changes than the baseline models.

In order to quantitatively compare the amount of age change between models, we obtain an age distribution of the generated images by running a pre-trained age estimator, DEX (rothe2015dex). The estimated age distributions of generated images (from input test images) are compared against those of the training images in Figure 4. The age distribution of LoopGAN-generated images is closer to that of the training images across all four age groups than those of the baseline models, suggesting that it more faithfully learns the sequential age distribution changes of the training data.

### 5.2 Changing Seasons

We use the scenery photos of the Alps across four seasons collected by (anoosheh2018combogan). They are ordered into a sequence starting from Spring (A), to Summer (B), Fall (C), and Winter (D). Each season has approximately 1700 images, divided into training and test sets with a 95/5 split.

We show the results in Figure 3. Overall, LoopGAN is able to make drastic season changes while maintaining the overall structure of the input scenery images. To further quantify the generation results, we conducted a user study on Amazon Mechanical Turk (AMT). Table 1 shows that LoopGAN generations are preferred by human users for season change and overall, while Chained-UNIT is preferred for face aging.

Table 1: AMT user study results.

Model | Face Aging | Season Change | Overall |
---|---|---|---|
CycleGAN | 7.50% | 11.25% | 9.375% |
Chained-CycleGAN | 17.50% | 13.75% | 15.625% |
UNIT | 12.50% | 16.25% | 14.375% |
Chained-UNIT | 27.50% | 20.00% | 23.750% |
StarGAN | 11.25% | 10.00% | 10.625% |
LoopGAN (ours) | 23.75% | 28.75% | 26.250% |
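The user-study numbers can be checked for internal consistency. The sketch below assumes, without confirmation from the text, that the Overall column is the unweighted mean of the two task columns and that each task column is a share of votes summing to 100%; `PREFERENCES` simply transcribes the table.

```python
import math

# model: (face aging %, season change %, overall %), transcribed from Table 1
PREFERENCES = {
    "CycleGAN": (7.50, 11.25, 9.375),
    "Chained-CycleGAN": (17.50, 13.75, 15.625),
    "UNIT": (12.50, 16.25, 14.375),
    "Chained-UNIT": (27.50, 20.00, 23.750),
    "StarGAN": (11.25, 10.00, 10.625),
    "LoopGAN (ours)": (23.75, 28.75, 26.250),
}


def overall_is_mean(table):
    """Check that every overall score equals the mean of the two task scores."""
    return all(
        math.isclose((face + season) / 2, overall)
        for face, season, overall in table.values()
    )


def column_sum(table, column):
    """Sum one column (0 = face aging, 1 = season change, 2 = overall)."""
    return sum(row[column] for row in table.values())


def best_model(table, column):
    """Name of the model with the highest score in the given column."""
    return max(table, key=lambda name: table[name][column])


print(overall_is_mean(PREFERENCES))
print(column_sum(PREFERENCES, 0), column_sum(PREFERENCES, 1))
print(best_model(PREFERENCES, 0), "|", best_model(PREFERENCES, 1), "|", best_model(PREFERENCES, 2))
```

Both assumptions hold for the tabulated values, and the per-column maxima show which model is preferred on each task.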

### 5.3 Additional Datasets

To showcase the universality of our model, we apply LoopGAN to two additional datasets in four different sequential transformation tasks: chairs with different azimuth angles, and gradual change of face attributes in degree of smiling, gender feature, and hair color. The chairs dataset (aubry2014seeing) comes with ground truth azimuth angle and is divided into four sets each containing chairs facing a distinct direction. To obtain linear manifolds for the face attributes, we train a binary classifier with 0/1 labels available for each attribute (liu2018large) and use the predicted probability to determine the position of an image on an attribute manifold. The results are shown in Figure 5.
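One way to turn predicted attribute probabilities into sequential domains is to bin them. The sketch below is an assumption-heavy illustration: the text does not state the binning rule or the number of domains for the face-attribute tasks, so equal-width bins and four domains are chosen here only for concreteness, and `assign_domains` is a hypothetical helper.

```python
def assign_domains(probabilities, num_domains=4):
    """Map each predicted probability in [0, 1] to a domain index 1..num_domains.

    Uses equal-width bins (an assumption); a probability of exactly 1.0
    falls in the last bin.
    """
    domains = []
    for p in probabilities:
        index = min(int(p * num_domains), num_domains - 1)
        domains.append(index + 1)
    return domains


# Hypothetical classifier outputs for a "smiling" attribute.
print(assign_domains([0.05, 0.30, 0.50, 0.99, 1.00]))
```

Images with low predicted probability land in the first domain and images with high predicted probability in the last, giving an ordering along the attribute.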

We experiment with several network architecture variations and investigate their effect on generation quality. First, attention mechanisms have proven useful in GAN image generation (zhang2018self). We added an attention mechanism in both the space and time dimensions (wang2018non); however, we found that the network struggles to generate high-quality images after adding this type of attention. We also noticed that (huang2018multimodal) recommends using no normalization for down-sampling, to preserve information from the input image, and layer normalization for up-sampling, for faster training and higher quality. We applied these changes and found that they indeed help the network produce better results. The results under these variations are shown in Figure 6.a (first three rows).

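Model | Parameter Count |
---|---|
CycleGAN | 94.056 M * |
Chained-CycleGAN | 62.704 M * |
UNIT | 133.680 M * |
Chained-UNIT | 89.120 M * |
StarGAN | 8.427 M |
LoopGAN (ours) | 11.008 M |

Figure 6: (a) generation results under architecture variations; (b) generator parameter counts (table above).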

Moreover, we show the importance of the recurrent form of the residual blocks $T$ discussed in Section 4.2. We compare invoking the down-sampling and up-sampling modules at each time step against applying them once with some number of recurrent applications of $T$ in Figure 6.a (last row), and show the poor quality observed when performing the loop naively.

Lastly, we calculate the parameter count of generator networks compared in the face aging and season change experiments above and show that our final generator network architecture is parameter-efficient compared to baseline models in Figure 6.b.
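The starred parameter counts appear consistent with summing the generators of all separately trained bi-domain models. The arithmetic below rests on assumptions not stated in the text: four domains, six models for the direct variant and four for the chained variant, and two generators per bi-domain model. `per_generator` is a hypothetical helper.

```python
import math


def per_generator(total_millions, num_models, generators_per_model=2):
    """Implied size of one generator, in millions of parameters."""
    return total_millions / (num_models * generators_per_model)


# Totals from Figure 6.b; model counts assume four domains.
cyclegan_direct = per_generator(94.056, num_models=6)
cyclegan_chained = per_generator(62.704, num_models=4)
unit_direct = per_generator(133.680, num_models=6)
unit_chained = per_generator(89.120, num_models=4)

print(round(cyclegan_direct, 3), round(cyclegan_chained, 3))
print(round(unit_direct, 3), round(unit_chained, 3))
print(round(94.056 / 11.008, 2))  # direct CycleGAN total relative to LoopGAN
```

Under these assumptions the direct and chained totals imply the same per-generator size for each baseline, which supports reading the starred rows as sums over multiple models.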

For completeness, we also include a selection of failure cases in Figure 7.

## 6 Conclusion

We proposed an extension to the family of image-to-image translation methods when the set of domains corresponds to a sequence of domains. We require that the translation task can be modeled as a consistent loop. This allows us to use a shared generator across all time steps leading to significant efficiency gains over a naïve chaining of bi-domain image translation architectures. Despite this, our architecture shows favorable results when compared with the classic CycleGAN family algorithms.

#### Acknowledgments

This work is supported by NSF IIS-1717431 and NSF IIS-1618477. We thank Northrop Grumman for the gift funds. The authors thank Weijian Xu and Jun-Yan Zhu for valuable discussions.