Improved Techniques for GAN based Facial Inpainting
Abstract
In this paper we present several architectural and optimization recipes for generative adversarial network (GAN) based facial semantic inpainting. Current benchmark models are susceptible to the initial solutions of the non-convex optimization criterion of GAN-based inpainting. We present an end-to-end trainable parametric network which deterministically starts from good initial solutions, leading to more photo-realistic reconstructions with a significant optimization speedup. For the first time, we show how to efficiently extend GAN-based single image inpainter models to sequences by a) learning to initialize a temporal window of solutions with a recurrent neural network and b) imposing a temporal smoothness loss (during iterative optimization) to respect the redundancy in the temporal dimension of a sequence. We conduct comprehensive empirical evaluations on CelebA images and pseudo sequences, followed by real life videos of the VidTIMIT dataset. The proposed method significantly outperforms the current GAN-based state-of-the-art in terms of reconstruction quality with a simultaneous speedup of over 15×. We also show that our proposed model better preserves facial identity in a sequence, even without explicitly using any face recognition module during training.
1 Introduction
Semantic inpainting is the challenging task of recovering large corrupted areas of an object based on higher level image semantics. Classical inpainting methods [1, 2, 3, 4, 5] rely on low level cues to find best matching patches from the uncorrupted sections of the same image. Such a 'copy-paste' policy works well for background completion (sky, grass, mountains). However, completing a complex object such as a human face is far more challenging, because the assumption of finding patches of similar appearance does not always hold. A facial image comprises numerous unique components which, if damaged, cannot be matched with any other facial parts. An alternative is to use external reference datasets [6]. Though this paradigm enables finding similar matching patches, the low level [1] and mid level [2] features of matched patches are not sufficient to infer valid semantics for the missing regions.
Recently, Yeh et al. [7] leveraged the advancement in generative modeling with Generative Adversarial Networks (GANs) [8]. Here, a neural network, often termed the 'generator', is trained to generate semantically realistic faces starting from a latent vector drawn from a known prior distribution. [7] is the current benchmark for semantic inpainting of faces; it outperforms Context Encoders [9], which was primarily designed for feature learning via inpainting. In this paper, we take the model of Yeh et al. as our baseline and incorporate several architectural and optimization novelties to improve inpainting quality and optimization speed, and to adapt the model to inpaint sequences. Our application area is face inpainting. Specifically, our contributions can be summarized as follows:

We show that, for single image inpainting, initializing the GAN-based iterative non-convex optimization criterion (Eq. 2) with a learned parametric neural network (Sec. 4.1) results in more photo-realistic initial reconstructions (Fig. (a)) compared to the state-of-the-art GAN-based single image inpainter with random initialization.

To the best of our knowledge, this is the first demonstration of extending a single image GAN-based inpainter to sequences. For this, we design a recurrent neural network architecture (Sec. 4.2.1) for jointly initializing solutions for a group of frames. This design choice learns the scene dynamics, leading to temporally more consistent initial solutions.

In a sequence, we exploit the redundancy of the temporal dimension with a smoothness loss (Sec. 4.2.2) which constrains the final joint iterative solutions of a group of neighboring frames to lie close to each other in Euclidean space. The smoothness loss is not only better at enforcing temporal consistency (Sec. 5.2.2) but is also more effective at preserving the facial identity (Sec. 5.2.4) of the subject compared to the baseline.

We present comprehensive empirical evaluations on CelebA images and pseudo sequences, followed by real life facial videos from the VidTIMIT dataset. In all cases, our proposed model significantly outperforms the current benchmark baseline in terms of visual reconstruction quality, with an average speedup of over 15×.
2 Background on GANs
Proposed by Goodfellow et al. [8], a GAN model consists of two parametrized deep neural nets, viz., a generator, G, and a discriminator, D. The task of the generator is to yield an image, G(z), with a latent vector, z, as input. z is sampled from a known distribution, p_z(z); a common choice is a uniform distribution, z ∼ U(−1, 1). The discriminator is pitted against the generator to distinguish real samples (drawn from p_data) from fake/generated samples. Specifically, the discriminator and generator play the following minimax game on V(D, G):
min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]    (1)
With enough capacity, on convergence, G fools D at random [8].
3 Approach
3.1 GAN based semantic inpainting
We begin by reviewing the current state-of-the-art GAN-based single image inpainting model of Yeh et al. [7], which serves as our reference baseline. Given a damaged image, y, and a pretrained GAN model, the idea is to iteratively find the 'closest' latent vector, ẑ (starting from a random z), which results in a reconstructed image, G(ẑ), whose semantics are similar to the corrupted image. ẑ is optimized as,
ẑ = arg min_z { L_c(z) + λ L_p(z) }    (2)
L_c(z) is the contextual loss, which penalizes the mismatch between the original and reconstructed images over the non-corrupted pixels,
L_c(z) = ||M ⊙ G(z) − M ⊙ y||_1    (3)
where ⊙ is the Hadamard product operator and M is a binary mask with M = 1 for uncorrupted pixels and 0 otherwise. λ is a trade-off between the two components of the loss. L_p(z) is the perceptual loss and is a measure of realism of the inpainted output. The pretrained discriminator is leveraged for assigning this realism loss, defined as,
L_p(z) = log(1 − D(G(z)))    (4)
Since D(G(z)) gives the probability of G(z) being sampled from real images, Eq. 4 drives the solution of Eq. 2 to lie near the natural image manifold. Upon convergence, the inpainted image, x̂, is given as x̂ = M ⊙ y + (1 − M) ⊙ G(ẑ). Architectures of G and D are provided in the supplemental document.
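As an illustration, the objective of Eqs. 2–4 and the final blending step can be sketched in a few lines of Python. This is a toy sketch, not the authors' implementation: images are flat pixel lists, the trade-off value `lam` is a placeholder, and `d_prob` stands in for the output of the pretrained discriminator.

```python
import math

def contextual_loss(gen_img, corrupted_img, mask):
    """L_c (Eq. 3): L1 mismatch over the uncorrupted pixels (mask == 1)."""
    return sum(m * abs(g - y) for g, y, m in zip(gen_img, corrupted_img, mask))

def perceptual_loss(d_prob):
    """L_p (Eq. 4): log(1 - D(G(z))); low when D believes the sample is real."""
    return math.log(1.0 - d_prob)

def inpainting_loss(gen_img, corrupted_img, mask, d_prob, lam=0.1):
    """Total objective of Eq. 2: L_c + lambda * L_p (lambda is a placeholder)."""
    return contextual_loss(gen_img, corrupted_img, mask) + lam * perceptual_loss(d_prob)

def blend(gen_img, corrupted_img, mask):
    """Final inpainted image: keep known pixels of y, fill holes from G(z_hat)."""
    return [m * y + (1 - m) * g for g, y, m in zip(gen_img, corrupted_img, mask)]
```

In the actual method, `inpainting_loss` is what gets minimized over z by backpropagation through the frozen generator, and `blend` produces the reported result.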
4 Single image inpainting
4.1 Initializing the z vector for single image inpainting
The iterative optimization procedure of Yeh et al. [7] in Eq. 2 yields different results based on the random initialization of z; this is mainly attributed to the non-convexity of the optimization space. Such random initialization also requires more iterations to converge (Sec. 5.1.2) compared to a good initialization of z. Both problems can be mitigated if we learn to estimate a good z vector directly from the damaged image, y, by feed-forward mapping through a deep neural net, I(·; θ). The parameter set, θ, is optimized to minimize some distance metric, d:
θ̂ = arg min_θ Σ_i d(x^(i), G(I(y^(i); θ)))    (5)
where y^(i) is the i-th corrupted image in the dataset. Though Eq. 2 and Eq. 5 are functionally the same, prediction with a learned parametric network tends to perform better than ad hoc iterative optimization. This is because, as training evolves, the network learns to adapt its parameters to map images with closely matching appearances to similar z vectors. A parameter update for a given image thus implicitly generalizes to images with similar characteristics. We formulate the loss function, d, as,
d = ||x − G(I(y; θ))||₂² + η log(1 − D(G(I(y; θ))))    (6)
The first component of the loss is a mean squared error (MSE) between the original and inpainted images. The second component is the same as the perceptual loss defined in Eq. 4. The MSE loss helps recover the global low frequency components of an image, while the perceptual loss refines the result further by incorporating detailed high frequency texture components. The parameter, η, strikes a balance between the two components of the loss.
4.2 Extending to series of frames
4.2.1 Initialization with a recurrent model
The naive approach of applying the formulation of [7] to sequences is to inpaint individual frames independently. However, such an approach fails to leverage the temporal redundancy inherent in any sequence. For sequences, we therefore propose a Recurrent Neural Network (RNN) to jointly initialize the z vectors for an entire group of frames. An RNN maintains a hidden state that summarizes the information observed up to a given time step. The hidden state is updated by looking at the previous hidden state and the current corrupted frame, leading to more consistent reconstructions in terms of appearance.
Since vanilla RNNs suffer from the vanishing gradient problem [10] and are unable to capture long-range dependencies, we use Long Short-Term Memory (LSTM) networks [11]. LSTMs have produced state-of-the-art results in sequential tasks such as machine translation [12, 13] and sequence generation [14, 15].
In Fig. 3, we show the LSTM based network architecture for initializing a given group of frames. Let Y = {y_1, …, y_K} be a sequence of K corrupted successive frames. Similar to [16], each frame is passed through a weight-shared CNN descriptor module; our CNN's architecture is the same as that of I. Each damaged frame, y_t, is thereby represented by a latent descriptor, which is passed as input to the LSTM module at time step t, updating the hidden state, h_t, and the cell memory, c_t, of the LSTM. The hidden state is used to obtain the initial z_t vector, which is passed through the pretrained (and frozen) generator, G, to output the initial reconstructed frame, G(z_t). The MSE loss between the original frame, x_t, and G(z_t) is minimized w.r.t. all the parameters of the LSTM and the CNN descriptor network. Further details are provided in the supplemental document.
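The recurrent initialization scheme can be illustrated with a toy rollout, where the CNN descriptor, LSTM update, and hidden-state-to-z projection are replaced by trivial hand-rolled stand-ins; all function names, weights, and shapes here are illustrative placeholders, not the actual architecture.

```python
def cnn_descriptor(frame):
    """Stand-in for the weight-shared CNN: reduce a frame (flat pixel
    list) to a single scalar descriptor."""
    return sum(frame) / len(frame)

def recurrent_step(h_prev, feat, w_h=0.5, w_x=0.5):
    """Stand-in for the LSTM update: the new hidden state mixes the
    running history with the current frame's descriptor."""
    return w_h * h_prev + w_x * feat

def init_group(frames):
    """Jointly produce initial z values for a window of corrupted frames.
    Unlike per-frame initialization, each z depends on all frames seen
    so far, which is what encourages temporally consistent starts."""
    h, zs = 0.0, []
    for y_t in frames:
        h = recurrent_step(h, cnn_descriptor(y_t))
        zs.append(h)  # stand-in for the hidden-state -> z projection
    return zs
```

Even in this toy form, the hidden state carries appearance information forward, so identical frames yield z values that converge toward one another instead of being drawn independently.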
4.2.2 Temporal smoothness loss (L_T)
The initialization method using the above recurrent model ensures that the initial solutions respect the smooth transition of scene dynamics. However, if we subsequently optimize each frame independently, the final solutions become unconstrained and manifest abrupt changes of facial appearance/expressions. To mitigate this, we jointly optimize a window of frames to encourage the final reconstructions to respect smooth appearance transitions. The disparity between two inpainted images, G(z_i) and G(z_j), can be approximated by the Euclidean distance between their latent vectors, ||z_i − z_j||₂ [17]. With this approximation, we define
L_T = Σ_{i=1}^{K} Σ_{j=i+1}^{K} ||z_i − z_j||₂²    (7)
This can be seen as a summation of the squared L2 distances between all pairwise combinations of z vectors of the inpainted frames within a window of K frames. In Sec. 5.2.2 we show the importance of the temporal smoothness loss in yielding a more consistent set of frames (along the temporal dimension) compared to the straightforward per-frame application of Yeh et al. [7].
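The pairwise sum of Eq. 7 reduces to a few lines; a minimal sketch with z vectors as plain Python lists:

```python
from itertools import combinations

def temporal_smoothness_loss(z_vectors):
    """L_T (Eq. 7): summed squared Euclidean distance over all pairwise
    combinations of z vectors within the window of K frames."""
    total = 0.0
    for z_i, z_j in combinations(z_vectors, 2):
        total += sum((a - b) ** 2 for a, b in zip(z_i, z_j))
    return total
```

When the z vectors of a window are identical the loss is zero, which is exactly the ideal case for a pseudo sequence discussed in Sec. 5.2.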
5 Experiments
5.1 Single image inpainting
Dataset
We evaluate our method on the CelebA [18] dataset, which comprises 202,599 facial images with coarse alignment. Following the protocol of [7], we use 2,000 images for testing inpainting performance; the remaining images are used for training the GAN. Following face detection, facial bounding boxes are center-cropped to 64×64 and 128×128 resolutions.
Effect of initialization of the z vector
In Fig. (a) we show the benefit of initializing the z vector with a parametrized network as discussed in Sec. 4.1. As evident, a random initialization yields a solution which lies distinctly away from the natural face manifold. On the other hand, our parametrized network learns to predict the latent vector by respecting the structural and textural statistics of the uncorrupted pixels. One major advantage of a good initialization is the speedup of the iterative optimization of Eq. 2. In Fig. 4, we show exemplary convergence rates of the two components of Eq. 2. With the model initialized by our method, both the perceptual and contextual losses start at an order of magnitude lower than [7], which leads to much faster convergence. In fact, for most cases, our proposed model converges after about 50 iterations, compared to around 700 iterations with [7]; beyond this, the visual quality does not improve much. Moreover, our solution tends to converge at lower loss magnitudes, thereby yielding visually more realistic solutions (see Fig. 2). This is also evident from the peak signal-to-noise ratio (PSNR) between the original image and the final solution reported in Fig. (b) (p-value < 0.05 in all cases). It is encouraging that the performance difference is more pronounced at the higher resolution of 128×128.
5.2 Pseudo sequences
Motivation
Before directly applying our model on real facial sequences, we dedicate this section to analyzing the benefits of our novelties on what we term 'pseudo sequences'. A pseudo sequence, Y, of length K is formed by taking a single image, x, and masking it with different (or the same) corruption masks. An ideal inpainter should be agnostic to the corruption masks and yield identical reconstructions for all the frames. Since independent optimization of Eq. 2 is unconstrained, there is no explicit restriction on the z vectors to be consistent; this is an inherent drawback of the GAN-based framework of [7] when applied to sequences.
Table I

                                          Resolution @ 64×64                    Resolution @ 128×128
                                          Central   Freehand  Checkerboard     Central   Freehand  Checkerboard
Yeh et al. [7]                            22.43     22.87     20.71            22.15     20.19     19.81
Proposed (Smoothness Loss)                27.14     28.95     25.12            25.11     25.40     23.75
Proposed (LSTM init + Smoothness Loss)    28.01     29.15     25.73            26.01     26.10     25.09
Temporal Consistency
We define temporal consistency, T_c, as the mean pairwise PSNR between all possible pairs, (i, j), of inpainted frames within a pseudo sequence, Y, of length K:
T_c = (2 / (K(K − 1))) Σ_{i=1}^{K} Σ_{j=i+1}^{K} PSNR(x̂_i, x̂_j)    (8)
Eq. 8 allows us to enumerate the consistency of a generative model; ideally, identical reconstructions drive the pairwise error between frames to zero. Please note that this evaluation is not possible on real videos, because the transformation from one frame to another is not known, and thus it is not possible to align the frames to a single frame of reference without incorporating interpolation noise from a motion compensator [19]. In our results, 'Smoothness Loss' refers to the temporal smoothness loss (Eq. 7), and 'LSTM init' refers to initializing a group of 3 frames using the proposed LSTM model (Sec. 4.2.1).
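A minimal sketch of the metric, with frames as flat pixel lists in [0, 1] (a peak value of 1.0 is assumed here):

```python
import math
from itertools import combinations

def psnr(img_a, img_b, peak=1.0):
    """PSNR between two equally sized frames (flat pixel lists)."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    return float('inf') if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def temporal_consistency(frames):
    """T_c (Eq. 8): mean PSNR over all frame pairs in the pseudo sequence."""
    pairs = list(combinations(frames, 2))
    return sum(psnr(a, b) for a, b in pairs) / len(pairs)
```

Identical reconstructions make the per-pair MSE zero (PSNR unbounded), so higher T_c means a more consistent inpainter.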
Benefit of initialization with LSTM: First, in Fig. 6 we show the benefit of initializing solutions for a group of pseudo frames with the LSTM over per-frame independent initialization with I. Frames initialized with the LSTM tend to be more consistent than those initialized by I. This is attributed to the recurrent structure of the LSTM, which learns that in pseudo sequences the frames are static. Learning such temporal dynamics is not possible with I, which is tailored for single image initialization.
Consistency of final solutions: In Table I we compare the mean temporal consistency over the 2,000 pseudo sequences created from the CelebA test set with K = 3. The reported mask patterns are: a) Central: randomly corrupt 40%–70% of the central part of the image; b) Checkerboard: corrupt 50% of the image with checkerboard tile sizes drawn uniformly from {8×8, 16×16, 32×32}; c) Freehand: corrupt around 40% of the pixels with randomly hand-drawn masks. Our proposed method with the Smoothness Loss (Eq. 7) results in a more consistent sequence of inpaintings compared to the vanilla per-frame application of Yeh et al. [7]. The observations are statistically significant with p-value < 0.05 in all cases. Moreover, if we initialize the z vectors of the pseudo sequence with the LSTM model, the consistency of the sequence improves further. This can be attributed to the more consistent initialization of the z vectors by the LSTM, followed by the Smoothness Loss which maintains the similarity of the z vectors.
In Fig. 5 we visually show the advantage of our proposed modifications.
Note that a set of frames inpainted by [7] is a mixture of faces with neutral and smiling appearances, or with different levels of smiles, whereas our model yields a set with consistent facial appearance/expressions. In the context of real videos, these observations imply that there can be drastic changes of facial expression between two adjacent corrupted frames if inpainted by [7]; such abrupt changes of appearance are not common in videos. Our model, in contrast, has the promise to inpaint a group of neighboring frames with consistent appearance. Also, if inpainted by [7], the stationary portions of frames tend to show flickering effects, because the textural details are hallucinated independently on each frame.
Disparity between converged z vectors
To bolster the findings of the above section, we also study the disparity of the converged z vectors. Ideally, for a given pseudo sequence, the converged z vectors should be identical. We quantify the disparity using the temporal smoothness loss of Eq. 7. In Fig. (a) we show an exemplary plot of the decay of the smoothness loss for a pseudo sequence. The proposed method of implicitly minimizing Eq. 7 results in near identical solutions of Eq. 2 over the sequence. In contrast, the converged z vectors obtained with [7] show more variation, and the latter method is also slower to converge.
Identity preservation
It is important that a sequence of inpainted frames not only appears visually realistic but also maintains the facial identity of the subject. To evaluate this we use FaceNet embeddings [20]. FaceNet learns a parametrized network, F, which represents a given facial image as a 128-D real vector. Images of the same subject yield similar embeddings, as enumerated by the L2 distance between the embeddings. For a given sequence, Y, the identity loss, L_id, is,
L_id = (1/K) Σ_{i=1}^{K} ||F(x̂_i) − F(x)||₂    (9)
where x̂_i is the i-th inpainted frame within the pseudo sequence and x is the original uncorrupted image. In Fig. (b) we report the mean identity loss over the 2,000 pseudo sequences. Our proposed method (LSTM init + Smoothness Loss) retains the identity of a person over a sequence more faithfully than [7]. In our initial experiments, we explicitly included L_id in the optimization of Eq. 2; however, we obtained similar identity preservation with the Smoothness Loss constraint alone. The authors of [21] showed that a z vector can be semantically decomposed into an identity component and an appearance component. Since our proposed Smoothness Loss enforces similarity of the converged z vectors, identity preservation is implicitly incorporated in the process.
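A sketch of Eq. 9 with toy low-dimensional embeddings standing in for the 128-D FaceNet outputs (a real evaluation would pass each frame through the pretrained FaceNet network):

```python
import math

def identity_loss(inpainted_embeddings, original_embedding):
    """L_id (Eq. 9): mean L2 distance between the embedding of each
    inpainted frame and the embedding of the original image."""
    def l2(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return sum(l2(e, original_embedding)
               for e in inpainted_embeddings) / len(inpainted_embeddings)
```

A lower L_id means the inpainted sequence stays closer to the subject's true identity in embedding space.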
                                          Resolution @ 64×64
Approach / Subject Name                   mrj001  mwbt0  mtmr0  mtas1  mreb0  mrgg0  mdbb0  mjsw0  fjre0  fjas0
Yeh et al. [7]                            24.32   25.32  23.61  26.11  25.12  26.01  25.98  26.09  25.31  25.81
Proposed (Smoothness Loss)                26.12   27.01  25.11  27.11  26.91  26.98  27.11  27.21  27.00  27.71
Proposed (LSTM init + Smoothness Loss)    27.02   28.07  27.11  28.87  28.87  28.78  29.21  29.12  28.21  29.01

                                          Resolution @ 128×128
Approach / Subject Name                   mrj001  mwbt0  mtmr0  mtas1  mreb0  mrgg0  mdbb0  mjsw0  fjre0  fjas0
Yeh et al. [7]                            22.22   23.09  21.11  23.98  23.11  24.12  23.65  25.45  24.09  23.11
Proposed (Smoothness Loss)                24.15   25.51  23.78  25.23  25.08  25.18  25.32  25.36  25.78  25.98
Proposed (LSTM init + Smoothness Loss)    25.01   27.12  25.98  27.02  26.81  27.11  27.62  27.32  27.34  27.78
                                          Resolution @ 64×64                Resolution @ 128×128
Approach                                  Contextual  Perceptual  FaceNet   Contextual  Perceptual  FaceNet
Yeh et al. [7]                            0.25        0.13        0.28      0.41        0.20        0.68
Proposed (LSTM init + Smoothness Loss)    0.09        0.02        0.11      0.23        0.11        0.23
5.3 Experiments on VidTIMIT dataset
The experiments with pseudo sequences taught us two lessons, viz., a) LSTM based group initialization is better than independent initialization with I, and b) the temporal smoothness loss is essential for imposing temporal consistency. With these understandings, we proceed to test our model on real life facial video sequences. To the best of our knowledge, this is the first attempt at GAN-based inpainting on real videos. For this, we selected the VidTIMIT dataset [22].
6 Discussion
In this paper we proposed several innovations for better optimization of the GAN-based inpainting cost function. The study on pseudo sequences enabled ablation studies to appreciate the benefit of each component of our proposals. Since the generator was the same for both compared models, the improvements are solely due to our contributions. Finally, we bolstered our understanding with experiments on real videos. However, the performance of inpainting strongly relies on the generative model and the GAN training procedure. An immediate extension would be to improve the generative model itself to generate photo-realistic samples at higher resolutions. The recent works on stacked GANs [23] and progressive stagewise training of GANs [24] show promise toward this end. It would be interesting to integrate the innovations of this paper into such high resolution generative model pipelines.
Acknowledgments
The work is funded by a Google PhD Fellowship to Avisek.
Footnotes
 Available at: http://conradsanderson.id.au/vidtimit/#examples
References
 C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman, “Patchmatch: A randomized correspondence algorithm for structural image editing,” ACM Transactions on Graphics (TOG), vol. 28, no. 3, p. 24, 2009.
 J.-B. Huang, S. B. Kang, N. Ahuja, and J. Kopf, “Image completion using planar structure guidance,” ACM Transactions on Graphics (TOG), vol. 33, no. 4, p. 129, 2014.
 M. V. Afonso, J. M. Bioucas-Dias, and M. A. Figueiredo, “An augmented lagrangian approach to the constrained optimization formulation of imaging inverse problems,” IEEE Transactions on Image Processing, vol. 20, no. 3, pp. 681–695, 2011.
 A. A. Efros and T. K. Leung, “Texture synthesis by nonparametric sampling,” in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2. IEEE, 1999, pp. 1033–1038.
 C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman, “Patchmatch: A randomized correspondence algorithm for structural image editing,” ACM Transactions on Graphics (TOG), vol. 28, no. 3, p. 24, 2009.
 J. Hays and A. A. Efros, “Scene completion using millions of photographs,” in ACM Transactions on Graphics (TOG), vol. 26, no. 3. ACM, 2007, p. 4.
 R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do, “Semantic image inpainting with deep generative models,” in CVPR, 2017, pp. 5485–5493.
 I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014, pp. 2672–2680.
 D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in CVPR, 2016, pp. 2536–2544.
 Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
 S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
 I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
 A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint arXiv:1308.0850, 2013.
 A. Kumar Jain, A. Agarwalla, K. Krishna Agrawal, and P. Mitra, “Recurrent memory addressing for describing videos,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
 O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015, pp. 3156–3164.
 J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, “Generative visual manipulation on the natural image manifold,” in ECCV. Springer, 2016, pp. 597–613.
 Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3730–3738.
 J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi, “Real-time video super-resolution with spatio-temporal networks and motion compensation,” CVPR, 2016.
 F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
 C. Donahue, A. Balsubramani, J. McAuley, and Z. C. Lipton, “Semantically decomposing the latent spaces of generative adversarial networks,” arXiv preprint arXiv:1705.07904, 2017.
 C. Sanderson and B. C. Lovell, “Multi-region probabilistic histograms for robust and scalable identity inference,” in International Conference on Biometrics. Springer, 2009, pp. 199–208.
 X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie, “Stacked generative adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2017, p. 4.
 T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.