Image Inpainting via Generative Multi-column Convolutional Neural Networks
In this paper, we propose a generative multi-column network for image inpainting. This network synthesizes different image components in a parallel manner within one stage. To better characterize global structures, we design a confidence-driven reconstruction loss, while an implicit diversified MRF regularization is adopted to enhance local details. The multi-column network, combined with the reconstruction and MRF losses, propagates local and global information derived from context to the target inpainting regions. Extensive experiments on challenging street view, face, natural object and scene datasets show that our method produces visually compelling results even without the post-processing that was previously common.
1 Introduction
Image inpainting (also known as image completion) aims to estimate suitable pixel information to fill holes in images. It serves various applications such as object removal, image restoration, image denoising, to name a few. Though studied for many years, it remains an open and challenging problem since it is highly ill-posed. In order to generate realistic structures and textures, researchers resort to auxiliary information, from either surrounding image areas or external data.
A typical inpainting method exploits pixels under certain patch-wise similarity measures and must address three important problems: (1) extracting suitable features to evaluate patch similarity; (2) finding neighboring patches; and (3) aggregating auxiliary information.
Features for Inpainting
Suitable feature representations are very important for building connections between missing and known areas. In contrast to traditional patch-based methods using hand-crafted features, recent learning-based algorithms learn from data. From the model perspective, inpainting requires understanding of global information. For example, only by seeing the entire face can the system determine the positions of the eyes and nose, as shown in the top-right of Figure 1. On the other hand, pixel-level details are crucial for visual realism, e.g., the texture of the skin or facade in Figure 1.
Recent CNN-based methods utilize encoder-decoder pathak2016context (); yeh2017semantic (); yang2017high (); iizuka2017globally (); yu2018generative () networks to extract features and achieve impressive results. But there is still much room to consider features as a group of different components and combine both global semantics and local textures.
Reliable Similar Patches
In both exemplar-based he2012statistics (); he2014image (); criminisi2004region (); sun2005image (); jia2003image (); jia2004inference (); barnes2009patchmatch () and recent learning-based methods pathak2016context (); yeh2017semantic (); yang2017high (); iizuka2017globally (); yu2018generative (), explicit nearest-neighbor search is one of the key components for generating realistic details. When missing areas originally contain structure different from the context, the found neighbors may harm the generation process. Nearest-neighbor search during testing is also time-consuming. Unlike these solutions, we apply search only in the training phase, with an improved similarity measure. Testing is very efficient and needs no post-processing.
Another important issue is that inpainting can take multiple candidates to fill holes. Thus, optimal results should be constrained in a spatially variant way: pixels close to the hole boundary have few choices, while the central part can be less constrained. In fact, adversarial loss has already been used in recent methods pathak2016context (); yeh2017semantic (); yang2017high (); iizuka2017globally (); yu2018generative () to learn multi-modality. Various weights are applied to the loss pathak2016context (); yeh2017semantic (); yu2018generative () for boundary consistency. In this paper, we design a new spatially variant weight to better handle this issue.
The overall framework is a Generative Multi-column Convolutional Neural Network (GMCNN) for image inpainting. The multi-column structure ciregan2012multi (); zhang2016single (); agostinelli2013adaptive () is used since it can decompose images into components with different receptive fields and feature resolutions. Unlike multi-scale or coarse-to-fine strategies yang2017high (); karras2017progressive () that use resized images, branches in our multi-column network directly use full-resolution input to characterize multi-scale feature representations regarding global and local information. A new implicit diversified Markov random field (ID-MRF) term is proposed and used in the training phase only. Rather than directly using the matched feature, which may lead to visual artifacts, we incorporate this term as regularization.
Additionally, we design a new confidence-driven reconstruction loss that constrains the generated content according to the spatial location. With all these improvements, the proposed method can produce high-quality results considering boundary consistency, structure suitability and texture similarity, without any post-processing operations. Example inpainting results are given in Figure 1.
2 Related Work
Exemplar-based Inpainting
Among traditional methods, exemplar-based inpainting he2012statistics (); he2014image (); criminisi2004region (); sun2005image (); jia2003image (); jia2004inference (); barnes2009patchmatch () copies and pastes matching patches in a pre-defined order. To preserve structure, patch priority computation specifies the patch filling order criminisi2004region (); he2012statistics (); he2014image (); sun2005image (). With only low-level information, these methods cannot produce high-quality semantic structures that do not exist in examples, e.g., faces and facades.
CNN Inpainting
Since the seminal context-encoder work pathak2016context (), deep CNNs have achieved significant progress. Pathak et al. proposed training an encoder-decoder CNN by minimizing pixel-wise reconstruction loss and adversarial loss. Building upon the context encoder, iizuka2017globally () used global and local discriminators to improve the adversarial loss, adopting a fully convolutional encoder-decoder structure. Besides the encoder-decoder, a U-Net-like structure was also used yan2018shift ().
Yang et al. yang2017high () and Yu et al. yu2018generative () introduced coarse-to-fine CNNs for image inpainting. To generate more plausible and detailed texture, a combination of CNN and Markov Random Field yang2017high () was used as a post-process to improve the inpainting results from the coarse CNN. It is inevitably slow due to iterative MRF inference. More recently, Yu et al. conducted nearest-neighbor search in deep feature space yu2018generative (), which brings clearer texture to the filled regions compared with previous strategies of a single forward pass.
3 Our Method
Our inpainting system is trainable in an end-to-end fashion. It takes an image $X$ and a binary region mask $M$ (with value 0 for known pixels and 1 otherwise) as input. Unknown regions in image $X$ are filled with zeros. It outputs a complete image $\hat{Y}$. We detail our network design below.
3.1 Network Structure
Our proposed Generative Multi-column Convolutional Neural Network (GMCNN), shown in Figure 2, consists of three sub-networks: a generator to produce results, global and local discriminators for adversarial training, and a pretrained VGG network simonyan2014very () to calculate the ID-MRF loss. In the testing phase, only the generator network is used.
The generator network consists of $n$ ($n = 3$) parallel encoder-decoder branches to extract different levels of features from the input $X$ with mask $M$, and a shared decoder module to transform deep features into natural image space. We choose various receptive fields and spatial resolutions for these branches as shown in Figure 2, which capture different levels of information. Branches are denoted as $\{B_i(\cdot)\}$ ($i \in \{1, \dots, n\}$), trained in a data-driven manner to generate better feature components than handcrafted decomposition.
Then these components are up-sampled (bilinearly) to the original resolution and concatenated into a feature map $F$. We further transform the features $F$ into image space via a shared decoding module with 2 convolutional layers, denoted as $d(\cdot)$. The output is $\hat{Y} = d(F)$. Minimizing the difference between the output $\hat{Y}$ and the ground truth $Y$ makes the branches capture appropriate components of the input for inpainting. The shared decoder $d(\cdot)$ further transforms such deep features into our desired result. Note that although the branches seem independent of each other, they are mutually influenced during training through the shared decoder $d(\cdot)$.
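The fusion step can be illustrated with a small NumPy sketch. It assumes channels-last feature maps, an align-corners style of bilinear interpolation, and made-up branch output shapes; the helper names `bilinear_resize` and `fuse_branches` are illustrative and are not taken from the authors' implementation.

```python
import numpy as np

def bilinear_resize(feat, out_h, out_w):
    """Bilinearly resize an (H, W, C) feature map (align-corners convention, an assumption)."""
    h, w, _ = feat.shape
    ys = np.linspace(0, h - 1, out_h)          # source row coordinate of each output row
    xs = np.linspace(0, w - 1, out_w)          # source column coordinate of each output column
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]              # vertical interpolation weights
    wx = (xs - x0)[None, :, None]              # horizontal interpolation weights
    top = feat[y0][:, x0] * (1 - wx) + feat[y0][:, x1] * wx
    bot = feat[y1][:, x0] * (1 - wx) + feat[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def fuse_branches(branch_feats, out_h, out_w):
    """Up-sample every branch output to (out_h, out_w) and concatenate along channels."""
    resized = [bilinear_resize(f, out_h, out_w) for f in branch_feats]
    return np.concatenate(resized, axis=-1)

# Hypothetical branch outputs at three different resolutions.
branches = [np.ones((4, 4, 2)), np.ones((8, 8, 3)), np.ones((16, 16, 4))]
fused = fuse_branches(branches, 16, 16)        # shape (16, 16, 9)
```

In a real network the fused map would then pass through the shared decoding layers; that part is omitted here.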
Our framework is by nature different from the commonly used one-stream encoder-decoder structure and the coarse-to-fine architecture yang2017high (); yu2018generative (); karras2017progressive (). The encoder-decoder transforms the image into a common feature space with the same-size receptive field, ignoring the fact that inpainting involves different levels of representations. In contrast, the multi-branch encoders in our GMCNN do not have this problem. Our method also overcomes the limitation of the coarse-to-fine architecture, which paints the missing pixels from small to larger scales, so that errors at the coarse level already influence refinement. Our GMCNN incorporates different structures in parallel. They complement each other instead of simply inheriting information.
3.2 ID-MRF Regularization
Here, we address the aforementioned issues of semantic structure matching and computationally heavy iterative MRF optimization. Our scheme applies MRF-like regularization only in the training phase, named implicit diversified Markov random fields (ID-MRF). The proposed network is optimized to minimize the difference between generated content and the corresponding nearest neighbors from the ground truth in the feature space. Since we only use it in training, complete ground truth images make it possible to know high-quality nearest neighbors and give appropriate constraints for the network.
To calculate ID-MRF loss, it is possible to simply use direct similarity measure (e.g. cosine similarity) to find the nearest neighbors for patches in generated content. But this procedure tends to yield smooth structure, as a flat region easily connects to similar patterns and quickly reduces structure variety, as shown in Figure 3(a). We instead adopt a relative distance measure mechrez2018contextual (); mechrez2018learning (); talmi2017template () to model the relation between local features and target feature set. It can restore subtle details as illustrated in Figure 3(b).
Specifically, let $\hat{Y}_g$ be the generated content for the missing regions, and let $\hat{Y}_g^L$ and $Y^L$ be the features of $\hat{Y}_g$ and of the ground truth $Y$ generated by the $L$-th feature layer of a pretrained deep model. For neural patches $v$ and $s$ extracted from $\hat{Y}_g^L$ and $Y^L$ respectively, the relative similarity $RS(v, s)$ from $v$ to $s$ is defined as
where $\mu(\cdot, \cdot)$ is the cosine similarity, $r \in \rho_s(Y^L)$ means that $r$ belongs to $Y^L$ excluding $s$, and $h$ and $\epsilon$ are two positive constants. If $v$ is more similar to $s$ than to the other neural patches in $Y^L$, $RS(v, s)$ becomes large.
Next, $RS(v, s)$ is normalized to $\overline{RS}(v, s)$ as
Finally, with Eq. (2), the ID-MRF loss between $\hat{Y}_g^L$ and $Y^L$ is defined as
where $Z$ is a normalization factor. For each $s \in Y^L$, $v' = \arg\max_{v \in \hat{Y}_g^L} \overline{RS}(v, s)$ means that $v'$ is closer to $s$ than the other neural patches in $\hat{Y}_g^L$. In the extreme case where all neural patches in $\hat{Y}_g^L$ are close to one patch $s$, the other patches $r$ have a small $\max_v \overline{RS}(v, r)$, so the ID-MRF loss is large.
On the other hand, when the patches in $\hat{Y}_g^L$ are close to different candidates in $Y^L$, each $r$ in $Y^L$ has its unique nearest neighbor in $\hat{Y}_g^L$. The resulting $\max_v \overline{RS}(v, r)$ is thus large and the ID-MRF loss becomes small. We show one example in the supplementary file. From this perspective, minimizing the ID-MRF loss encourages each $v$ in $\hat{Y}_g^L$ to approach different neural patches in $Y^L$, diversifying neighbors, as shown in Figure 3(b).
An obvious benefit of this measure is that it improves the similarity between the feature distributions in $\hat{Y}_g^L$ and $Y^L$. By minimizing the ID-MRF loss, not only do local neural patches in $\hat{Y}_g^L$ find corresponding candidates in $Y^L$, but the feature distributions also come closer, helping capture variation in complicated texture.
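The behaviour described above, a large loss when all generated patches collapse onto one reference patch and a small loss when they spread over different ones, can be checked with a minimal NumPy sketch of a relative-similarity loss. This is an illustration under stated assumptions, not the authors' exact formulation: patches are given as flat feature vectors with non-negative entries (so cosine similarities stay in [0, 1]), the normalization runs over all reference patches, the normalization factor is taken to be the number of reference patches, and the default values of `h` and `eps` are hypothetical.

```python
import numpy as np

def id_mrf_loss(gen, ref, h=0.5, eps=1e-5):
    """Sketch of a relative-similarity loss between two patch sets.

    gen: (N, C) generated neural patches; ref: (M, C) reference patches, M >= 2.
    Assumes non-negative features so that cosine similarities are in [0, 1].
    """
    gen_n = gen / (np.linalg.norm(gen, axis=1, keepdims=True) + 1e-12)
    ref_n = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + 1e-12)
    mu = gen_n @ ref_n.T                            # (N, M) cosine similarities
    m = mu.shape[1]
    # For each pair (v, s): best similarity of v to any reference patch other than s.
    masked = np.where(np.eye(m, dtype=bool)[None], -np.inf, mu[:, None, :])
    max_other = masked.max(axis=2)                  # (N, M)
    rs = np.exp((mu / (max_other + eps)) / h)       # relative similarity
    rs_norm = rs / rs.sum(axis=1, keepdims=True)    # normalize over reference patches (assumption)
    # For each reference patch keep its best-matching generated patch, then average.
    return -np.log(rs_norm.max(axis=0).mean())

ref = np.eye(3) + 0.1                               # three distinct reference patches
loss_diverse = id_mrf_loss(ref.copy(), ref)         # each generated patch matches a different one
loss_collapsed = id_mrf_loss(np.tile(ref[0], (3, 1)), ref)  # all match the same one
```

With these toy inputs the collapsed case gives the larger loss, which is the diversifying effect the text attributes to the ID-MRF term.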
Our final ID-MRF loss is computed on several feature layers from VGG19. Following common practice gatys2016image (); li2016combining (), we use conv4_2 to describe image semantic structures. Then conv3_2 and conv4_2 are utilized to describe image texture as
During training, ID-MRF regularizes the generated content based on the reference. It has a strong ability to create realistic texture locally and globally. We note the fundamental difference from the methods of yang2017high (); yu2018generative (), where nearest-neighbor search via networks is employed in the testing phase. Our ID-MRF regularization exploits both reference and contextual information inside and outside the filling regions, and thus yields high diversity in inpainting structure generation.
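Read literally, the layer choice described for the final ID-MRF loss combines one structure term on conv4_2 with two texture terms on conv3_2 and conv4_2. Assuming the per-layer losses are simply summed without additional weights (an assumption), and writing $L_M(\cdot)$ for the per-layer ID-MRF loss (a symbol introduced here for illustration), the combination can be sketched as:

```latex
\mathcal{L}_{mrf} = L_M(\text{conv4\_2}) + \sum_{t=3}^{4} L_M(\text{conv}t\text{\_2})
```

Under this reading conv4_2 contributes twice, once for semantic structure and once for texture.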
3.3 Information Fusion
Spatial Variant Reconstruction Loss
Pixel-wise reconstruction loss is important for inpainting pathak2016context (); yeh2017semantic (); yu2018generative (). To exert constraints based on spatial location, we design the confidence-driven reconstruction loss, in which unknown pixels close to the filling boundary are more strongly constrained than those away from it. We set the confidence of known pixels to 1 and that of unknown pixels according to their distance to the boundary. To propagate the confidence of known pixels to unknown ones, we use a Gaussian filter to convolve the confidence map and create a loss weight mask as
where is with size and its standard deviation is . and . is the Hadamard product operator. Eq. (5) is repeated several times to generate . The final reconstruction loss is
where $G([X, M]; \theta)$ is the output of our generative model $G$ for the input image $X$ and mask $M$, and $\theta$ denotes the learnable parameters.
Compared with the reconstruction loss used in pathak2016context (); yeh2017semantic (); yu2018generative (), ours exploits spatial locations and their relative order by considering confidence on both known and unknown pixels. This gradually shifts the learning focus from the filling border to the center and smooths the learning curve.
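The confidence propagation can be sketched in NumPy as follows. This is one reading of the procedure described above, not the authors' code: the kernel size, standard deviation and iteration count are hypothetical defaults, the blur uses zero padding, and the function names are illustrative. Known pixels carry confidence 1, the blurred confidence is kept only inside the hole via an element-wise product with the mask, and the step is repeated.

```python
import numpy as np

def gaussian_kernel1d(size, sigma):
    """Normalized 1-D Gaussian kernel; size should be odd."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur(img, k):
    """Separable Gaussian blur with zero padding (kernel must be shorter than each image side)."""
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def confidence_weight_mask(mask, size=7, sigma=2.0, iters=4):
    """mask: (H, W) array with 1 = unknown pixel, 0 = known pixel."""
    mask = np.asarray(mask, dtype=float)
    k = gaussian_kernel1d(size, sigma)
    weight = np.zeros_like(mask)
    for _ in range(iters):
        confidence = 1.0 - mask + weight       # known pixels: 1, unknown pixels: current weight
        weight = blur(confidence, k) * mask    # element-wise product keeps hole pixels only
    return weight

mask = np.zeros((15, 15))
mask[5:10, 5:10] = 1.0                         # a square hole in the middle
w_first = confidence_weight_mask(mask, iters=1)
w_final = confidence_weight_mask(mask)
```

After the first pass, hole pixels next to the boundary already receive a larger weight than the hole center, which matches the intent of constraining boundary pixels more strongly.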
Adversarial Loss
Adversarial loss is a catalyst in filling missing regions and has become common in many creation tasks. Similar to iizuka2017globally (); yu2018generative (), we apply the improved Wasserstein GAN gulrajani2017improved () and use local and global discriminators. For the generator, the adversarial loss is defined as
where and .
3.4 Final Objective
With confidence-driven reconstruction loss, ID-MRF loss, and adversarial loss, the model objective of our net is defined as
where $\lambda_{mrf}$ and $\lambda_{adv}$ are used to balance the effects between local structure regularization and adversarial training.
We first train our model with only the confidence-driven reconstruction loss, setting $\lambda_{mrf}$ and $\lambda_{adv}$ to 0 to stabilize the later adversarial training. After our model converges, we set and for fine tuning until convergence. The training procedure is optimized using the Adam solver kingma2014adam () with learning rate . We set and . The batch size is 16.
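The two-stage schedule can be sketched as below, assuming the objective is a weighted sum of the three losses. The function names, stage labels and the fine-tuning weights passed in the example are hypothetical placeholders, not the values used in the paper.

```python
def total_loss(l_conf, l_mrf, l_adv, lam_mrf, lam_adv):
    """Weighted sum of reconstruction, ID-MRF and adversarial terms (assumed form)."""
    return l_conf + lam_mrf * l_mrf + lam_adv * l_adv

def stage_weights(stage, lam_mrf_finetune, lam_adv_finetune):
    """Return (lam_mrf, lam_adv) for the given training stage."""
    if stage == "warmup":
        # Reconstruction loss only, to stabilize the later adversarial training.
        return 0.0, 0.0
    if stage == "finetune":
        return lam_mrf_finetune, lam_adv_finetune
    raise ValueError(f"unknown stage: {stage}")

# Placeholder weights, chosen only so the arithmetic is easy to follow.
lam_mrf, lam_adv = stage_weights("finetune", 0.5, 0.1)
loss_finetune = total_loss(2.0, 4.0, 10.0, lam_mrf, lam_adv)
loss_warmup = total_loss(2.0, 4.0, 10.0, *stage_weights("warmup", 0.5, 0.1))
```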
For an input image $Y$, a binary image mask $M$ (with value 0 for known and 1 for unknown pixels) is sampled at a random location. The input image $X$ is produced as $X = Y \odot (1 - M)$. Our model $G$ takes the concatenation of $X$ and $M$ as input. The final prediction is . All input and output are linearly scaled within range .
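A minimal NumPy sketch of this input preparation follows. The rectangular hole shape, the helper names and the compositing rule in `composite` (known pixels copied from the original image, hole pixels taken from the generator output) are assumptions made for illustration; intensity scaling is left out.

```python
import numpy as np

def random_rect_mask(h, w, hole_h, hole_w, rng):
    """Binary mask of shape (h, w, 1): 1 inside a randomly placed rectangle, 0 elsewhere."""
    top = rng.integers(0, h - hole_h + 1)
    left = rng.integers(0, w - hole_w + 1)
    mask = np.zeros((h, w, 1))
    mask[top:top + hole_h, left:left + hole_w] = 1.0
    return mask

def make_input(image, mask):
    """Zero out unknown pixels and append the mask as an extra channel."""
    masked = image * (1.0 - mask)
    return np.concatenate([masked, mask], axis=-1)

def composite(image, generated, mask):
    """Keep known pixels from the image and fill the hole from the generator output (assumed rule)."""
    return image * (1.0 - mask) + generated * mask

rng = np.random.default_rng(0)
image = np.ones((8, 8, 3))
mask = random_rect_mask(8, 8, 4, 4, rng)
net_input = make_input(image, mask)            # shape (8, 8, 4)
generated = np.zeros((8, 8, 3))                # stand-in for a generator output
result = composite(image, generated, mask)
```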
4 Experiments
We evaluate our method on five datasets: Paris street view pathak2016context (), Places2 zhou2017places (), ImageNet russakovsky2015imagenet (), CelebA liu2015deep (), and CelebA-HQ karras2017progressive ().
4.1 Experimental Settings
We train our models on the training set and evaluate them on the testing set (for Paris street view) or validation set (for Places2, ImageNet, CelebA, and CelebA-HQ). In training, we use images of resolution with the largest hole size in random positions. For Paris street view, Places2, and ImageNet, images are randomly cropped and scaled from the full-resolution images. For CelebA and CelebA-HQ face datasets, images are scaled to . None of the results given in this paper are post-processed.
Our implementation uses TensorFlow v1.4.1, cuDNN v6.0, and CUDA v8.0. The hardware is an Intel CPU E5 (2.60GHz) and a TITAN X GPU. Our model costs 49.37ms and 146.11ms per image on GPU for testing images with size and , respectively. Using ID-MRF in the training phase costs 784ms more per batch (with 16 images of pixels). The total number of parameters of our generator network is 12.562M.
4.2 Qualitative Evaluation
As shown in Figures 8 and 10, compared with other methods, ours gives obvious visual improvement on plausible image structures and crisp textures. The more reasonably generated structures mainly stem from the multi-column architecture and confidence-driven reconstruction loss. The realistic textures are created via ID-MRF regularization and adversarial training by leveraging the contextual and corresponding textures.
In Figure 9, we show partial results of our method and CA yu2018generative () on CelebA and CelebA-HQ face datasets. Since we do not apply MRF in a non-parametric manner, visual artifacts are much reduced. It is notable that finding suitable patches for these faces is challenging. Our ID-MRF regularization remedies the problem. Even the face shadow and reflectance can be generated as shown in Figure 9.
4.3 Quantitative Evaluation
Although the generation task is not well suited to evaluation by peak signal-to-noise ratio (PSNR) or structural similarity (SSIM), for completeness we still report them on the testing or validation sets of the four datasets used, for reference. In ImageNet, only 200 images are randomly chosen for evaluation since MSNPS yang2017high () takes minutes to complete a size image. As shown in Table 1, our method produces decent results with comparable or better PSNR and SSIM.
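For reference, PSNR follows the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below assumes 8-bit images with a peak value of 255; the paper does not state the peak value or color handling used in its evaluation, so those choices are assumptions.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")                    # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

ground_truth = np.zeros((4, 4, 3), dtype=np.uint8)
off_by_one = ground_truth + 1                  # every pixel differs by exactly 1, so MSE = 1
score = psnr(ground_truth, off_by_one)
```

Because PSNR rewards pixel-wise agreement, a plausible but different completion can score poorly, which is why the text treats it as a reference number only.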
We also conduct user studies, as shown in Table 2. The protocol is based on large batches of blind randomized A/B tests deployed on the Google Forms platform. Each survey involves a batch of 40 pairwise comparisons. Each pair contains two images completed from the same corrupted input by two different methods. Forty participants were invited for the user study and asked to select the more realistic image in each pair. The images are all shown at the same resolution (). The comparisons are randomized across conditions and the left-right order is randomized. All images are shown for unlimited time, and participants are free to spend as much time as desired on each pair. In all conditions, our method outperforms the baselines.
Table 1 (PSNR and SSIM):
|Method|Paris street view-100|ImageNet-200|Places2-2K|CelebA-HQ-2K|
Table 2 (user study):
|Paris street view|ImageNet|Places2|CelebA|CelebA-HQ|
|GMCNN > CE|
|GMCNN > MSNPS|
|GMCNN > CA|
Table 3 (model structures):
|Model|Encoder-decoder|Coarse-to-fine|GMCNN-f|GMCNN-v w/o ID-MRF|GMCNN-v|
4.4 Ablation Study
Single Encoder-Decoder vs. Coarse-to-Fine vs. GMCNN
We evaluate our multi-column architecture by comparing it with a single encoder-decoder and a coarse-to-fine network with two sequential encoder-decoders (the same as that in yu2018generative () except with no contextual layer). The single encoder-decoder is the same as our branch three (B3). To minimize the influence of model capacity, we triple the filter sizes in the single encoder-decoder architecture to make its parameter size as close to ours as possible. The loss for these three structures is the same, including confidence-driven reconstruction loss, ID-MRF loss, and WGAN-GP adversarial loss. The corresponding hyper-parameters are the same. The testing results are shown in Figure 4. Our GMCNN structure, with varied receptive fields in each branch, predicts more reasonable image structure and texture than the single encoder-decoder and coarse-to-fine structures. An additional quantitative experiment is given in Table 3, showing that the proposed structure helps restore image fidelity.
Varied Receptive Fields vs. Fixed Receptive Field
We then validate the necessity of using varied receptive fields in branches. The GMCNN with the same receptive field in each branch uses 3 identical third branches in Figure 2 with filter size . Figure 4 shows that, within the GMCNN structure, branches with varied receptive fields give visually more appealing results.
Spatial Discounted Reconstruction Loss vs. Confidence-Driven Reconstruction Loss
We compare our confidence-driven reconstruction loss with the alternative spatial discounted reconstruction loss yu2018generative (). We use a single-column CNN trained only with the losses on the Paris street view dataset. The testing results are given in Figure 5. Our confidence-driven reconstruction loss works better.
With and without ID-MRF Regularization
We train a complete GMCNN on the Paris street view dataset with all losses, and one model that does not involve ID-MRF. As shown in Figure 6, ID-MRF can significantly enhance local details. Table 4 and Figure 7 also give the quantitative and qualitative changes showing how the ID-MRF weight $\lambda_{mrf}$ affects inpainting performance. Empirically, strikes a good balance.
5 Conclusion
We have primarily addressed the important problems of representing visual context and using it to generate and constrain unknown regions in inpainting. We have proposed a generative multi-column neural network for this task and showed its ability to model different image components and extract multi-level features. Additionally, the ID-MRF regularization is very helpful to model realistic texture with a new similarity measure. Our confidence-driven reconstruction loss also considers spatially variant constraints. Our future work will be to explore other constraints with location and content.
Similar to other generative neural networks pathak2016context (); yang2017high (); yu2018generative (); yeh2017semantic () for inpainting, our method still has difficulties dealing with large-scale datasets with thousands of diverse object and scene categories, such as ImageNet. When data falls into a few categories, our method works best, since the ambiguity removal in terms of structure and texture can be achieved in these cases.
References
- F. Agostinelli, M. R. Anderson, and H. Lee. Adaptive multi-column deep neural networks with application to robust image denoising. In NIPS, pages 1493–1501, 2013.
- C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. TOG, 28(3):24, 2009.
- D. Ciregan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, pages 3642–3649. IEEE, 2012.
- A. Criminisi, P. Pérez, and K. Toyama. Region filling and object removal by exemplar-based image inpainting. TIP, 13(9):1200–1212, 2004.
- L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, pages 2414–2423. IEEE, 2016.
- I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NIPS, pages 5769–5779, 2017.
- K. He and J. Sun. Statistics of patch offsets for image completion. In ECCV, pages 16–29. Springer, 2012.
- K. He and J. Sun. Image completion approaches using the statistics of similar patches. TPAMI, 36(12):2423–2435, 2014.
- S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. TOG, 36(4):107, 2017.
- J. Jia and C.-K. Tang. Image repairing: Robust image synthesis by adaptive ND tensor voting. In CVPR, volume 1, pages I–I. IEEE, 2003.
- J. Jia and C.-K. Tang. Inference of segmented color and texture description by tensor voting. TPAMI, 26(6):771–786, 2004.
- T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
- D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- C. Li and M. Wand. Combining Markov random fields and convolutional neural networks for image synthesis. In CVPR, pages 2479–2486, 2016.
- Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, pages 3730–3738, 2015.
- R. Mechrez, I. Talmi, F. Shama, and L. Zelnik-Manor. Learning to maintain natural image statistics. arXiv preprint arXiv:1803.04626, 2018.
- R. Mechrez, I. Talmi, and L. Zelnik-Manor. The contextual loss for image transformation with non-aligned data. arXiv preprint arXiv:1803.02077, 2018.
- D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- J. Sun, L. Yuan, J. Jia, and H.-Y. Shum. Image completion with structure propagation. In TOG, volume 24, pages 861–868. ACM, 2005.
- I. Talmi, R. Mechrez, and L. Zelnik-Manor. Template matching with deformable diversity similarity. In CVPR, pages 175–183, 2017.
- Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan. Shift-net: Image inpainting via deep feature rearrangement. arXiv preprint arXiv:1801.09392, 2018.
- C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In CVPR, volume 1, page 3, 2017.
- R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with deep generative models. In CVPR, pages 5485–5493, 2017.
- J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. arXiv preprint arXiv:1801.07892, 2018.
- Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-image crowd counting via multi-column convolutional neural network. In CVPR, pages 589–597, 2016.
- B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2017.