Image Fine-grained Inpainting


Image inpainting techniques have recently shown promising improvement with the assistance of generative adversarial networks (GANs). However, most of them still produce completed results with unreasonable structure or blurriness. To mitigate this problem, we present a one-stage model that uses dense combinations of dilated convolutions to obtain larger and more effective receptive fields. Benefiting from this property, the network can more easily recover large regions in an incomplete image. To better train this efficient generator, in addition to the frequently used VGG feature matching loss, we design a novel self-guided regression loss that concentrates on uncertain areas and enhances semantic details. Besides, we devise a geometrical alignment constraint term to compensate for the pixel-based distance between predicted features and ground-truth ones. We also employ a discriminator with local and global branches to ensure local-global content consistency. To further improve the quality of generated images, we introduce discriminator feature matching on the local branch, which dynamically minimizes the discrepancy between intermediate features of synthetic and ground-truth patches. Extensive experiments on several public datasets demonstrate that our approach outperforms current state-of-the-art methods. Code is available at


1 Introduction

Image inpainting (a.k.a. image completion) aims to synthesize proper content in the missing regions of an image, and it is used in many applications. For instance, in image editing it allows unwanted objects to be removed while the vacated area is filled with content that is visually realistic and semantically correct. Early approaches to image inpainting are mostly based on patches of low-level features. PatchMatch [2], a typical method, iteratively searches for optimal patches to fill in the holes. It can produce plausible results when inpainting image backgrounds or repetitive textures. However, it cannot generate pleasing results when the regions to be completed contain complex scenes, faces, or objects, because PatchMatch cannot synthesize new image content and, in challenging cases, the missing patches cannot be found in the remaining regions.

With the rapid development of deep convolutional neural networks (CNNs) and generative adversarial networks (GANs) [5], image inpainting approaches have achieved remarkable success. Pathak et al. proposed the context encoder [19], which employs a deep generative model to predict missing parts of the scene from their surroundings using reconstruction and adversarial losses. Yang et al. [29] introduced style transfer into image inpainting to improve textural quality by propagating high-frequency textures from the boundary to the hole. Li et al. [16] incorporated semantic parsing into the generation process so that the contents synthesized from random noise for the missing key facial parts are semantically valid. To complete large regions, Iizuka et al. [8] adopted stacked dilated convolutions in their image completion network to obtain larger spatial support and reached realistic results with the assistance of a globally and locally consistent adversarial training approach. Shortly afterward, Yu et al. [30] extended this insight and developed a novel contextual attention layer, which uses the features of known patches as convolutional kernels to compute the correlation between foreground and background patches. More specifically, they calculated an attention score for each pixel and then performed transposed convolution on the attention scores to reconstruct missing patches from known patches. This approach may fail when the relationship between unknown and known patches is weak (e.g., when all of the critical components of a facial image are masked). Wang et al. [26] proposed a generative multi-column convolutional neural network (GMCNN) that uses varied receptive fields in its branches by adopting different sizes of convolution kernels (i.e., , and ) in a parallel manner. This method achieves strong performance but suffers from a large number of model parameters (12.562M) caused by the large convolution kernels. In terms of image quality (more photo-realistic, fewer artifacts), there is still room for improvement.


Figure 1: Visual comparisons on CelebA-HQ. Columns, left to right: ground-truth image, input image, CA [30], GMCNN [26], and DMFN (Ours). Best viewed with zoom-in.

The goals of image inpainting are to produce images with a coherent global semantic structure and finely detailed textures. Additionally, the completed image should approach the ground truth as closely as possible, especially for building and face images. Previous techniques focus more on how to yield holistically reasonable and photo-realistic images. This problem has been mitigated by GAN [5] or its improved version WGAN-GP [6], which is frequently used in image inpainting methods [19, 8, 29, 16, 30, 24, 28, 26, 33, 31]. However, there is still much room for improvement in fine-grained details. Besides, these existing methods have not taken into account the consistency between outputs and targets, i.e., semantic structures should be as similar as possible for facial images and building images.

To overcome the limitations of the methods mentioned above, we present a unified generative network for image inpainting, denoted the dense multi-scale fusion network (DMFN). The dense multi-scale fusion block (DMFB), which serves as the basic block of DMFN, is composed of four-way dilated convolutions as illustrated in Figure 2. This basic block combines and fuses hierarchical features extracted by convolutions with different dilation rates to obtain better multi-scale features than a general dilated convolution (dense vs. sparse). To generate images with realistic semantic structure, we design a self-guided regression loss that constrains low-level features of the generated content according to the normalized discrepancy map (the difference between the output and the target). A geometrical alignment constraint is developed to penalize the coordinate center of the estimated high-level image features for deviating from that of the ground truth. This loss further helps fine-grained image inpainting. We improve the discriminator using relativistic average GAN (RaGAN) [10]. It is noteworthy that we use global and local branches in the discriminator as in [8], where one branch focuses on the global image while the other concentrates on the local patch of the missing region. To explicitly constrain the output and ground-truth images, we use the hidden layers of the local branch of the discriminator to evaluate their discrepancy during adversarial training. With all these improvements, the proposed method produces high-quality results on multiple datasets, including face, building, and natural scene images.

Our contributions are summarized as follows:

  • We propose a novel self-guided regression loss to explicitly correct the low-level features according to the normalized error map computed from the output and ground-truth images. This loss can significantly improve the semantic structure and fidelity of images.

  • We present a geometrical alignment constraint to compensate for the shortcomings of the pixel-based VGG feature matching loss.

  • We propose a dense multi-scale fusion generator with a strong representation ability for extracting useful features. Compared with previous state-of-the-art approaches, our generative image inpainting framework achieves compelling visual results (as illustrated in Figure 1) on challenging datasets.

2 Related Work

A variety of algorithms for image inpainting have been proposed. Traditional diffusion-based methods [3, 1] propagate information from neighboring regions into the holes. They work well for small and narrow holes, where the texture and color variance are the same. However, these methods fail to recover meaningful content in large missing regions. Patch-based approaches, such as [4, 14], iteratively search for relevant patches in the known regions. Simakov et al. [22] proposed a bidirectional similarity scheme to better capture and summarize non-stationary visual data. However, these methods are computationally expensive because they calculate similarity scores for every output-target pair. To relieve this problem, PatchMatch [2] was proposed, which speeds up the process with a faster similar-patch search algorithm.

Recently, deep learning and GAN-based algorithms have become a remarkable paradigm for image inpainting. Context Encoders (CE) [19] embed the image with a center hole into a low-dimensional feature vector and then decode it to an image. Iizuka et al. [8] proposed a high-performance completion network with both global and local discriminators, which are critical for obtaining semantically and locally consistent inpainting results. The authors also employ dilated convolution layers to increase the receptive fields of the output neurons. Yang et al. [29] use intermediate features extracted by a pre-trained VGG network [23] to find the patch outside the hole that is most similar to the hole. This approach performs multi-scale neural patch synthesis in a coarse-to-fine manner, which takes noticeably long to fill a large image at inference time. For face completion, Li et al. [16] trained a deep generative model with a combination of a reconstruction loss, global and local adversarial losses, and a semantic parsing loss specialized for face images. Contextual Attention (CA) [30] adopted a two-stage network architecture in which the first stage produces a crude result, and the second, a refinement network using an attention mechanism, takes the coarse prediction as input and improves fine details. Liu et al. [17] introduced partial convolution, which operates only on valid pixels, and presented an automatically updated binary mask to determine whether the current pixels are valid. Substituting partial convolutions for convolutional layers helps a U-Net-like architecture [9] achieve state-of-the-art inpainting results. Yan et al. [28] introduced a special shift-connection to the U-Net architecture to enhance sharp structures and fine-detailed textures in the filled holes. This method was mainly developed on building and natural landscape images.
Similar to [29, 30], Song et al. [24] decoupled the completion process into two stages: coarse inference and fine texture translation. Nazeri et al. [18] also proposed a two-stage network that comprises an edge generator and an image completion network. Similarly, Li et al. [15] progressively incorporated edge information into the features to output more structured images. Xiong et al. [27] inferred the contours of the objects in the image and then used the completed contours as guidance to complete the image. In contrast to the frequently used two-stage processing [20], Sagong et al. [21] proposed a parallel path for semantic inpainting to reduce computational cost.

3 Proposed Method

Figure 2: The architecture of the proposed dense multi-scale fusion block (DMFB). Here, “Conv-3-8” indicates convolution layer with the dilation rate of and is element-wise summation. Instance normalization (IN) and ReLU activation layers followed by the first convolution, second column convolutions and concatenation layer are omitted for brevity. The last convolutional layer only connects an IN layer. The number of output channels for each convolution is set to except for the last convolution (256 channels) in DMFB.
Figure 3: The framework of our method. The activation layer followed by each “convolution + norm” or convolution layer in the generator is omitted for conciseness. The activation function adopts ReLU except for the last convolution (Tanh) in the generator. Blue dotted box indicates our upsampler module (TConv-4 is transposed convolution) and “” denotes the stride of 2.

Our proposed inpainting system is trained end to end. Given an input image with hole , its corresponding binary mask (value for known pixels and denotes unknown ones), the output predicted by the network, and the ground-truth image . We take the input image and mask as inputs, i.e., . We now elaborate on our network as follows.

3.1 Network structure

As depicted in Figure 3, our framework consists of a generator and a discriminator with two branches. The generator produces plausible inpainted results, and the discriminator conducts adversarial training.

For the image inpainting task, the receptive fields should be sufficiently large. Dilated convolution is widely adopted in previous works [8, 30] for this purpose. It enlarges the area that each output can use as input without increasing the number of learnable weights. However, the kernel of a dilated convolution is sparse and skips many pixels during computation. Large convolution kernel (e.g., ) is applied in [26] to implement this intention. However, this solution introduces heavy model parameters. To enlarge the receptive fields and keep the convolution kernels dense at the same time, we propose the dense multi-scale fusion block (DMFB, see Figure 2), inspired by [7]. Specifically, the first convolution on the left in DMFB reduces the channels of input features to for decreasing the parameters, and then these processed features are sent to four branches to extract multi-scale features, denoted as (), by using dilated convolutions with different dilation factors. Except for , each has a corresponding convolution, denoted by . Through cumulative addition, we obtain dense multi-scale features from the combination of various sparse multi-scale features. We denote by the output of . The combination part can be formulated as


The next step fuses the concatenated features using a single convolution. In short, this basic block improves on the general dilated convolution and has fewer parameters than large kernels.
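
To make the dense-versus-sparse argument concrete, the following minimal sketch first computes how few positions a dilated 3×3 kernel actually samples and then combines four branch outputs cumulatively. It is an illustration only, not the released implementation: the dilation rates (1, 2, 4, 8), the combination rule y_1 = x_1, y_i = conv_i(y_{i-1} + x_i), and the names `effective_kernel_size`, `sampled_fraction`, and `dense_combination` are assumptions, and the per-branch convolutions are stood in for by placeholder callables.

```python
import numpy as np


def effective_kernel_size(kernel: int, dilation: int) -> int:
    """Spatial extent covered along one axis by a dilated kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def sampled_fraction(kernel: int, dilation: int) -> float:
    """Fraction of positions inside the covered window that the kernel reads."""
    extent = effective_kernel_size(kernel, dilation)
    return kernel ** 2 / extent ** 2


def dense_combination(branches, convs):
    """Cumulatively combine branch outputs (assumed rule, see lead-in).

    branches: arrays x_1..x_n produced by the dilated convolutions.
    convs:    n-1 callables standing in for the per-branch convolutions.
    Returns [y_1..y_n] with y_1 = x_1 and y_i = conv_i(y_{i-1} + x_i).
    """
    outputs = [branches[0]]
    for x, conv in zip(branches[1:], convs):
        outputs.append(conv(outputs[-1] + x))
    return outputs


if __name__ == "__main__":
    for rate in (1, 2, 4, 8):  # assumed dilation rates
        print(rate, effective_kernel_size(3, rate), sampled_fraction(3, rate))

    # Toy features of shape (channels, height, width); identity "convolutions".
    xs = [np.full((1, 2, 2), float(v)) for v in (1, 2, 3, 4)]
    ys = dense_combination(xs, [lambda t: t] * 3)
    fused_input = np.concatenate(ys, axis=0)  # what a fusion convolution would see
    print(fused_input.shape)
```

A 3×3 kernel with dilation 8 spans 17 pixels per axis but reads only 9 of the 289 positions in that window, which is the sparsity that the cumulative additions are meant to fill in.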

Different from previous generative inpainting networks [30, 26] that apply WGAN-GP [6] for adversarial training, we propose to use RaGAN [10] to pursue more photo-realistic generated images [25]. This discriminator also considers the consistency of global and local images.

3.2 Loss functions

Figure 4: Visualization of average VGG feature maps.

Self-guided regression loss

Here, we address the issue of preserving semantic structure. We adopt a self-guided regression constraint to correct the estimation at the image semantic level. Briefly, we compute the discrepancy map between the generated content and the corresponding ground truth to guide the similarity measure over the feature map hierarchy of the pre-trained VGG19 [23] network. We first investigate the characteristics of VGG feature maps. Given an input image , it is first fed forward through the VGG19 to yield a five-level feature map pyramid, whose spatial resolution decreases progressively. Specifically, the -th () level is set to the feature tensor produced by relu_1 layer of VGG19. These feature tensors are denoted by . We give an illustration of average feature maps in Figure 4, which suggests that the deeper layers of a pre-trained network represent higher-level semantic information, while lower-level features focus more on textural or structural details, such as edges, corners, and other simple conjunctions. In this paper, we intend to improve the detail fidelity of the completed image, especially for building and face images.

Figure 5: Visualization of guidance maps. (Left) Guidance map for “relu1_1” layer. (Right) Guidance map for “relu2_1” layer. These are corresponding to Figure 4.

To this end, we use the error map between the output image produced by the generator and the ground truth to obtain a guidance map that distinguishes challenging areas from manageable ones. We propose to use the following equation to obtain the average error map:


where are the three color channels, denotes -th channel of the output image. Then, the normalized guidance mask can be calculated by


where is the error map value at position . Note that our guidance mask has continuous values between and , i.e., it is soft instead of binary. corresponds -th level feature maps and it can be expressed by


where denotes average pooling with kernel size of and stride of . Here, (Equation 3). In this way, the value range of is still between and . Because lower-level feature maps contain more detailed information, we choose feature tensors from the “relu1_1” and “relu2_1” layers to describe image semantic structures. Thus, our self-guided regression loss is defined as


where is the activation map of the relu_1 layer given original input , is the number of elements in , is the element-wise product operator, and followed by [35]. Here, is the channel size of feature map .

An obvious benefit of this regularization is to suppress regions with higher uncertainty (as shown in Figure 5). The guidance map can be viewed as a spatial attention map, which preferentially optimizes areas that are difficult to handle. Our self-guided regression loss is applied in a lower-level semantic space instead of pixel space. The merit of this choice appears in perceptual image synthesis with pleasing structural information.
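
The pipeline described in this subsection (error map, normalized guidance map, pooled guidance per level, guided feature distance) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the error is taken as the squared difference averaged over the three color channels, the normalization is min-max, the pooling is 2×2 with stride 2, the feature distance is L1 divided by the number of feature elements, and the level weights are passed in by the caller. Feature tensors are plain arrays of shape (C, H, W) standing in for VGG19 activations, and all function names are hypothetical.

```python
import numpy as np


def guidance_map(output, target):
    """Min-max normalized per-pixel error in [0, 1]; inputs have shape (3, H, W)."""
    error = np.mean((output - target) ** 2, axis=0)  # average over color channels
    span = error.max() - error.min()
    if span == 0:
        return np.zeros_like(error)
    return (error - error.min()) / span


def avg_pool2(x):
    """2x2 average pooling with stride 2 on an (H, W) map with even H and W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def self_guided_loss(feats_out, feats_gt, guidance, weights):
    """Guidance-weighted L1 feature distance summed over levels.

    feats_out / feats_gt: lists of (C_l, H_l, W_l) arrays, finest level first,
    each level having half the spatial resolution of the previous one.
    guidance: (H_1, W_1) map matching the first level.
    """
    loss = 0.0
    for level, (f_out, f_gt, w) in enumerate(zip(feats_out, feats_gt, weights)):
        if level > 0:
            guidance = avg_pool2(guidance)  # match the coarser resolution
        # The (H, W) guidance map broadcasts over the channel axis.
        loss += w * np.abs(guidance * (f_gt - f_out)).sum() / f_gt.size
    return float(loss)
```

Because pooling averages values that already lie in [0, 1], the pooled guidance stays in the same range, so regions with large reconstruction error keep the largest weight at every level.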

Geometrical alignment constraint

In typical solutions, distances in the higher-level feature space are measured only with a pixel-based loss, e.g., L1 or L2, which does not take the alignment of the semantic hub of each high-level feature map into account. To better measure the distance between the high-level features of the prediction and those of the ground truth, we impose the geometrical alignment constraint on the response maps of the “relu4_1” layer. This term helps the generator create a plausible image that is aligned with the target image in position. Specifically, it encourages the output feature center to be spatially close to the target feature center. The geometrical center for the k-th feature map along axis is calculated as


where response maps . represents a spatial probability distribution function. denotes coordinate expectation along axis . Then, we pass both the completed image and ground-truth image through the VGG network and obtain the corresponding response maps and . Given these response maps, we compute the centers and using Equation 6. Then, we formulate the geometrical alignment constraint as


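As an illustration of the geometrical alignment idea, the sketch below treats each non-negative response map as a spatial probability distribution, takes the expected row and column coordinates as its center, and penalizes the distance between output and ground-truth centers. It is a sketch under assumptions: the squared Euclidean distance summed over feature maps, the small `eps` guarding against empty maps, and the names `feature_centers` and `alignment_loss` are not taken from the text.

```python
import numpy as np


def feature_centers(response, eps=1e-8):
    """Expected (row, col) coordinate of each map in a (K, H, W) tensor.

    The responses are assumed non-negative (e.g. ReLU outputs), so dividing
    by their sum yields a spatial probability distribution per map.
    """
    _, h, w = response.shape
    prob = response / (response.sum(axis=(1, 2), keepdims=True) + eps)
    rows = np.arange(h, dtype=float)
    cols = np.arange(w, dtype=float)
    center_row = (prob.sum(axis=2) * rows).sum(axis=1)  # marginal over columns
    center_col = (prob.sum(axis=1) * cols).sum(axis=1)  # marginal over rows
    return np.stack([center_row, center_col], axis=1)  # shape (K, 2)


def alignment_loss(resp_out, resp_gt):
    """Sum over maps of the squared distance between corresponding centers."""
    diff = feature_centers(resp_out) - feature_centers(resp_gt)
    return float((diff ** 2).sum())
```

A response that is identical in shape but shifted by two rows yields a loss of about 4 per feature map, whereas an element-wise L1 or L2 distance between the same two maps would not reflect how far the activation moved.
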
Feature matching losses

The VGG feature matching loss compares the activation maps in the intermediate layers of the pre-trained VGG19 [23] model, which can be written as


where is the number of elements in . We also introduce a discriminator feature matching loss on the local branch, under the reasonable assumption that the output images are consistent with the ground-truth images under any measurement (i.e., in any high-dimensional space). This feature matching loss is defined as


where is the activation in the -th selected layer of the discriminator given input (see Figure 3). Note that the hidden layers of the discriminator are trainable, which is slightly different from the VGG19 network pre-trained on the ImageNet dataset. They can adaptively update based on the specific training data. This complementary feature matching can dynamically extract features that may not be mined by the VGG model.
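
Both feature matching terms share the same shape: a weighted sum, over selected layers, of the distance between activations of the output and of the ground truth, normalized by the number of elements. The sketch below illustrates this with plain arrays. The L1 distance, the uniform default weights, and the name `feature_matching_loss` are assumptions, and the feature extractor itself (VGG19 or the local discriminator branch) is left out.

```python
import numpy as np


def feature_matching_loss(feats_out, feats_gt, weights=None):
    """Weighted sum over layers of the mean absolute activation difference.

    feats_out / feats_gt: lists of equally shaped arrays, one per selected layer.
    weights: optional per-layer weights (assumed uniform when omitted).
    """
    if weights is None:
        weights = [1.0] * len(feats_gt)
    total = 0.0
    for f_out, f_gt, w in zip(feats_out, feats_gt, weights):
        total += w * np.abs(f_out - f_gt).sum() / f_gt.size
    return float(total)
```

With a fixed extractor the targets are static, whereas with discriminator features the same function is evaluated on activations that change as the discriminator is updated.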

Adversarial loss

To improve the visual quality of inpainted results, we use the relativistic average discriminator [10] as in ESRGAN [25], a recent state-of-the-art perceptual image super-resolution algorithm. For the generator, the adversarial loss is defined as


where and indicates the discriminator network without the last sigmoid function. Here, real/fake data pairs are sampled from ground-truth and output images.
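
A minimal sketch of a relativistic average generator loss, in the form used by ESRGAN [25], is given below. Here `logits_real` and `logits_fake` stand for raw discriminator outputs (before the sigmoid) on a batch of ground-truth and generated images. The function names and the small `eps` added for numerical safety are assumptions, and the sketch does not cover how the global and local branches are merged.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def ragan_generator_loss(logits_real, logits_fake, eps=1e-12):
    """Relativistic average generator loss on raw discriminator outputs.

    Each sample is scored relative to the batch mean of the opposite class:
    the generator wants real samples to look less realistic than the average
    fake, and fake samples to look more realistic than the average real.
    """
    d_real = sigmoid(logits_real - logits_fake.mean())
    d_fake = sigmoid(logits_fake - logits_real.mean())
    loss = -np.mean(np.log(1.0 - d_real + eps)) - np.mean(np.log(d_fake + eps))
    return float(loss)
```

Unlike the standard generator loss, this form also involves the real samples, so the generator receives gradients from both real and generated data.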

Final objective

With self-guided regression loss, geometrical alignment constraint, VGG feature matching loss, discriminator feature matching loss, adversarial loss, and mean absolute error (MAE) loss, our overall loss function is defined as


where , , , and are used to balance the effects between the losses mentioned above.

4 Experiments

We evaluate the proposed inpainting model on Paris Street View [19], Places2 [34], CelebA-HQ [11], and a new challenging facial dataset FFHQ [12].

4.1 Experimental settings

For our experiments, we set , , and in Equation 11. We optimize the training procedure using the Adam optimizer [13] with and . We set the learning rate to . The batch size is . We implement our model in PyTorch and train it on an NVIDIA TITAN Xp GPU (12GB memory).

For training, given a raw image , a binary image mask (value for known pixels and denotes unknown ones) at a random position. In this way, the input image is obtained from the raw image as . Our inpainting generator takes as input, and produces prediction . The final output image is . All input and output are linearly scaled to . We train our network on the training set and evaluate it on the validation set (Places2, CelebA-HQ, and FFHQ) or testing set (Paris street view and CelebA). For training, we use images of resolution with the largest hole size as in [30, 26]. For Paris street view (), we randomly crop patches with resolution and then scale them down to for training. Similarly, for Places2 (), sub-images are cropped at a random location and scaled down to for our model. For CelebA-HQ and FFHQ face datasets (), images are directly scaled to . We use the irregular mask dataset provided by [17]. None of the results generated by our model are post-processed.
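
The data flow of this paragraph can be sketched as follows. This is an illustration under assumptions: the mask is taken to be 1 inside the hole and 0 elsewhere, images are scaled to [-1, 1] (consistent with the Tanh output of the generator), the known pixels are copied back into the final output, and all function names are hypothetical.

```python
import numpy as np


def to_signed_range(img_uint8):
    """Linearly map 8-bit pixel values from [0, 255] to [-1, 1]."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0


def make_network_input(image, mask):
    """Blank out the hole and append the mask as an extra channel.

    image: (3, H, W) array; mask: (1, H, W) array with 1 = missing (assumed).
    """
    masked = image * (1.0 - mask)
    return np.concatenate([masked, mask], axis=0)  # (4, H, W)


def composite(image, mask, prediction):
    """Keep known pixels from the image and take the hole from the prediction."""
    return image * (1.0 - mask) + prediction * mask


if __name__ == "__main__":
    image = to_signed_range(np.full((3, 8, 8), 255, dtype=np.uint8))
    mask = np.zeros((1, 8, 8), dtype=np.float32)
    mask[:, 2:6, 2:6] = 1.0  # square hole at an example position
    net_in = make_network_input(image, mask)
    fake_prediction = np.zeros_like(image)  # stand-in for the generator output
    print(net_in.shape, composite(image, mask, fake_prediction)[0, 3, 3])
```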

4.2 Qualitative comparisons


Figure 6: Visual comparisons on Paris street view. Columns, left to right: input image, CE [19], Shift-Net [28], GMCNN [26], PICNet [33], and DMFN (Ours).

Figure 7: Visual comparisons on Places2. Columns, left to right: input, GT, PICNet [33], and DMFN (Ours). Best viewed with zoom-in.

Figure 8: Visual results on FFHQ dataset. Columns, left to right: input, DMFN (Ours), and GT.

Figure 9: Inpainted images with irregular masks on Paris StreetView and CelebA-HQ. Columns, left to right: input, GT, PICNet [33], and DMFN (Ours). Best viewed with zoom-in.

As shown in Figures 6, 1, and 7, compared with other state-of-the-art methods, our model gives a noticeable visual improvement in textures and structures. For instance, our network generates plausible image structures in Figure 6, which mainly stems from the dense multi-scale fusion architecture and the well-designed losses. The realistic textures are hallucinated via feature matching and adversarial training. In Figure 1, our results show more realistic details and fewer artifacts than those of the compared approaches. Besides, we give partial results of our method and PICNet [33] on the Places2 dataset in Figure 7. The proposed DMFN creates more reasonable, natural, and photo-realistic images. Additionally, we show some example results (masks at random positions) of our model trained on FFHQ in Figure 8. In Figure 9, our method is more stable and produces finer results for large-area irregular masks than the compared algorithms. More results can be found in the supplementary material.

4.3 Quantitative comparisons

Method         Paris street view (100)    Places2 (100)              CelebA-HQ (2,000)          FFHQ (10,000)
CA [30]        N/A                        0.1524 / 21.32 / 0.8010    0.0724 / 24.13 / 0.8661    N/A
GMCNN [26]     0.1243 / 24.38 / 0.8444    0.1829 / 19.51 / 0.7817    0.0509 / 25.88 / 0.8879    N/A
PICNet [33]    0.1263 / 23.79 / 0.8314    0.1622 / 20.70 / 0.7931    N/A                        N/A
DMFN (Ours)    0.1018 / 25.00 / 0.8563    0.1361 / 21.53 / 0.8079    0.0460 / 26.50 / 0.8932    0.0457 / 26.49 / 0.8985
Table 1: Quantitative results (center regular mask) on four testing datasets. Each cell lists LPIPS / PSNR / SSIM.

Following [30, 26], we measure the quality of our results using peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). Learned perceptual image patch similarity (LPIPS) [32] is a newer metric that better evaluates the perceptual similarity between two images. Because the purpose of image inpainting is visual quality, we adopt LPIPS as the main assessment; lower LPIPS values are better. For Places2, 100 validation images from the “canyon” scene category are chosen for evaluation. As shown in Table 1, our method achieves the best results among CA [30], GMCNN [26], and PICNet [33] on all evaluation measurements.
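
For reference, PSNR follows directly from the mean squared error. The sketch below uses the standard definition for 8-bit images (peak value 255). The function name `psnr` and the flat-sequence input format are illustrative choices, and the exact evaluation protocol behind Table 1 (color space, region evaluated) is not specified here.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally long pixel sequences."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means a smaller pixel-wise error, which is why it is reported alongside LPIPS rather than in its place: the two metrics can disagree on blurry but numerically close completions.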

We also conducted user studies, as illustrated in Figure 10. The scheme is based on blind randomized A/B/C tests deployed on the Google Forms platform as in [26]. Each survey includes single-choice questions. Each question involves three options (completed images generated from the same corrupted input by three different methods). There are participants invited to complete this survey. They are asked to select the most realistic item in each question. The option order is shuffled each time. Our method outperforms the compared approaches by a large margin.

Figure 10: Results of user study.

4.4 Ablation study

Model     rate=2     rate=8     w/o combination    “w/o ”     DMFB
Params    803,392    803,392    361,024            361,024    471,808
LPIPS     0.1059     0.1067     0.1083             0.1026     0.1018
PSNR      24.93      24.91      24.24              24.93      25.00
SSIM      0.8530     0.8549     0.8505             0.8561     0.8563
Table 2: Quantitative results of different structures on Paris street view dataset.

Metric    w/o self-guided    w/o align    w/o dis_fm    with all
LPIPS     0.0537             0.0534       0.0542        0.0530
PSNR      25.73              25.63        25.65         25.83
SSIM      0.8892             0.8884       0.8870        0.8892
Table 3: Investigation of self-guided regression loss and geometrical alignment constraint.

Figure 11: Visual comparison of different structures. Columns, left to right: input, rate=2, rate=8, w/o combination, “w/o ”, and DMFB (Ours). Best viewed with zoom-in.

Figure 12: Visual comparison of results using different losses. Columns, left to right: input, w/o self-guided, w/o alignment, and with all. Best viewed with zoom-in.

Effectiveness of DMFB

To validate the representation ability of our DMFB, we replace its middle part (4 dilated convolutions and the combination operation) with a single dilated convolution (256 channels) with a dilation rate of 2 or 8 (“rate=2” or “rate=8”, see Table 2). Additionally, to verify the strength of in combination operation, we perform the DMFB without that denoted as “w/o ” in Table 2. Combining Table 2 and Figure 11, we can clearly see that our model with DMFB (Params: 471,808) predicts more reasonable images with fewer artifacts than ordinary dilated convolutions (Params: 803,392). Meanwhile, the results of “rate=2” and “rate=8” suggest the importance of spatial support as discussed in [8]. They also demonstrate that a large and dense receptive field is beneficial for completing images with large holes.
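
The parameter counts in Table 2 can be checked by hand. The sketch below counts weights and biases per convolution under assumed layer shapes: a 3×3 reduction from 256 to 64 channels, four 3×3 dilated branches and three 3×3 combination convolutions at 64 channels, and a 1×1 fusion over the 256 concatenated channels. These shapes are assumptions, chosen because they reproduce the 471,808 and 361,024 entries of Table 2; `conv_params` is a hypothetical helper.

```python
def conv_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Number of learnable parameters of a 2-D convolution layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


# Assumed DMFB layout (see lead-in); dilation does not change the count.
reduction = conv_params(256, 64, 3)        # first convolution
branches = 4 * conv_params(64, 64, 3)      # four dilated convolutions
combination = 3 * conv_params(64, 64, 3)   # convolutions after each addition
fusion = conv_params(256, 256, 1)          # fuses the concatenated features

dmfb_total = reduction + branches + combination + fusion
without_combination_convs = dmfb_total - combination

if __name__ == "__main__":
    print(dmfb_total)                 # 471808
    print(without_combination_convs)  # 361024
```

Under these assumptions the three combination convolutions account for 110,784 parameters, the difference between the last two columns of Table 2.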

Self-guided regression and geometrical alignment constraint

To investigate the effect of the proposed self-guided regression loss and geometrical alignment constraint, we train a complete DMFN on the CelebA-HQ dataset without the corresponding loss. As shown in Figure 12, the “w/o self-guided” model cannot restore some structural details and the “w/o alignment” model shows some misalignment in the yellow box, while the “with all” model (DMFN trained with all losses) mitigates these problems. We also give the quantitative results in Table 3, which validate the effectiveness of the proposed losses. More discussion of the loss functions is provided in the supplementary material.

5 Conclusion

In this paper, we proposed a dense multi-scale fusion network with a self-guided regression loss and a geometrical alignment constraint for fine-grained image inpainting, which substantially improves the quality of the produced images. Specifically, the dense multi-scale fusion block is developed to extract better features. With the assistance of the self-guided regression loss, the restoration of semantic structures becomes easier. Additionally, the geometrical alignment constraint is conducive to the coordinate registration between the generated image and the ground truth, which promotes the reasonableness of the inpainted results.


References

  1. C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro and J. Verdera (2001) Filling-in by joint interpolation of vector fields and gray levels. IEEE Transactions on Image Processing 10 (8), pp. 1200–1211.
  2. C. Barnes, E. Shechtman, A. Finkelstein and D. B. Goldman (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG) 28 (3), pp. 24:1–24:11.
  3. M. Bertalmio, G. Sapiro, V. Caselles and C. Ballester (2000) Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 417–424.
  4. A. A. Efros and W. T. Freeman (2001) Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 341–346.
  5. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, pp. 2672–2680.
  6. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin and A. C. Courville (2017) Improved training of wasserstein gans. In NeurIPS, pp. 5767–5777.
  7. Z. Hui, J. Li, X. Gao and X. Wang (2019) Progressive perception-oriented network for single image super-resolution. arXiv:1907.10399v1.
  8. S. Iizuka, E. Simo-Serra and H. Ishikawa (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (TOG) 36 (4), pp. 107:1–107:14.
  9. P. Isola, J. Zhu, T. Zhou and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, pp. 1125–1134.
  10. A. Jolicoeur-Martineau (2019) The relativistic discriminator: a key element missing from standard gan. In ICLR.
  11. T. Karras, T. Aila, S. Laine and J. Lehtinen (2018) Progressive growing of gans for improved quality, stability, and variation. In ICLR.
  12. T. Karras, S. Laine and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In CVPR, pp. 4401–4410.
  13. D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR.
  14. V. Kwatra, I. Essa, A. Bobick and N. Kwatra (2005) Texture optimization for example-based synthesis. ACM Transactions on Graphics (TOG) 24 (3), pp. 795–802.
  15. J. Li, F. He, L. Zhang, B. Du and D. Tao (2019) Progressive reconstruction of visual structure for image inpainting. In ICCV, pp. 5962–5971.
  16. Y. Li, S. Liu, J. Yang and M. Yang (2017) Generative face completion. In CVPR, pp. 3911–3919.
  17. G. Liu, F. A. Reda, K. J. Shih, T. Wang, A. Tao and B. Catanzaro (2018) Image inpainting for irregular holes using partial convolutions. In ECCV, pp. 85–100.
  18. K. Nazeri, E. Ng, T. Joseph, F. Z. Qureshi and M. Ebrahimi (2019) EdgeConnect: structure guided image inpainting using edge prediction. In ICCVW.
  19. D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, pp. 2536–2544.
  20. Y. Ren, X. Yu, R. Zhang, T. H. Li, S. Liu and G. Li (2019) StructureFlow: image inpainting via structure-aware appearance flow. In ICCV, pp. 181–190.
  21. M. Sagong, Y. Shin, S. Kim, S. Park and S. Ko (2019) PEPSI: fast image inpainting with parallel decoding network. In CVPR, pp. 11360–11368.
  22. D. Simakov, Y. Caspi, E. Shechtman and M. Irani (2008) Summarizing visual data using bidirectional similarity. In CVPR, pp. 1–8.
  23. K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
  24. Y. Song, C. Yang, Z. Lin, X. Liu, Q. Huang, H. Li and C.-C. Jay Kuo (2018) Contextual-based image inpainting: infer, match, and translate. In ECCV, pp. 3–19.
  25. X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao and C. C. Loy (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In ECCVW, pp. 63–79.
  26. Y. Wang, X. Tao, X. Qi, X. Shen and J. Jia (2018) Image inpainting via generative multi-column convolutional neural networks. In NeurIPS, pp. 331–340.
  27. W. Xiong, Z. Lin, J. Yang, X. Lu, C. Barnes and J. Luo (2019) Foreground-aware image inpainting. In CVPR, pp. 5840–5848.
  28. Z. Yan, X. Li, M. Li, W. Zuo and S. Shan (2018) Shift-net: image inpainting via deep feature rearrangement. In ECCV, pp. 1–17.
  29. C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang and H. Li (2017) High-resolution image inpainting using multi-scale neural patch synthesis. In CVPR, pp. 6721–6729.
  30. J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu and T. S. Huang (2018) Generative image inpainting with contextual attention. In CVPR, pp. 5505–5514.
  31. Y. Zeng, J. Fu, H. Chao and B. Guo (2019) Learning pyramid-context encoder network for high-quality image inpainting. In CVPR, pp. 1486–1494.
  32. R. Zhang, P. Isola, A. A. Efros, E. Shechtman and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pp. 586–595.
  33. C. Zheng, T. Cham and J. Cai (2019) Pluralistic image completion. In CVPR, pp. 1438–1447.
  34. B. Zhou, A. Lapedriza, A. Khosla, A. Oliva and A. Torralba (2018) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6), pp. 1452–1464.
  35. Y. Zhou, Z. Zhu, X. Bai, D. Lischinski, D. Cohen-Or and H. Huang (2018) Non-stationary texture synthesis by adversarial expansion. ACM Transactions on Graphics (TOG) 37 (4), pp. 49:1–49:13.