DeepDeblur: Fast one-step restoration of blurry face images
We propose a very fast and effective one-step restoration method for blurry face images. Over the last decades, many blind deblurring algorithms have been proposed to restore latent sharp images. However, these algorithms run slowly because they involve two steps: kernel estimation followed by non-blind deconvolution or latent image estimation. They also cannot handle small face images. Our proposed method restores sharp face images directly in one step using a Convolutional Neural Network. Unlike previous deep learning methods that can only handle a single blur kernel at a time, our network is trained on numerous randomly generated training sample pairs to deal with the variance caused by different blur kernels in practice. A smoothness regularization and a facial regularization are added to preserve facial identity information, which is the key to face image applications. Comprehensive experiments demonstrate that our proposed method can handle various blur kernels and achieves state-of-the-art results for restoring small blurry face images. Moreover, the proposed method significantly improves face recognition accuracy while running more than 100 times faster.
Single image deblurring is an ill-posed task normally treated as a blind deconvolution process in which both the sharp image and the blur kernel are unknown. Although much progress on blind deblurring has been made recently [11, 13, 25, 2], applications like face recognition or identification still encounter trouble when face images are blurred by camera shake or fast motion. Many state-of-the-art algorithms [25, 4, 1, 21, 13, 16] decompose deblurring into two stages: kernel estimation, and sharp latent image estimation or non-blind deconvolution. Methods in [13, 2, 14, 18] estimate latent sharp images and their corresponding blur kernels iteratively, using salient edges and strong priors for kernel estimation on natural images. Levin et al.  propose a simpler algorithm that uses a sparse derivative prior to avoid artifacts, building on Fergus et al. . Whyte et al.  go further to discuss non-uniform blur kernels across the image. Pan et al.  observe properties of text image priors and optimize deblurring accordingly. These two-step methods rely on blur kernel estimation and assume the kernels are error-free when performing non-blind deconvolution. However, Shan et al.  show that even small kernel errors or image noise can lead to significant artifacts. Existing methods can only handle large natural images; they fail to deblur face images, which are very small and contain few distinct sharp edges. A further common drawback of these methods is their slow running speed, which makes them impractical.
Deep learning has recently been widely used in image processing problems for its strong representation ability and fast running speed [23, 3, 6, 20]. However, training a deep neural network on image pairs for deblurring is very difficult because different blur kernels introduce large variance . Xu et al.  propose a non-blind deblurring method based on a Convolutional Neural Network using large 1D kernels, but it must be trained separately for each kernel. Sun et al.  train a classification CNN on 70 fixed blur kernels, so it cannot generalize. Son et al.  apply a Wiener filter for non-blind deconvolution and train a residual network to remove the artifacts and noise amplified by Wiener filtering. In , a neural network is proposed to predict the complex Fourier coefficients of a deconvolution filter applied to the input patch; because it needs many overlapping input patches, it cannot deblur small images. Schuler et al.  design an iterative model for blind deblurring, using an MLP structure for the feature extraction module of each iteration. This model still consists of kernel estimation and image estimation modules, and its performance is not comparable to the state of the art for large blur kernels.
In this paper, we propose a one-step restoration method for blurry face images using a CNN. A schematic of our method is shown in Figure 1. A multi-scale deep residual network is designed to handle various blur kernels and is trained on randomly generated sample pairs. Our restoration model improves the image quality as well as the facial identity of each face image by adding smoothness and facial regularizations. Compared with previous deblurring methods, the proposed algorithm is simple, with one direct step, and requires none of the priors or post-processing inevitably used in [13, 4, 23]. Moreover, experiments show that our method can restore small blurry face images where existing algorithms fail or collapse. Running a deep neural network is very fast on GPUs, and our model runs more than 100 times faster than previous deblurring methods. Meanwhile, it achieves comparable results in PSNR and in the verification accuracy of restored face images.
2 One-Step Blurry Face Image Restoration
Single image blurring is normally modeled as a convolution between an unknown latent sharp image and a blur kernel, with additive noise,

$$b = k \otimes x + n,$$

where $b$ is the blurred image, $x$ is the latent sharp image, $k$ is the blur kernel and $n$ is noise; $k$ is usually unknown in practice. Correspondingly, deblurring can be seen as solving a blind deconvolution problem, which is obviously ill-posed. Given a blurred image $b$, $x$ and $k$ are usually estimated by solving

$$(\hat{x}, \hat{k}) = \arg\min_{x,k} \| b - k \otimes x \|^2 + \rho(x) + \rho(k),$$

where the data term absorbs the noise $n$ and $\rho(\cdot)$ denotes prior (regularization) terms. Previous works usually involve two steps: blur kernel estimation based on priors, followed by non-blind deconvolution. These two-step methods rely strongly on the accuracy of the estimated kernel. The size of a blur kernel is small compared to the original sharp image, so a tiny variation or error is amplified during deconvolution and leads to artifacts. Besides, two-step methods always struggle with non-uniform blurring.
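The degradation model above is straightforward to simulate. A minimal sketch of synthesizing a blurry observation from a sharp single-channel image follows; the box kernel and noise level are illustrative choices, not values from the paper:

```python
import numpy as np
from scipy.signal import convolve2d

def blur_image(sharp, kernel, noise_std=0.01, seed=0):
    """Synthesize a blurry image b = k (*) x + n for one channel."""
    blurred = convolve2d(sharp, kernel, mode="same", boundary="symm")
    noise = np.random.default_rng(seed).normal(0.0, noise_std, blurred.shape)
    return blurred + noise

# Example: a 5x5 uniform (box) blur kernel, normalized to sum to 1
kernel = np.ones((5, 5)) / 25.0
sharp = np.random.default_rng(0).random((64, 64))
blurry = blur_image(sharp, kernel)
```

Note that real motion-blur kernels are far from uniform; section 3.1 describes how trajectory-shaped kernels are synthesized.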
We find that blurred images often cause failures of a face verification model (see section 4.5), so restoring a blurred face image plays an important role in the safety and security field. Although many methods address the general deblurring problem, whose targets have no specific content or objects, restoring sharp face images remains a huge challenge. Unlike general deblurring datasets, whose typical height/width runs from 1000 to 2000 pixels, face images are very small. In addition, sharp edges are absent in face images, which dramatically increases the difficulty of deblurring compared to natural images. However, face verification tasks have shown that human faces carry discriminative representation features, which can be useful in restoration.
Thus, we propose a one-step solution from blurred to restored face images using a Convolutional Neural Network. The basic training objective of the network is written as

$$\min_{\theta} \sum_{i} \| F(b_i; \theta) - x_i \|^2,$$

where the nonlinear operation $F$ with parameters $\theta$ denotes the neural network, $b_i$ is a blurred face image and $x_i$ its sharp ground truth. There is no explicit blur kernel estimation in our one-step method. Also, our CNN can automatically sort and combine the features it extracts from the original blurred face images.
In this way, our one-step CNN method overcomes the influence of kernel estimation errors, requires no blur kernel priors, and runs very fast at test time. The restoration network also benefits face verification, which is essential in applications.
3 CNN Based Restoration
Below, we describe our one-step restoring method, which is illustrated in Figure 1. We introduce novel regularizations into the model training process. Sample synthesis and the network architecture are also covered in this section.
3.1 Generating Blur Kernels for CNN Training
Although the deblurring CNN needs no blur kernel priors, the key to training it is sufficient training samples covering as many kinds of motion blur as possible. Blur degradation often comes from relative motion between the camera and objects in the scene. There are 6 degrees of freedom (3 translations and 3 rotations) in camera motion, and they are projected onto the 2D image plane . We therefore take the complexity of real blur into consideration and synthesize training kernels as realistically as possible. To avoid overfitting to a limited set of kernel types, kernels are synthesized as described in . It assumes that both coordinates of a 2D kernel trajectory follow a Gaussian process with a given covariance function,
with fixed hyperparameters. Here, we set the sampling margin and control the sampling length to obtain unlimited motion kernels. We also randomize the valid kernel size to simulate different real blur scales. Finally, kernels are normalized and scaled to one fixed size, with realistic shapes shown in Figure 2.
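The kernel synthesis described above can be sketched as follows, with each trajectory coordinate drawn from a Gaussian process with a squared-exponential covariance. The hyperparameters and the rasterization details are assumptions for illustration, not the paper's exact settings:

```python
import numpy as np

def sample_motion_kernel(size=15, n_points=100, length_scale=0.3, seed=None):
    """Sample a random motion-blur kernel: each 2D coordinate of the
    camera trajectory follows a Gaussian process with a squared-exponential
    (RBF) covariance. Hyperparameter values here are illustrative."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_points)
    d = t[:, None] - t[None, :]
    # RBF covariance over sampling times, with jitter for stability
    cov = np.exp(-0.5 * (d / length_scale) ** 2) + 1e-6 * np.eye(n_points)
    L = np.linalg.cholesky(cov)
    x = L @ rng.standard_normal(n_points)  # GP sample: x-coordinate
    y = L @ rng.standard_normal(n_points)  # GP sample: y-coordinate

    # Rasterize the centered trajectory into a size x size grid
    k = np.zeros((size, size))
    def to_idx(c):
        c = c - c.mean()
        c = (c / (np.abs(c).max() + 1e-8) + 1.0) / 2.0 * (size - 1)
        return np.clip(c, 0, size - 1).astype(int)
    for i, j in zip(to_idx(y), to_idx(x)):
        k[i, j] += 1.0
    return k / k.sum()  # normalize so the kernel sums to 1

kernel = sample_motion_kernel(size=15, seed=0)
```

Training pairs can then be produced online by convolving sharp face images with freshly sampled kernels, so no two iterations need to see the same blur.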
Our deblurring model is trained with pairs of a sharp face image and the corresponding blurry image generated online by the kernels described above. The variable kernel shapes ensure the generalization ability of our model and avoid overfitting to finite datasets. Our model breaks the limit of training separately for each kernel and performs blind deblurring.
3.2 Smoothness and Facial Regularizations
One major contribution of our method is restoring face images while preserving each individual's identity and structural information. To achieve this goal, multiple regularizations, or losses, are used for model training, as illustrated in Figure 1. First, we adopt an L2 regularization between the image restored by the network and its ground-truth sharp image,

$$\mathcal{L}_{2} = \frac{1}{w h c} \sum_{i=1}^{w} \sum_{j=1}^{h} \sum_{k=1}^{c} \big( F(b)_{i,j,k} - x_{i,j,k} \big)^{2},$$

where $w$ and $h$ are the width and height of an image, $c$ is the channel number, and $i$, $j$, $k$ denote indexes along these three dimensions. The L2 regularization computes the distance between two images in Euclidean space and is the most common loss function for training. It guarantees that the training process reaches rough convergence. However, the L2 regularization neglects texture details and information needed for recognition. There are also unrealistic noisy regions in the image generated by the deep neural network. The TV (total variation) loss in  is adopted to resolve this problem. The TV loss measures the smoothness of an image, which avoids abrupt changes and eliminates artifacts. It is written as

$$\mathcal{L}_{TV} = \sum_{i,j,k} \big( F(b)_{i+1,j,k} - F(b)_{i,j,k} \big)^{2} + \big( F(b)_{i,j+1,k} - F(b)_{i,j,k} \big)^{2}.$$
The TV loss can be seen as the sum of squared local gradients between neighboring pixels. It is also a special case of the "sparse prior" in . It encourages training to converge to smooth images instead of a poor local minimum of the L2 loss function.
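The two losses above can be sketched directly in NumPy for height x width x channel images; the averaging convention for the L2 term is an assumption:

```python
import numpy as np

def l2_loss(pred, gt):
    """Mean squared pixel difference over width, height and channels."""
    return np.mean((pred - gt) ** 2)

def tv_loss(img):
    """Total variation: sum of squared differences between neighboring
    pixels, encouraging smooth, artifact-free output."""
    dh = img[1:, :, :] - img[:-1, :, :]   # vertical differences
    dw = img[:, 1:, :] - img[:, :-1, :]   # horizontal differences
    return np.sum(dh ** 2) + np.sum(dw ** 2)
```

A perfectly constant image has zero TV loss, which is why the TV weight must be balanced against the other terms to avoid over-smoothing (see section 4.6).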
To keep structural information during restoration, we then propose a facial loss for identity preservation. The content loss in  and the facial semantic structure loss in  both use another deep network to extract representative features directly from image data. For better performance on verification tasks, we extract the fully connected layer features of our ResNet-based CNN model pre-trained for face recognition, instead of using a convolutional feature map of VGG. The last fully connected layer has far fewer parameters than the convolution layers, so it avoids introducing extra computation. Thus, the facial loss is defined as the L2 distance between the extracted features of the ground truth and the restored image:

$$\mathcal{L}_{face} = \big\| \phi(F(b)) - \phi(x) \big\|^{2},$$
where $\phi$ denotes the final fully connected layer of the face recognition model. The total loss function for training is then formulated as a weighted combination of the L2, TV and facial losses:

$$\mathcal{L} = \mathcal{L}_{2} + \lambda_{TV} \mathcal{L}_{TV} + \lambda_{face} \mathcal{L}_{face},$$

where $\lambda_{TV}$ and $\lambda_{face}$ are balancing factors, with the L2 loss weight fixed to 1. The total loss regulates training into convergence while maintaining facial information and reducing noise during restoration. More training details and analyses are given in sections 4.1 and 4.6.
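The weighted combination can be sketched as follows; `embed_fn` stands in for the last fully connected layer of the pre-trained face recognition network (an assumed interface), and the balancing factors are placeholder values, not the tuned weights:

```python
import numpy as np

def total_loss(pred, gt, embed_fn, lam_tv=1e-4, lam_face=1e-2):
    """Weighted combination of L2, TV and facial losses with the L2
    weight fixed to 1. The lambda values are illustrative defaults."""
    l2 = np.mean((pred - gt) ** 2)
    tv = (np.sum((pred[1:] - pred[:-1]) ** 2)
          + np.sum((pred[:, 1:] - pred[:, :-1]) ** 2))
    face = np.sum((embed_fn(pred) - embed_fn(gt)) ** 2)
    return l2 + lam_tv * tv + lam_face * face

# Dummy embedding (per-channel mean) standing in for the FC-layer features
embed = lambda im: im.mean(axis=(0, 1))
loss = total_loss(np.ones((4, 4, 3)), np.zeros((4, 4, 3)), embed)
```

In practice the balancing weights are adjusted during training, as discussed in sections 4.1 and 4.6.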
3.3 CNN Architecture
The restoring network in Figure 1 is designed as a ResNet  structure with Inception modules. To handle unknown blurry images under all different blur kernels, multi-scale convolution is adopted to extract features covering both blurred edge details and long-distance movements. Each inception block in Figure 3 has five convolution kernel scales, from size 1 to size 14. The larger kernels strengthen line-like responses at larger scales, while the smaller kernels extract texture details. As shown in Figure 4, the smaller kernels focus more on texture features, and the responses of the larger kernels are more abstract. Also, the responses of a blurry image under multi-scale convolutions show shapes similar to the blur kernel. This indicates that our network can distinguish the implicit blur kernel shape in a single blurry image. This representation ability benefits restoration through selecting and non-linearly combining feature maps.
The multi-scale structure enlarges the network's width, which normally improves performance. However, computational complexity rises rapidly as depth and width increase. In every inception block we therefore use a depth-wise convolution to halve the channel dimension before each expensive standard convolution layer. The outputs of the different convolution sizes are then concatenated along the channel dimension and reduced again. The resulting improvement is analysed in section 4.6.
As shown in Figure 3, the network has six Inception modules, each with its own multi-scale convolution layers. Each module's input is added to its output, and the modules are stacked upon each other. The first convolution layer, outside the Inception modules, outputs 64 channels. The final convolution layer reduces the output to three channels and produces the restoration result. We use the leaky ReLU in  as the nonlinear activation function following the convolution layers.
Thus, our architecture with multi-scale convolutions enables the restoring model to deal with complicated inputs and enhances its adaptability to various blur scales. Meanwhile, the channel reduction avoids explosive growth in model parameters. This design keeps the training process controllable and reduces computational complexity.
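The block structure described in this section can be sketched in plain NumPy with randomly initialized weights. Only the size-1 and size-14 branches, the halving reduction, and the residual add are stated in the text; the intermediate branch sizes and the leaky-ReLU slope are assumptions:

```python
import numpy as np
from scipy.signal import convolve2d

def conv2d(x, w):
    """Naive multi-channel 2D convolution.
    x: (H, W, Cin), w: (kh, kw, Cin, Cout)."""
    H, W, Cin = x.shape
    Cout = w.shape[3]
    out = np.zeros((H, W, Cout))
    for o in range(Cout):
        for i in range(Cin):
            out[:, :, o] += convolve2d(x[:, :, i], w[:, :, i, o],
                                       mode="same", boundary="symm")
    return out

def inception_block(x, scales=(1, 3, 5, 7, 14), rng=None):
    """One multi-scale residual block: a channel reduction halves the
    width before each branch, branches of several kernel sizes run in
    parallel, outputs are concatenated, reduced back to the input width,
    passed through a leaky ReLU, and added to the input. Weights are
    random placeholders for illustration."""
    if rng is None:
        rng = np.random.default_rng(0)
    C = x.shape[-1]
    branches = []
    for k in scales:
        reduced = conv2d(x, 0.01 * rng.standard_normal((1, 1, C, C // 2)))
        branches.append(conv2d(reduced,
                               0.01 * rng.standard_normal((k, k, C // 2, C // 2))))
    merged = np.concatenate(branches, axis=-1)
    out = conv2d(merged, 0.01 * rng.standard_normal((1, 1, merged.shape[-1], C)))
    out = np.maximum(out, 0.2 * out)  # leaky ReLU (slope is illustrative)
    return out + x                    # residual connection
```

Stacking six such blocks between the 64-channel entry convolution and the 3-channel output convolution reproduces the overall shape of the network in Figure 3.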
4 Experiments

4.1 Training

We use the CASIA-WebFace database  as the training set. It contains 494,414 images of 10,575 different subjects. All images are aligned and cropped by landmarks. The network is trained with pairs of a sharp image and its blurry counterpart synthesized online with the kernels described in section 3.1. The valid kernel size is chosen randomly and the mass center of each kernel is shifted to the middle.
We adopt the RMSProp optimizer and apply exponential decay to the learning rate. At the beginning, the learning rate is set to 0.001 and it decays as training steps increase. We keep the learning rate small and train for a long time to prevent divergence and gradient explosion. When the loss terms stop decreasing, the learning rate is cut in half. The balancing weights between the TV and facial loss terms are adjusted manually during different stages of training. Training is stopped when the restored latent images look as real and sharp as those in .
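The learning-rate schedule can be sketched as a simple function of the training step. Only the 0.001 starting rate and the halve-on-plateau rule come from the text; the exponential decay constants are illustrative assumptions:

```python
def learning_rate(step, base_lr=1e-3, decay_rate=0.96, decay_steps=10000,
                  n_halvings=0):
    """Exponentially decayed learning rate, additionally halved each
    time the loss plateaus (n_halvings counts the manual halvings).
    decay_rate and decay_steps are placeholder values."""
    return base_lr * decay_rate ** (step / decay_steps) * 0.5 ** n_halvings
```

The halvings are applied by the training loop whenever the monitored loss terms stop decreasing, mirroring the manual schedule described above.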
4.2 Implementation Details
We implement our method and perform experiments with various images and blur kernels on an Intel Xeon E5 CPU and an NVIDIA GeForce GTX 1080 GPU. The running time and restored results of our method are compared with several other methods [10, 13, 18, 19, 16, 1], whose MATLAB implementations are published online by their authors. Our model is trained and tested with TensorFlow. Since our method estimates sharp images from blurry inputs in one step, the test stage is a simple network forward pass.
Because our method aims at restoring blurry face images, evaluation experiments are performed on the widely used benchmark face datasets LFW  and FaceScrub . All test images are aligned and cropped in the same way as in training. Then, those images are convolved with the eight blur kernels of different sizes from , shown in the first row of Figure 6. We also test our method on a linear motion kernel. All results deblurred by the different methods are saved for quantitative and qualitative comparison.
4.3 Running Time Comparison
Table 1 shows the average time for restoring a single face image. The compared methods are run with their default settings and parameters. Methods that use EPLL as the non-blind deconvolution step [25, 1, 16] are implemented with the same EPLL code package. Although training a deep neural network takes a large amount of time, inference time at test is independent of training; back-propagation of the loss function during training does not affect inference time. Without iterations between the latent image and the estimated blur kernel, our method runs very fast at the test stage using a GPU. As shown in Table 1, it outperforms the fastest competing algorithm by more than 100 times.
|Method|Sun et al. |Chakrabarti |Krishnan et al. |Levin et al. |Michaeli et al. |Pan et al. |Perrone et al. |ours|
|Method|kernel 1|kernel 2|kernel 3|kernel 4|kernel 5|kernel 6|kernel 7|kernel 8|mean|
|Krishnan et al.|92%|88%|92%|82%|92%|86%|96%|78%|88.25%|
|Pan et al.|86%|80%|84%|64%|84%|66%|72%|72%|76%|
|Levin et al.|92%|80%|96%|48%|98%|74%|70%|62%|77.5%|
|Perrone et al.|46%|52%|46%|48%|62%|48%|60%|66%|53.5%|
|Sun et al.|86%|88%|94%|72%|90%|80%|86%|78%|84.25%|
4.4 Qualitative Results
We test the five methods in Table 1 that can produce meaningful deblurred face images on the eight types of blurry test data and compare them with ours. A restored example is shown in Figure 5. The blurry input in subfigure 5(a) is synthesized from the original image with the kernel at the bottom right. Subfigures 5(b) and 5(c) are in grayscale because Levin et al.  and Sun et al.  can only handle one channel of an RGB image. Except for Krishnan et al.  in 5(f), which sharpens edges in the blurry input to some extent, the other methods all fail to restore a distinguishable human face due to the severe blur and small input size. Although the kernel scale is relatively large compared to the image size, our method still produces a fairly good result with a distinct face.
Figure 6 illustrates that our model can handle blur kernels of different sizes and shapes. The results in the third row show that our model generalizes, since none of the test blur kernels appeared in training. Compared to the blurry degraded images in the second row of Figure 6, our one-step restoring method improves the visual quality of face images significantly.
4.5 Quantitative Results
Our method is evaluated in terms of recognition rate and PSNR on the face datasets with the benchmark blur kernels in . Because the traditional methods consume too much time (Table 1), we randomly choose 50 pairs from the 3000 test pairs of LFW and compute quantitative results on this smaller set. The performance comparison is illustrated in Figure 7. Krishnan et al.  and Levin et al.  show better results than Sun et al. , Perrone et al.  and Pan et al. . The performance of our method drops slightly on large blur kernels, but it still exceeds the others by a large margin.
Restored face images should maintain the identity information that is useful for face recognition. For the chosen test pairs of LFW, one image of each pair is kept original while the other is restored from its blurry version. The blurry images are also synthesized with the eight kernels in . We apply a face verification model trained on MS-Celeb-1M  to test the verification accuracy on the face image pairs restored by the different methods. Table 2 shows the verification accuracy of all restoration methods. By reconstructing facial structure and information, our model achieves the best accuracy. Thanks to its very fast running speed, we also test our model on all images of LFW and show the results in Table 3. They show that blurry images hurt verification accuracy badly, especially when the blur scale is large, and that our method largely eliminates this influence.
Tables 2 and 3 show that large-scale blur can sharply reduce the performance of the verification model, while small blurs have a much slighter influence. Our restoring method benefits face verification no matter how the blur situation changes.
The PSNR of linear-motion deblurred results is also reported in Table 4. The linear kernel has a length of 15 pixels and a fixed angle. We compare ours with the best of the other methods  and compute PSNR. Although linear motion kernels never appeared in our training, our model still produces plausible results.
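For reference, a standard PSNR implementation as used for the quantitative comparisons; the 255 peak value assumes 8-bit images:

```python
import numpy as np

def psnr(pred, gt, max_val=255.0):
    """Peak signal-to-noise ratio in dB between a restored image and
    its ground truth. Higher is better; identical images give infinity."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

For images normalized to [0, 1], `max_val` should be set to 1.0 instead.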
We also evaluate our method on FaceScrub and compare it with [13, 10], as shown in Figure 8. Our method obtains better restoration performance across different face images and various blur kernels. These comprehensive experiments demonstrate that our method improves image quality across face datasets and blur kernels.
4.6 Analysis on Network and Regularization
In each inception module, depth-wise convolution is adopted to reduce the depth dimension, which saves a large fraction of the model size. For comparison, a new model is built without this convolution reduction, with the subsequent kernels modified accordingly. The modified network is trained for the same number of iterations and tested on the LFW database. During training we found that the modified model converges more slowly than the original. Table 5 shows that the depth-wise convolution layers also enable the deep network to achieve better performance, besides reducing computational complexity.
The regularization terms in Eq. 8 help the network avoid local minima and overfitting. The ratio between the facial loss and the TV loss determines the quality of restored images. Figure 9 shows results restored by models trained with different regularization weights. When the TV loss contributes much less than the facial loss, there are noisy artifacts in the results. On the other hand, increasing the TV weight to 5 suppresses texture features and leads to over-smoothing. Balancing the regularization terms during training plays an important role in the implementation.
|without depth-wise conv|23.05|88.5|
5 Conclusion

In this paper, a very fast and effective one-step restoring method for blurry face images is proposed. We design a CNN that handles various blur kernels without any priors or kernel estimation. The proposed method breaks the limit of deep learning methods that must be trained for specific blurs. Although face images lack salient edges and are relatively small, our restoration model improves image quality significantly and keeps facial identity by adding smoothness and facial regularizations. Our model runs at high speed, which makes it suitable for recognition applications. Extending to real blurry images will be interesting future work.
-  A. Chakrabarti. A neural approach to blind motion deblurring. pages 221–235, 2016.
-  S. Cho and S. Lee. Fast motion deblurring. In Acm Siggraph Asia, pages 1–8, 2009.
-  C. Dong, C. L. Chen, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. 8692:184–199, 2014.
-  R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman. Removing camera shake from a single photograph. In Acm Siggraph, pages 787–794, 2006.
-  Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large scale face recognition. In European Conference on Computer Vision. Springer, 2016.
-  S. Harmeling. Image denoising: Can plain neural networks compete with bm3d? In Computer Vision and Pattern Recognition, pages 2392–2399, 2012.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. pages 770–778, 2015.
-  G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
-  P. Kaur, H. Zhang, and K. J. Dana. Photo-realistic facial texture transfer. 2017.
-  D. Krishnan, T. Tay, and R. Fergus. Blind deconvolution using a normalized sparsity measure. In Computer Vision and Pattern Recognition, pages 233–240, 2011.
-  W. S. Lai, J. B. Huang, Z. Hu, N. Ahuja, and M. H. Yang. A comparative study for single image blind deblurring. In Computer Vision and Pattern Recognition, pages 1701–1709, 2016.
-  C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, and Z. Wang. Photo-realistic single image super-resolution using a generative adversarial network. 2016.
-  A. Levin, Y. Weiss, F. Durand, and W. T. Freeman. Efficient marginal likelihood optimization in blind deconvolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2657–2664, 2011.
-  T. H. Li and K. S. Lii. A joint estimation approach for two-tone image deblurring by blind deconvolution. IEEE Transactions on Image Processing A Publication of the IEEE Signal Processing Society, 11(8):847–58, 2002.
-  A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013.
-  T. Michaeli and M. Irani. Blind deblurring using internal patch recurrence. 8691:783–798, 2014.
-  H. W. Ng and S. Winkler. A data-driven approach to cleaning large face datasets. In IEEE International Conference on Image Processing, pages 343–347, 2015.
-  J. Pan, Z. Hu, Z. Su, and M. H. Yang. Deblurring text images via l0-regularized intensity and gradient prior. In Computer Vision and Pattern Recognition, pages 2901–2908, 2014.
-  D. Perrone and P. Favaro. Total variation blind deconvolution: The devil is in the details. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2909–2916, 2014.
-  C. J. Schuler, M. Hirsch, S. Harmeling, and B. Schölkopf. Learning to deblur. IEEE Transactions on Pattern Analysis & Machine Intelligence, 38(7):1439–1451, 2016.
-  Q. Shan, J. Jia, and A. Agarwala. High-quality motion deblurring from a single image. Acm Transactions on Graphics, 27(3):1–10, 2008.
-  A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. 2016.
-  H. Son and S. Lee. Fast non-blind deconvolution via regularized residual networks with long/short skip-connections. In IEEE International Conference on Computational Photography, pages 1–10, 2017.
-  J. Sun, W. Cao, Z. Xu, and J. Ponce. Learning a convolutional neural network for non-uniform motion blur removal. In Computer Vision and Pattern Recognition, pages 769–777, 2015.
-  L. Sun, S. Cho, J. Wang, and J. Hays. Edge-based blur kernel estimation using patch priors. In Proc. IEEE International Conference on Computational Photography, 2013.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. pages 1–9, 2014.
-  O. Whyte, J. Sivic, A. Zisserman, and J. Ponce. Non-uniform deblurring for shaken images. International Journal of Computer Vision, 98(2):168–186, 2012.
-  L. Xu, J. S. J. Ren, C. Liu, and J. Jia. Deep convolutional neural network for image deconvolution. In International Conference on Neural Information Processing Systems, pages 1790–1798, 2014.
-  D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. Computer Science, 2014.
Appendix A More Experiments Results
In this section, more detailed qualitative results are shown to evaluate the effectiveness of the proposed method. A comparison of restored results on LFW between six different methods, including ours, is shown in Figure 10. Figure 11 illustrates restored images on FaceScrub as well. The proposed method can handle various blur kernels and adapts to different face images.