Maximum Entropy on the Mean: a Paradigm Shift for Regularization in Image Deblurring
Abstract.
Image deblurring is a notoriously challenging ill-posed inverse problem. In recent years, a wide variety of approaches have been proposed based upon regularization at the level of the image or on techniques from machine learning. We propose an alternative approach, shifting the paradigm towards regularization at the level of the probability distribution on the space of images. Our method is based upon the idea of maximum entropy on the mean, wherein we work at the level of the probability density function of the image, whose expectation is our estimate of the ground truth. Using techniques from convex analysis and probability theory, we show that the method is computationally feasible and amenable to very large blurs. Moreover, when images are embedded with symbology (a known pattern), we show how our method can be applied to approximate the unknown blur kernel with remarkable effects. While our method is stable with respect to small amounts of noise, it does not actively denoise; however, for moderate to large amounts of noise, it performs well when preconditioned with a state-of-the-art denoising method.
1. Introduction
Ill-posed inverse problems permeate the fields of image processing and machine learning. Prototypical examples stem from the non-blind (deconvolution) and blind deblurring of digital images. The vast majority of methods for image deblurring are based on some notion of regularization (e.g. gradient-based) at the image level. Motivated by our previous work [19] for barcodes, we address general image deblurring at the level of the probability density function of the ground truth. Using the Kullback-Leibler divergence as our regularizer, we present a novel method for both deconvolution and kernel (point spread function) estimation via the expectation of the probability density with maximum entropy. This higher-level approach is known in information theory as maximum entropy on the mean and dates back to E.T. Jaynes in 1957 [8, 9]. Our approach is made computationally tractable as a result of two observations:

Fenchel-Rockafellar duality transforms our infinite-dimensional primal problem into a finite-dimensional dual problem;

The sought expectation of the optimal probability distribution can be written simply in terms of known moment-generating functions and the optimizer of the dual problem.
What is particularly remarkable about our higher-level method is that it effectively restores images that have been subjected to significantly greater levels of blur than previously considered in the literature. While the method is stable with respect to small amounts of noise, it does not actively denoise; however, for moderate to large amounts of noise, it can readily be preconditioned by first applying expected patch log likelihood (EPLL) denoising [28].
We test and compare our method on a variety of examples. To start, we present an example of simple deconvolution (without additive noise) but for a range of blurs for which previous methods do not even converge (cf. Fig. 1). Then we consider deconvolution with small to significant additive noise. We show that we can precondition with EPLL denoising to attain deconvolution results comparable with the state of the art (cf. Fig. 3). We then address blind deblurring with the inclusion of a known shape (analogous to a finder pattern in a QR barcode [19]). In these cases, by preconditioning with EPLL denoising, we can blindly deblur images subjected to large blurs (cf. Figs. 4, 5, 6). Given that our method relies on symbology, comparison with other methods is unfair (in our favour). However, we do provide a comparison with the state-of-the-art method of Pan et al. [16, 15] to demonstrate the power of our method in exploiting the symbology (finder pattern) to accurately recover the blur (point spread function).
Overall, we introduce a novel regularization method which is theoretically well-founded, numerically tractable, and amenable to substantial generalization. While we have directly motivated and applied our higher-level regularization approach to image deblurring, we anticipate that it will also prove useful in solving other ill-posed inverse problems in computer vision, pattern recognition and machine learning.
2. Current Methods
The process of capturing one channel of a blurred image from a ground truth channel is modelled throughout by the relation $b = k \ast x$, where $\ast$ denotes the 2-dimensional convolution between the kernel $k$ (the point spread function) and the ground truth $x$; this model represents spatially invariant blurring. For images composed of more than one channel, blurring is assumed to act on a per-channel basis. We therefore derive a method to deblur one channel and apply it to each channel separately.
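For concreteness, this forward model can be sketched in a few lines of Python (the language of our implementation; see Implementation Details). The sketch below applies a single kernel to each channel via the FFT under circular boundary conditions; the function names are illustrative rather than taken from our code.

```python
# A minimal sketch of the forward model b = k * x (spatially invariant blur),
# applied per channel. The kernel is zero-padded to the image size and shifted
# so that its centre sits at the origin (circular convolution via the FFT).
import numpy as np

def blur_channel(x, k):
    """Convolve one channel x (H x W) with kernel k using the FFT."""
    H, W = x.shape
    kh, kw = k.shape
    k_pad = np.zeros((H, W))
    k_pad[:kh, :kw] = k
    k_pad = np.roll(k_pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k_pad)))

def blur_image(img, k):
    """Apply the same kernel to each channel of an H x W x C image."""
    return np.dstack([blur_channel(img[..., c], k) for c in range(img.shape[-1])])
```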
Current blind deblurring methods consist of solving
$$\min_{k,\,x}\ \|k \ast x - b\|_2^2 + R(k, x), \tag{1}$$
where $R$ serves as a regularizer which permits the imposition of certain constraints on the optimizers. This idea of regularization to solve ill-posed inverse problems dates back to Tikhonov [26]. Approaches that are not based on machine learning differ mostly in the choice of regularizer; examples include $\ell_0$ regularization, which penalizes the presence of nonzero pixels in the image or its gradient [15]; weighted nuclear norm regularization, which ensures that the image or gradient matrices have low rank [18]; and $\ell_0$ regularization of the dark channel, which promotes sparsity of a channel consisting of local minima in the intensity channel [16]. As for machine learning methods, approaches include modelling the optimization problem as a deep neural network [24] and estimating the ground truth image from a blurred input, without estimating the kernel, using convolutional neural networks [13, 14, 25] or generative adversarial networks [11, 17].
The results achieved in these papers are comparable to the state of the art. However, to our knowledge, such methods have not been successfully applied to the large blurring regimes considered in this paper. Our method presents a paradigm shift in image deblurring by optimizing at the level of the set of probability densities rather than the set of images.
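For orientation, the simplest instance of Eq. 1 is the non-blind case with the Tikhonov regularizer $R(x) = \lambda\|x\|_2^2$, which admits a closed-form Fourier-domain solution (a Wiener-type filter). The sketch below illustrates this classical baseline; it is not the method proposed in this paper.

```python
# Tikhonov-regularized deconvolution: argmin_x ||k * x - b||^2 + lam ||x||^2,
# solved in closed form in the Fourier domain (circular boundary conditions).
# A classical baseline sketch for comparison, with illustrative names.
import numpy as np

def tikhonov_deconv(b, k, lam=1e-3):
    H, W = b.shape
    kh, kw = k.shape
    k_pad = np.zeros((H, W))
    k_pad[:kh, :kw] = k
    k_pad = np.roll(k_pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    K = np.fft.fft2(k_pad)
    X = np.conj(K) * np.fft.fft2(b) / (np.abs(K) ** 2 + lam)
    return np.real(np.fft.ifft2(X))
```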
3. Our Method
3.1. Kullback-Leibler Regularized Deconvolution
We first describe the method for the case of deconvolution and begin by establishing some notation and definitions. The convolution operator will be denoted by the matrix $C \in \mathbb{R}^{d \times d}$ acting on a vectorized image $x \in \mathbb{R}^d$ (for an image with $d$ pixels) and resulting in a vectorized blurred image $b = Cx$, for which the $i$-th coordinate corresponds to the $i$-th pixel of the image.
The expectation of a random vector $X$ on $\mathbb{R}^d$ associated with a probability measure $\mu \in \mathcal{P}(\mathbb{R}^d)$, the set of probability measures on $\mathbb{R}^d$, is the vector $\mathbb{E}_\mu[X]$ whose $i$-th component is given by $\int_{\mathbb{R}^d} x_i \, d\mu(x)$.
Letting $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$, we write $\mu \ll \nu$ to denote that $\mu$ is absolutely continuous with respect to $\nu$, i.e. if $A$ is measurable and such that $\nu(A) = 0$, then $\mu(A) = 0$. Notably, if $\nu$ is in turn absolutely continuous with respect to the Lebesgue measure, then so is $\mu$, and we write $d\mu = f\,dx$ and $d\nu = g\,dx$ for the corresponding densities. In this context, we denote by $\mathcal{K}(\mu, \nu)$ the Kullback-Leibler divergence between $\mu$ and $\nu$, i.e.
$$\mathcal{K}(\mu, \nu) = \begin{cases} \displaystyle\int_{\mathbb{R}^d} f \log\Big(\frac{f}{g}\Big)\, dx & \text{if } \mu \ll \nu, \\ +\infty & \text{otherwise}. \end{cases} \tag{2}$$
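For intuition, Eq. 2 takes an elementary form for discrete distributions. The snippet below is an illustrative sketch (names are ours), with the usual conventions $0 \log 0 = 0$ and $\mathcal{K}(\mu, \nu) = +\infty$ when $\mu \not\ll \nu$.

```python
# Discrete illustration of Eq. (2): KL divergence between probability vectors.
import numpy as np

def kl_divergence(mu, nu):
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    if np.any((nu == 0) & (mu > 0)):   # mu not absolutely continuous w.r.t. nu
        return np.inf
    mask = mu > 0                      # convention: 0 * log 0 = 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / nu[mask])))

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # > 0
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # = 0
```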
Given the blurred image $b \in \mathbb{R}^d$, a fidelity parameter $\alpha > 0$, and a prior $\rho \in \mathcal{P}(\Omega)$ with $\Omega \subset \mathbb{R}^d$ compact, we wish to determine the optimizer of
$$\min_{\mu \in \mathcal{P}(\Omega)}\ \frac{\alpha}{2}\,\|C\,\mathbb{E}_\mu[X] - b\|_2^2 + \mathcal{K}(\mu, \rho). \tag{3}$$
In this framework, the expectation of this optimizer is taken to be the estimate of the ground truth. Throughout, $\rho$ will be chosen to admit a density $g$ with respect to the Lebesgue measure and $\mu$ to admit a density $f$.
In its current form, Eq. 3 is an infinite-dimensional optimization problem with no obvious solution and is thus intractable. In the sequel, a corresponding finite-dimensional dual problem will be established which, along with a method to recover the expectation of solutions of Eq. 3 from solutions of this dual problem, permits a relatively fast and remarkably accurate estimation of the original image.
3.2. Dual Problem
In order to derive the (Fenchel-Rockafellar) dual problem to Eq. 3, we provide the reader with the Fenchel-Rockafellar duality theorem in a form expedient for our study, cf. e.g. [27, Cor. 2.8.5].
Theorem 1 (Fenchel-Rockafellar Duality Theorem).
Let $X$ and $Y$ be locally convex spaces and let $X^*$ and $Y^*$ denote their dual spaces. Moreover, let $f : X \to \mathbb{R} \cup \{+\infty\}$ and $g : Y \to \mathbb{R} \cup \{+\infty\}$ be convex, lower semicontinuous and proper, and let $A : X \to Y$ be a continuous linear operator. If there exists $x \in \operatorname{dom} f$ such that $g$ is continuous at $Ax$, then
$$\min_{x \in X}\ f(x) + g(Ax) \;=\; \max_{y^* \in Y^*}\ -f^*(A^* y^*) - g^*(-y^*), \tag{4}$$
with $A^*$ denoting the adjoint of $A$ and $f^*$ denoting the convex conjugate of $f$ (cf. [27, Eq. 2.30]). Moreover, if $\bar{y}^*$ is optimal in the maximization problem, then any
$$\bar{x} \in \partial f^*(A^* \bar{y}^*)$$
is optimal in the minimization problem.
Proof.
For the general statement and proof thereof, see [27, Cor. 2.8.5]. The theorem has been particularized to our application by using the following properties:

If $f$ is convex, proper and lower semicontinuous, then $x \in \partial f^*(x^*)$ if and only if $x^* \in \partial f(x)$ [27, Thm. 2.4.4 (iv)].

If $f$ is convex and proper, then $\bar{x}$ minimizes $f$ if and only if $0 \in \partial f(\bar{x})$ [27, Thm. 2.5.7].
∎
The equation in the last line of the theorem is known as the primal-dual recovery formula.
A particularly useful case of this theorem is when $A$ is an operator between an infinite-dimensional locally convex space and $\mathbb{R}^d$, as the dual problem will then be a finite-dimensional maximization problem. Moreover, the primal-dual recovery is easy if $f^*$ is differentiable, in which case the subdifferential equals the gradient. Some points must be clarified before we can apply this theorem.
First, $\mathcal{P}(\Omega)$ is not a locally convex space, but it is a subset of the Banach space of regular complex Borel measures $\mathcal{M}(\Omega)$. The Riesz representation theorem [23, Thm. 6.19] states that $\mathcal{M}(\Omega)$ is the dual space of the space $C(\Omega)$ of continuous functions on $\Omega$. Furthermore, the corresponding duality pairing is given by $\langle h, \mu \rangle = \int_\Omega h \, d\mu$.
The constraint that $\mu \in \mathcal{P}(\Omega)$ in Eq. 3 can therefore be enforced by optimizing over $\mathcal{M}(\Omega)$ and adding the following indicator function
$$\iota_{\mathcal{P}(\Omega)}(\mu) = \begin{cases} 0 & \text{if } \mu \in \mathcal{P}(\Omega), \\ +\infty & \text{otherwise}, \end{cases} \tag{5}$$
to the objective function.
In the spirit of Thm. 1, we define
$$f(\mu) = \mathcal{K}(\mu, \rho) + \iota_{\mathcal{P}(\Omega)}(\mu), \qquad g(y) = \frac{\alpha}{2}\,\|y - b\|_2^2. \tag{6}$$
The convex conjugates of these functions are provided in the following two lemmas.
Lemma 1.
The function $f$ defined in Eq. 6 is proper, lower semicontinuous and strictly convex. Moreover, $f^* : C(\Omega) \to \mathbb{R}$ is given by the mapping $h \mapsto \log \int_\Omega e^{h}\, d\rho$, and for any $h \in C(\Omega)$, for
$$d\bar\mu_h = \frac{e^{h}}{\int_\Omega e^{h}\, d\rho}\, d\rho \tag{7}$$
we have $\partial f^*(h) = \{\bar\mu_h\}$.
Proof.
It is trivial to check that $\mathcal{P}(\Omega)$ is nonempty, closed and convex; hence its indicator function is proper, lower semicontinuous and convex. By [5, Thm. 3.2.17], $\mathcal{K}(\cdot, \rho)$ is proper and lower semicontinuous on $\mathcal{M}(\Omega)$. It is furthermore strictly convex: for every $\mu_1 \neq \mu_2$ in $\mathcal{P}(\Omega)$ with densities $f_1$ and $f_2$ with respect to $\rho$, and for $t \in (0,1)$, one has pointwise
$$\big(t f_1 + (1-t) f_2\big)\log\big(t f_1 + (1-t) f_2\big) \;\le\; t\, f_1 \log f_1 + (1-t)\, f_2 \log f_2,$$
where the strict log sum inequality [4, Thm. 2.7.1] has been used with $a_1 = t f_1$, $a_2 = (1-t) f_2$, $b_1 = t$, $b_2 = 1-t$; since $f_1$ and $f_2$ cannot be equal $\rho$-almost everywhere, equality cannot hold throughout. Integrating both sides with respect to $\rho$ demonstrates strict convexity of the divergence, so $f$ is strictly convex.
The convex conjugate is computed as follows:
$$f^*(h) = \sup_{\mu \in \mathcal{P}(\Omega)}\ \int_\Omega h\, d\mu - \mathcal{K}(\mu, \rho).$$
Since $\Omega$ is compact, $\int_\Omega e^{h}\, d\rho < \infty$, as $e^{h}$ is bounded and $\rho$ integrates to $1$; hence $\log \int_\Omega e^{h}\, d\rho$ is finite. Thus, for any admissible $\mu$ with density $\varphi$ with respect to $\rho$,
$$\int_\Omega h\, d\mu - \mathcal{K}(\mu, \rho) = \int_\Omega \log\Big(\frac{e^{h}}{\varphi}\Big)\, d\mu,$$
and since $\log$ is a concave function, Jensen's inequality [23, Thm. 3.3] yields
$$\int_\Omega \log\Big(\frac{e^{h}}{\varphi}\Big)\, d\mu \;\le\; \log \int_\Omega \frac{e^{h}}{\varphi}\, d\mu \;\le\; \log \int_\Omega e^{h}\, d\rho. \tag{8}$$
This upper bound is attained by the probability measure
$$d\bar\mu_h = \frac{e^{h}}{\int_\Omega e^{h}\, d\rho}\, d\rho, \tag{9}$$
which is admissible in the optimization problem defining the convex conjugate; hence
$$f^*(h) = \log \int_\Omega e^{h}\, d\rho$$
is indeed the convex conjugate of $f$. One notes that $\bar\mu_h$ is the unique maximizer since $f$ is strictly convex [27, Prop. 2.5.6]. ∎
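Lemma 1 can be checked numerically when $\rho$ is a discrete measure, in which case $f^*(h)$ reduces to a log-sum-exp and the maximizer (9) to a Gibbs (softmax-like) reweighting of $\rho$. The following sketch (our own, for illustration only) verifies both claims.

```python
# Sanity check of Lemma 1 for a discrete prior rho: the conjugate
# f*(h) = log E_rho[e^h] upper-bounds <h, mu> - K(mu, rho) over the simplex,
# with equality at the Gibbs measure (9).
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n = 6
rho = rng.random(n); rho /= rho.sum()        # discrete prior
h = rng.normal(size=n)                       # test function h

fstar = logsumexp(h, b=rho)                  # log sum_j rho_j e^{h_j}
gibbs = rho * np.exp(h - fstar)              # the maximizer (9)

def objective(mu):                           # <h, mu> - K(mu, rho)
    return h @ mu - np.sum(mu * np.log(mu / rho))

assert np.isclose(objective(gibbs), fstar)   # bound attained at (9)
for _ in range(1000):                        # random feasible mu stay below
    mu = rng.random(n); mu /= mu.sum()
    assert objective(mu) <= fstar + 1e-9
```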
Lemma 2.
The function $g$ defined in Eq. 6 is proper, convex and continuous on $\mathbb{R}^d$. Its convex conjugate is given by $g^*(\lambda) = \frac{1}{2\alpha}\|\lambda\|_2^2 + \langle \lambda, b \rangle$.
Proof.
Properness, convexity and continuity are immediate. For the conjugate, the supremum in $g^*(\lambda) = \sup_{y \in \mathbb{R}^d}\, \langle \lambda, y \rangle - \frac{\alpha}{2}\|y - b\|_2^2$ is attained at $y = b + \lambda/\alpha$, and substituting this value yields the claimed expression. ∎
Finally, we demonstrate that the expectation operator is of the proper form to apply the duality theorem.
Lemma 3.
Let $\Omega \subset \mathbb{R}^d$ be compact; then the expectation operator $\mathbb{E} : \mathcal{M}(\Omega) \to \mathbb{R}^d$, $\mu \mapsto \int_\Omega x \, d\mu(x)$, is linear and continuous. Moreover, its adjoint $\mathbb{E}^* : \mathbb{R}^d \to C(\Omega)$ is given by the mapping $\lambda \mapsto \langle \lambda, \cdot \rangle$.
Proof.
Let $\mu \in \mathcal{M}(\Omega)$; linearity follows trivially from the definition of this functional. Next, continuity can be established as follows:
$$\|\mathbb{E}[\mu]\|_2 = \Big\|\int_\Omega x\, d\mu\Big\|_2 \le \int_\Omega \|x\|_2\, d|\mu| \le K\, \|\mu\|, \tag{10}$$
where the constant is given by $K = \max_{x \in \Omega} \|x\|_2$, which is finite by compactness of $\Omega$. Thus, this operator is bounded and hence continuous. Letting $\lambda \in \mathbb{R}^d$ be arbitrary, the adjoint can be computed as follows:
$$\langle \lambda, \mathbb{E}[\mu] \rangle = \sum_{i=1}^d \lambda_i \int_\Omega x_i\, d\mu = \int_\Omega \langle \lambda, x \rangle\, d\mu = \langle \langle \lambda, \cdot \rangle, \mu \rangle, \tag{11}$$
thus $\mathbb{E}^* \lambda = \langle \lambda, \cdot \rangle$. ∎
We can now state the main duality result.
Theorem 2.
The dual problem to Eq. 3 is
$$\sup_{\lambda \in \mathbb{R}^d}\ \langle b, \lambda \rangle - \frac{\|\lambda\|_2^2}{2\alpha} - \log \int_\Omega e^{\langle C^\top \lambda,\, x \rangle}\, d\rho(x), \tag{12}$$
and strong duality holds. Moreover, if $\bar\lambda$ is optimal for Eq. 12, the expectation of the unique solution $\bar\mu$ of Eq. 3 is given by
$$\mathbb{E}_{\bar\mu}[X] = \frac{\int_\Omega x\, e^{\langle C^\top \bar\lambda,\, x \rangle}\, d\rho(x)}{\int_\Omega e^{\langle C^\top \bar\lambda,\, x \rangle}\, d\rho(x)}. \tag{13}$$
Proof.
The dual problem can be obtained by applying the Fenchel-Rockafellar duality theorem (Thm. 1), with $f$ and $g$ defined in Eq. 6, to the following primal problem
$$\min_{\mu \in \mathcal{M}(\Omega)}\ f(\mu) + g\big((C \circ \mathbb{E})\,\mu\big),$$
and substituting the expressions obtained in Lem. 1, 2 and 3, keeping in mind that the adjoint of $C \circ \mathbb{E}$ evaluated at $\lambda$ is the same as the adjoint of $\mathbb{E}$ evaluated at $C^\top \lambda$.
The relevant conditions to apply this theorem are satisfied, as $\mathcal{M}(\Omega)$ and $\mathbb{R}^d$ are Banach spaces and all conditions on the functions were verified, save for the condition on the intersection of domains. To this effect, it was noted in Lem. 2 that the norm term is continuous on $\mathbb{R}^d$ and in Lem. 1 that the divergence term is proper. Hence the relevant intersection is nonempty and the norm term is continuous on this set. Finally, the primal-dual recovery formula of Thm. 1 is given explicit form by Lem. 1, as it was shown that the unique element of $\partial f^*\big(\mathbb{E}^*(C^\top \bar\lambda)\big)$ is
$$d\bar\mu = \frac{e^{\langle C^\top \bar\lambda,\, x \rangle}}{\int_\Omega e^{\langle C^\top \bar\lambda,\, y \rangle}\, d\rho(y)}\, d\rho.$$
Thus, evaluating $\mathbb{E}_{\bar\mu}[X]$ establishes the claim. ∎
The utility of the dual problem is that it permits a staggering dimensionality reduction, passing from an infinite-dimensional problem to a finite-dimensional one. Moreover, the form of the dual problem makes precise the role of $\alpha$ in Eq. 3. Notably, in [2, Cor. 4.9] the problem
$$\min_{\mu \in \mathcal{P}(\Omega)}\ \big\{\, \mathcal{K}(\mu, \rho)\ :\ \|C\, \mathbb{E}_\mu[X] - b\|_2 \le \varepsilon \,\big\} \tag{14}$$
is paired in duality with Eq. 12 for an appropriate $\varepsilon$ depending on $\alpha$. Thus the choice of $\alpha$ is directly related to the fidelity of $C\, \mathbb{E}_\mu[X]$ to the blurred image $b$. The following section is devoted to the choice of a prior and to describing a method to compute $\mathbb{E}_{\bar\mu}[X]$ directly from a solution of Eq. 12.
3.3. Probabilistic Interpretation of Dual Problem
If no information is known about the original image, the prior is used to impose box constraints on the optimizer, such that its expectation lies in $[0,1]^d$ and nonzero probability is assigned only to measurable subsets of this box. With this consideration in mind, the prior distribution is taken to be the distribution of a random vector $X = (X_1, \dots, X_d)$ with the $X_i$ denoting independent random variables with uniform distributions. If the $i$-th pixel of the original image is unknown, we let $X_i \sim \mathcal{U}[-\epsilon,\, 1+\epsilon]$ for small $\epsilon > 0$ in order to provide a buffer for numerical errors.

However, if the $i$-th pixel of the ground truth image is known to have a value of $v_i$, one can enforce this constraint by taking the random variable $X_i$ to be distributed uniformly on $[v_i - \epsilon,\, v_i + \epsilon]$. Constructing $\rho$ in this fashion guarantees that its support (and hence $\Omega$) is compact.
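In code, this prior is fully described by two arrays of per-pixel bounds. The sketch below (illustrative names, not from our implementation) constructs them from a mask of pixels fixed by the symbology.

```python
# Per-pixel uniform prior U[a_i, c_i]: unknown pixels get the buffered range
# [-eps, 1 + eps]; pixels fixed by the symbology are pinned to a small
# interval around their known value.
import numpy as np

def uniform_prior_bounds(known_mask, known_values, eps=1e-2):
    """Return arrays (a, c) of lower/upper bounds for each pixel."""
    d = known_mask.size
    a = np.full(d, -eps)
    c = np.full(d, 1.0 + eps)
    idx = np.flatnonzero(known_mask.ravel())
    v = known_values.ravel()[idx]
    a[idx] = v - eps
    c[idx] = v + eps
    return a, c
```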
To deal with the integrals in Eq. 12 and Eq. 13, it is convenient to note that (cf. [21, Sec. 4.4])
$$\int_\Omega e^{\langle \eta,\, x \rangle}\, d\rho(x) = M_\rho(\eta),$$
the moment-generating function of $\rho$ evaluated at $\eta$. Since the $X_i$ are independently distributed, $M_\rho(\eta) = \prod_{i=1}^d M_{\rho_i}(\eta_i)$ [21, Sec. 4.4], and since the $X_i$ are uniformly distributed on intervals $[a_i, c_i]$, one has
$$M_{\rho_i}(\eta_i) = \begin{cases} \dfrac{e^{c_i \eta_i} - e^{a_i \eta_i}}{(c_i - a_i)\, \eta_i} & \text{if } \eta_i \neq 0, \\[1ex] 1 & \text{if } \eta_i = 0, \end{cases}$$
and therefore the dual problem with this choice of prior can be written as
$$\sup_{\lambda \in \mathbb{R}^d}\ \langle b, \lambda \rangle - \frac{\|\lambda\|_2^2}{2\alpha} - \sum_{i=1}^d \log M_{\rho_i}\big((C^\top \lambda)_i\big). \tag{15}$$
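Numerically, evaluating $\log M_{\rho_i}$ and its derivative requires some care: the defining ratio is indeterminate at $\eta_i = 0$ and overflows for large $|\eta_i|$. The following sketch (our naming) uses a series expansion near zero and factors out the dominant exponential elsewhere.

```python
# Stable log-moment-generating function of U[a, c] and its derivative; these
# are the only per-pixel ingredients needed by the dual (15) and recovery (16).
import numpy as np

def _expand(t, a, c):
    """Broadcast the bounds a, c (scalars or arrays) against t."""
    t = np.atleast_1d(np.asarray(t, float))
    a = np.broadcast_to(a, t.shape).astype(float)
    c = np.broadcast_to(c, t.shape).astype(float)
    return t, a, c

def log_mgf_uniform(t, a, c):
    """log M(t) for X ~ U[a, c], M(t) = (e^{ct} - e^{at}) / ((c - a) t)."""
    t, a, c = _expand(t, a, c)
    u = (c - a) * t
    out = np.empty_like(u)
    s = np.abs(u) < 1e-8                       # Taylor series near t = 0
    out[s] = (a[s] + c[s]) / 2 * t[s] + (c[s] - a[s]) ** 2 * t[s] ** 2 / 24
    p, n = ~s & (u > 0), ~s & (u < 0)          # factor out dominant exponential
    out[p] = a[p] * t[p] + u[p] + np.log1p(-np.exp(-u[p])) - np.log(u[p])
    out[n] = c[n] * t[n] - u[n] + np.log1p(-np.exp(u[n])) - np.log(-u[n])
    return out

def dlog_mgf_uniform(t, a, c):
    """(log M)'(t); evaluating it at (C^T lambda)_i yields the recovery (16)."""
    t, a, c = _expand(t, a, c)
    u = (c - a) * t
    out = np.empty_like(u)
    s = np.abs(u) < 1e-8
    out[s] = (a[s] + c[s]) / 2 + (c[s] - a[s]) ** 2 * t[s] / 12
    p, n = ~s & (u > 0), ~s & (u < 0)
    out[p] = (c[p] - a[p] * np.exp(-u[p])) / (-np.expm1(-u[p])) - 1.0 / t[p]
    out[n] = (c[n] * np.exp(u[n]) - a[n]) / np.expm1(u[n]) - 1.0 / t[n]
    return out
```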
An optimizer of Eq. 15 can be determined using a number of standard numerical solvers. We opted for the implementation [3] of the L-BFGS algorithm due to its speed and efficiency.
Since only the expectation of the optimal probability measure for Eq. 3 is of interest, we compute the $i$-th component of the expectation of the optimizer provided by the primal-dual recovery formula Eq. 13 via
$$\mathbb{E}_{\bar\mu}[X]_i = \frac{\partial}{\partial \eta_i} \log M_\rho(\eta)\Big|_{\eta = C^\top \bar\lambda}.$$
Using the independence assumption on the prior, we obtain
$$\frac{\partial}{\partial \eta_i} \log M_\rho(\eta) = \frac{M_{\rho_i}'(\eta_i)}{M_{\rho_i}(\eta_i)},$$
such that the best estimate of the ground truth image is given by
$$\hat{x}_i = \frac{M_{\rho_i}'\big((C^\top \bar\lambda)_i\big)}{M_{\rho_i}\big((C^\top \bar\lambda)_i\big)}, \qquad i = 1, \dots, d. \tag{16}$$
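Putting the pieces together, the non-blind step amounts to maximizing Eq. 15 and applying Eq. 16. The sketch below does so with the FFT standing in for $C$ (circular boundary conditions) and SciPy's L-BFGS-B implementation; it reuses log_mgf_uniform and dlog_mgf_uniform from the preceding sketch, and all names and defaults are illustrative.

```python
# Non-blind MEM deconvolution sketch: maximize the dual (15), then recover the
# image via (16). Assumes log_mgf_uniform / dlog_mgf_uniform are in scope.
import numpy as np
from scipy.optimize import minimize

def mem_deconvolve(b, k, a_bnd, c_bnd, alpha=1e4):
    H, W = b.shape
    kh, kw = k.shape
    k_pad = np.zeros((H, W))
    k_pad[:kh, :kw] = k
    k_pad = np.roll(k_pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    K = np.fft.fft2(k_pad)

    # C and C^T applied via FFTs (C^T corresponds to correlation).
    C  = lambda v: np.real(np.fft.ifft2(K * np.fft.fft2(v.reshape(H, W)))).ravel()
    CT = lambda v: np.real(np.fft.ifft2(np.conj(K) * np.fft.fft2(v.reshape(H, W)))).ravel()
    bv = b.ravel()

    def neg_dual(lam):               # negated objective and gradient of (15)
        t = CT(lam)
        val = bv @ lam - lam @ lam / (2 * alpha) - np.sum(log_mgf_uniform(t, a_bnd, c_bnd))
        grad = bv - lam / alpha - C(dlog_mgf_uniform(t, a_bnd, c_bnd))
        return -val, -grad

    res = minimize(neg_dual, np.zeros(H * W), jac=True, method="L-BFGS-B")
    # Primal-dual recovery, Eq. (16): x_i = (log M_i)'((C^T lambda)_i).
    return dlog_mgf_uniform(CT(res.x), a_bnd, c_bnd).reshape(H, W)
```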
3.4. Blind Deblurring
In order to implement blind deblurring on images that incorporate a symbology, one must first estimate the convolution kernel responsible for blurring the image. This step can be performed by analyzing the blurred symbolic constraints. We propose a method that is based on the entropic regularization framework discussed in the previous sections.
In order to perform this kernel estimation step, we will use the same framework as Eq. 3, with the kernel $k$ playing the role of $x$. Under the assumption that the kernel is of size $m \times m$, we take $\Omega = [-\epsilon,\, 1+\epsilon]^{m^2}$ for small $\epsilon > 0$ (again to account for numerical error) and consider the problem
$$\min_{\mu \in \mathcal{P}(\Omega)}\ \frac{\beta}{2}\,\|X_s\, \mathbb{E}_\mu[K] - b_s\|_2^2 + \mathcal{K}(\mu, \rho). \tag{17}$$
Here, $\beta$ is a parameter that enforces fidelity, and $x_s$ and $b_s$ are taken to be the segments of the original and blurred images which are fixed by the symbolic constraints; since convolution commutes, $X_s$ denotes the convolution matrix built from $x_s$, acting on the vectorized kernel. By analogy with Eq. 3, the expectation of the optimizer of Eq. 17 is taken to be the estimated kernel. The role of $\rho$ is to enforce the fact that the kernel should be normalized and nonnegative (hence its components should be elements of $[0,1]$); we therefore take its distribution to be the product of uniform distributions on $[-\epsilon,\, 1+\epsilon]$. As in the non-blind deblurring step, the expectation of the optimizer of Eq. 17 can be determined by passing to the dual problem (which is of the same form as Eq. 15), solving the dual problem numerically, and using the primal-dual recovery formula Eq. 16. A summary of the blind deblurring algorithm is compiled in Algorithm 1.
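The kernel-estimation step admits an analogous sketch: by commutativity of convolution, the matrix is now built from sliding windows of the known symbology patch and the unknown is the (vectorized) kernel. The code below is illustrative and reuses the log-MGF helpers above; the windowing estimates a flipped kernel, which is flipped back at the end, and $b_s$ is assumed to be the valid region of the blurred patch.

```python
# Kernel estimation sketch for (17): rows of A are m x m windows of the known
# patch x_s, so the valid convolution satisfies A @ vec(flip(k)) = b_s.
# Intended for small patches (A is built explicitly).
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from scipy.optimize import minimize

def estimate_kernel(b_s, x_s, m, beta=1e4, eps=1e-2):
    A = sliding_window_view(x_s, (m, m)).reshape(-1, m * m)
    y = b_s.ravel()                          # valid region, same row ordering as A
    a_bnd = np.full(m * m, -eps)             # uniform prior U[-eps, 1 + eps]
    c_bnd = np.full(m * m, 1.0 + eps)

    def neg_dual(lam):                       # same dual form as (15), with A for C
        t = A.T @ lam
        val = y @ lam - lam @ lam / (2 * beta) - np.sum(log_mgf_uniform(t, a_bnd, c_bnd))
        grad = y - lam / beta - A @ dlog_mgf_uniform(t, a_bnd, c_bnd)
        return -val, -grad

    res = minimize(neg_dual, np.zeros(y.size), jac=True, method="L-BFGS-B")
    k_flip = dlog_mgf_uniform(A.T @ res.x, a_bnd, c_bnd).reshape(m, m)
    k = k_flip[::-1, ::-1]                   # undo the flip from the windowing
    return k / k.sum()                       # renormalize the kernel
```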
This method can be further refined to compare only the pixels of the symbology which are not convolved with unknown pixels of the image. By choosing these specific pixels, one can greatly improve the quality of the kernel estimate, as every pixel that was blurred to form the signal is known; however, this refinement limits the size of the convolution kernel which can be estimated.
4. Results
We present results obtained using our method on synthetically blurred images. We begin with deconvolution, i.e. when the blurring kernel is known. Fig. 1 demonstrates the power of this method: even from a highly blurred image, a near perfect estimate of the original image can be obtained provided the fidelity parameter $\alpha$ is sufficiently large (in practice, large values of $\alpha$ yield good results). Fig. 3 provides an example in which a blurry and noisy image has been deblurred using the non-blind deblurring method. We note that the method does not actively denoise blurred images, so a preprocessing step consisting of expected patch log likelihood (EPLL) denoising [28] is first performed. The resulting image is subsequently deblurred, and finally TV denoising [22] is used to smooth the image. Note that for binary images such as text, TV denoising can be replaced by a thresholding step. Fig. 2 demonstrates the effects of different magnitudes of noise on a uniformly colored image.
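The full pipeline for noisy inputs can be summarized as follows. EPLL denoising is not part of standard Python imaging libraries, so epll_denoise below is a placeholder for any implementation of [28]; mem_deconvolve refers to the sketch of Sec. 3.3, and the TV step uses scikit-image.

```python
# Pipeline sketch for noisy inputs: EPLL pre-denoising, MEM deconvolution,
# TV post-smoothing. `epll_denoise` is a hypothetical placeholder; defaults
# are illustrative only.
from skimage.restoration import denoise_tv_chambolle

def deblur_noisy(b, k, a_bnd, c_bnd, alpha=1e3, tv_weight=0.05):
    b_dn = epll_denoise(b)                             # placeholder for EPLL [28]
    x = mem_deconvolve(b_dn, k, a_bnd, c_bnd, alpha=alpha)
    return denoise_tv_chambolle(x, weight=tv_weight)   # smoothing post-process
```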
Results for blind deblurring are compiled in Figs. 4 and 6. In this case, suitable fixed choices of $\alpha$ and $\beta$ provide good results in the noiseless case, and a different choice is adequate for the noisy case, but these parameters require manual tuning to yield the best results.
4.1. The Effects of Noise
In the presence of additive noise, attempting to deblur images using methods that are not tailored for noise is generally ineffective. Indeed, the image acquisition model is replaced by $b = k \ast x + n$, where $n$ denotes the added noise. The noiseless model posits that the captured image should be relatively smooth due to the convolution, whereas the added noise sharpens segments of the image randomly, so the two models are incompatible. However, Figs. 3 and 4 show that our method yields good results in both deconvolution and blind deblurring when a denoising preprocessing step and a smoothing postprocessing step are utilized.
Remarkably, the blind deblurring method is more robust to the presence of additive noise in the blurred image. Indeed, accurate results were obtained at substantially higher levels of Gaussian noise in the blind case, whereas in the non-blind case the quality of the recovery diminished past a comparatively low noise level. This is due to the fact that the preprocessing step fundamentally changes the blurring kernel of the image: we are therefore attempting to deconvolve the image with the wrong kernel, leading to aberrations. On the other hand, the estimated kernel for blind deblurring is likely to approximate the kernel modified by the preprocessing step, leading to better results. Moreover, a sparse (Poisson) prior was used in the kernel estimate for the results in Fig. 4 so as to mitigate the effects of noise on the symbology.
Finally, we note that there is a tradeoff between the magnitude of blurring and the magnitude of noise. Indeed, large amounts of noise can be dealt with only if the blurring kernel is relatively small and for large blurring kernels, only small amounts of noise can be considered. This is due to the fact that for larger kernels, deviations in kernel estimation affect the convolved image to a greater extent than for small kernels.
4.2. The Role of the Prior
The choice of prior is not limited to a product of uniform distributions. Indeed, depending on the situation, specialized priors can be constructed in order to better capture the underlying properties exhibited by the image. We provide as examples the use of Bernoulli priors to model binary data and of Poisson priors to model sparsity. Fig. 7 presents a comparison of deblurring a binary text image using different priors with the same choice of $\alpha$. In this case, sparsity has been used to promote the presence of a white background by inverting the colors of the channel during the deblurring process.
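Concretely, swapping priors only changes the per-pixel log-moment-generating function and its derivative in Eq. 15 and Eq. 16. A sketch for the two priors mentioned above (our naming; the Bernoulli case assumes $0 < p < 1$):

```python
# Per-pixel log-MGFs and derivatives for Bernoulli (binary images) and Poisson
# (sparsity) priors; p and lam may be scalars or per-pixel arrays.
import numpy as np

def log_mgf_bernoulli(t, p):       # M(t) = 1 - p + p e^t, for 0 < p < 1
    return np.logaddexp(np.log1p(-p), t + np.log(p))

def dlog_mgf_bernoulli(t, p):      # mean of the exponentially tilted Bernoulli
    return np.exp(t + np.log(p) - log_mgf_bernoulli(t, p))

def log_mgf_poisson(t, lam):       # M(t) = exp(lam (e^t - 1))
    return lam * np.expm1(t)

def dlog_mgf_poisson(t, lam):      # mean of the tilted Poisson: lam e^t
    return lam * np.exp(t)
```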
5. Discussion
It is surprising that inference schemes of the type considered in this paper have not yet been applied to image deblurring. Indeed, the principle of maximum entropy was first developed in a pair of papers published by E.T. Jaynes in 1957. Furthermore, the theory of FenchelRockafellar duality is wellestablished in the convex analysis literature and has found applications to solving maximum entropy estimation problems (cf. [6]).
Since our algorithm models blur as the convolution of the clean image with a single unknown blur kernel, it relies crucially on the spatial uniformity of the blur. This assumption may not hold in certain cases. For example, an image captured by a steady camera that contains a feature that moves during image capture will exhibit non-uniform motion blur. It may be of interest to explore extensions of this algorithm that divide the observed image into patches and estimate different blur kernels for each patch (cf. the motion flow approach proposed in [7]).
Finally, our method is flexible with respect to the choice of prior. Indeed, any distribution with a tractable moment-generating function may serve as a prior for the Kullback-Leibler divergence term in our primal problem. It would be interesting to consider using priors that encode global information regarding the latent image. For example, one might construct a class-specific prior that assigns high likelihood to images within a particular class (such as face images) and low likelihood to all other images (cf. [1]). A Gaussian mixture model trained on a particular class of images may serve as a useful prior for such purposes.
Implementation Details
All figures were generated by implementing the methods in the Python programming language using the Jupyter notebook environment. Images were blurred synthetically using motion blur kernels taken from [12] as well as Gaussian blur kernels to simulate out-of-focus blur. The relevant convolutions are performed using fast Fourier transforms. Images that are not standard test bank images were generated using the GNU Image Manipulation Program (GIMP); moreover, this software was used to add symbolic constraints to images that did not originally incorporate them. All testing was performed on a laptop with an Intel i5-4200U processor. The running time of this method depends on a number of factors, such as the size of the image being deblurred, whether the image is monochrome or color, the desired quality of the reproduction (controlled by the parameter $\alpha$), the size of the kernel, and whether or not the kernel is given. If a very accurate result is required, runtimes vary from a few seconds for a small monochrome text image blurred with a small kernel to upwards of an hour for a highly blurred color image.
Footnotes
For convex functions, lower semicontinuity is equivalent to weak lower semicontinuity (cf. [27, Thm. 2.2.1]).
References
 (2019) Image Deblurring with a Class-Specific Prior. IEEE Trans. Pattern Anal. Mach. Intell. 41, pp. 2112–2130. Cited by: §5.
 (1992) Partially finite convex programming, Part I: Quasi relative interiors and duality theory. Math. Program. 57, pp. 15–48. Cited by: §3.2.
 (1995) A Limited Memory Algorithm for Bound Constrained Optimization. SIAM J. Sci. Comput. 16, pp. 1190–1208. Cited by: §3.3.
 (2006) Elements of Information Theory. Wiley. Cited by: §3.2.
 (1989) Large Deviations. Academic Press. Cited by: §3.2.
 (2007) Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling. Journal of Machine Learning Research 8, pp. 1217–1260. Cited by: §5.
 (2016) From motion blur to motion flow: a deep learning solution for removing heterogeneous motion blur. CVPR 2016. Cited by: §5.
 (1957) Information Theory and Statistical Mechanics. Phys. Rev. 106, pp. 620–630. Cited by: §1.
 (1957) Information Theory and Statistical Mechanics. II. Phys. Rev. 108, pp. 171–190. Cited by: §1.
 (2009) Fast image deconvolution using hyper-Laplacian priors. NIPS 2009, pp. 1033–1041. Cited by: Figure 1, Figure 3.
 (2018) DeblurGAN: Blind Motion Deblurring Using Conditional Adversarial Networks. CVPR 2018, pp. 8183–8192. Cited by: §2.
 (2009) Understanding and Evaluating Deconvolution Algorithms. CVPR 2009, pp. 1964–1971. Cited by: Implementation Details.
 (2017) Deep Multi-scale Convolutional Neural Network for Dynamic Scene Deblurring. CVPR 2017, pp. 3883–3891. Cited by: §2.
 (2017) Motion Deblurring in the Wild. GCPR 2017, pp. 65–77. Cited by: §2.
 (2017) Regularized Intensity and Gradient Prior for Deblurring Text Images and Beyond. IEEE Trans. Pattern Anal. Mach. Intell. 39 (2), pp. 342–355. Cited by: §1, §2, Figure 4.
 (2016) Blind Image Deblurring Using Dark Channel Prior. CVPR 2016, pp. 1628–1636. Cited by: §1, §2.
 (2017) Deep Generative Filter for Motion Deblurring. ICCV 2017, pp. 2993–3000. Cited by: §2.
 (2016) Image Deblurring via Enhanced LowRank Prior. IEEE Trans. Image Process. 25 (7), pp. 3426–3437. Cited by: §2.
 (2019) Blind Deblurring of Barcodes via Kullback-Leibler Divergence. IEEE Trans. Pattern Anal. Mach. Intell., in press. doi: 10.1109/TPAMI.2019.2927311. Cited by: §1, §1.
 (2009) Variational Analysis. Springer. Cited by: §3.2.
 (2001) An Introduction to Probability and Statistics. Wiley. Cited by: §3.3.
 (1992) Nonlinear total variation based noise removal algorithms. Physica D 60, pp. 259–268. Cited by: §4.
 (1987) Real and Complex Analysis. McGrawHill. Cited by: §3.2, §3.2.
 (2015) Learning to Deblur. IEEE Trans. Pattern Anal. Mach. Intell. 38 (7), pp. 1439–1451. Cited by: §2.
 (2018) Scale-Recurrent Network for Deep Image Deblurring. CVPR 2018, pp. 8174–8182. Cited by: §2.
 (1977) Solutions of Ill-Posed Problems. Winston and Sons. Cited by: §2.
 (2002) Convex Analysis in General Vector Spaces. World Scientific. Cited by: item 1, item 2, §3.2, §3.2, §3.2, Theorem 1, footnote 1.
 (2011) From Learning Models of Natural Image Patches to Whole Image Restoration. ICCV 2011, pp. 479–486. Cited by: §1, Figure 1, §4.