# Fast and Robust Linear Motion Deblurring

###### Abstract

We investigate efficient algorithmic realisations for robust deconvolution of grey-value images with known space-invariant point-spread function, with emphasis on 1D motion blur scenarios. The goal is to make deconvolution suitable as preprocessing step in automated image processing environments with tight time constraints. Candidate deconvolution methods are selected for their restoration quality, robustness and efficiency. Evaluation of restoration quality and robustness on synthetic and real-world test images leads us to focus on a combination of Wiener filtering with few iterations of robust and regularised Richardson-Lucy deconvolution. We discuss algorithmic optimisations for specific scenarios. In the case of uniform linear motion blur in coordinate direction, it is possible to achieve real-time performance (less than 50 ms) in single-threaded CPU computation on images of pixels. For more general space-invariant blur settings, still favourable computation times are obtained. Exemplary parallel implementations demonstrate that the proposed method also achieves real-time performance for general 1D motion blurs in a multi-threaded CPU setting, and for general 2D blurs on a GPU.

∎

## 1 Introduction

Besides noise, blur is perhaps the most common type of degradation in a wide range of imaging modalities. Sources of blur vary widely, from atmospheric perturbations via optical aberrations, motion of both objects and imaging appliances down to defocussing. Common to all kinds of blur is that it interferes with the detection of localised image features, and thereby impedes interpretation of images, be it by humans or by automated systems.

It has therefore been a long-standing goal of image processing research to remove blur from so degraded images. For this task, deconvolution, numerous approaches have been designed that differ in the requirements they impose on the input data, the quality of results, and the computational effort. First work in the context of 1D signal processing goes back almost eighty years vanCittert-ZPhys33 (). Approaches studied since then range from fast linear filters and Fourier-based methods like the Wiener filter Wiener-Book49 () to non-linear iterative schemes derived from variational and/or statistical models Bar-scs05 (); Dey-MRT06 (); Krishnan-nips09 (); Lucy-AJ74 (); Richardson-JOSA72 (); Wang-SIIMS08 (); Welk-tr10 (). All sorts of deconvolution problems are highly ill-posed, and there is inevitably a trade-off between restoration quality and speed of computation.

#### Basic classification of deconvolution tasks.

In the classification of deconvolution problems, the most important dichotomy is that between non-blind and blind deconvolution. Whereas in the first case the blur is known, blind deconvolution assumes that only the observed image is available, and the blur is to be estimated along with the sharp image. As blind deconvolution is much more underconstrained than deconvolution with known blur, it involves higher computational cost and often generates poorer results.

A second distinction is that between space-invariant blur, in which the so-called point-spread function is the same for all locations in the image domain, and space-variant blur, where different pixels are smeared in different ways. In the space-invariant case, the blur operator reduces to a standard convolution, such that Fourier methods can be used, and also forward (blur) operations in iterative schemes can be computed by the Fast Fourier Transform.

#### Towards real-time deconvolution.

In this contribution, we consider robust deconvolution with known space-invariant blur under tight time constraints, with the aim to achieve real-time performance at least in some cases. We will therefore select deconvolution methods for robust deconvolution performance, and discuss measures to optimise their algorithmic efficiency.

Real-time deconvolution has also been considered in some recent papers, see Hirsch-iccv11 (); Klosowski-spie11 (). Both papers address the case of general 2D blur on the basis of the algorithm from Krishnan-nips09 (), and reach real-time performance when using GPUs with several hundred computing units under CUDA. In Hirsch-iccv11 (), even blind deconvolution is considered; however, the blind deconvolution case is far from real-time performance even when computing on a GPU.

Comparing to these contributions, however, our goal is different. We want to investigate under which conditions real-time performance can be achieved in single-threaded CPU computation. This was an essential requirement of the application context from which this work emerged. Additional work on multi-threaded CPU and GPU realisations is presented to put the results in context.

To accomplish this task, we are willing, in turn, to sacrifice some generality: We focus on setups in which some parameters can be chosen in a way such as to reduce the computational expense. For example, we accept the limitation to power-of-two image dimensions such as to profit best from Fast Fourier Transform. Also, we consider settings restricted to either one-dimensional blur, or even more specifically just linear motion blur, and assume herein that the camera system can be adjusted in such a way that the 1D blur is in the direction of a coordinate axis of the sensor.

#### Robustness.

On the other hand, with practical applications of deconvolution as goal, robustness is a crucial aspect for our investigation. The concept of robustness originates from statistics Huber-book81 () and refers to estimators that are as insensitive as possible to outliers and model violations. In the case of deconvolution, this includes not only strong noise (e.g., salt-and-pepper noise in Bar-scs05 ()) but also imprecise blur estimates, or errors near the image boundary caused by blurring across the boundary, see Welk-dagm05 ().

There is a long tradition of testing deconvolution algorithms with synthetically blurred images. Many papers, including Klosowski-spie11 (); Krishnan-nips09 (); You-icip96 (), use only test cases of this kind. Often also the noise levels being considered are fairly low (in the range of the quantisation noise already caused by discretising grey-values to integers in ). Such test scenarios have their merit because they allow to measure reconstruction error against a ground truth. However, it is particularly dangerous in the case of such a severely ill-posed problem like deconvolution to rely only on tests with synthetic blur. Methods that perform well on such data do often not live up to their promises when applied to real-world images whose blur or noise deviates slightly from the perfect theoretical model underlying the deconvolution approach.

#### Application context.

The motivation for the work presented in this paper is to devise a deconvolution framework that can be used as preprocessing step for further automated image processing tasks under real-time conditions within industrial processes. It is important to note that the embedding into an industrial process also limits the computational resources that are available for computation. While parallelisation on multicore CPUs and GPUs plays an ever-increasing role in current high-performance computing, an essential constraint of the application context that motivated the present work was that not more than one CPU kernel could be assigned to the image processing task.

The setting under consideration involved imaging moving objects by stationary cameras. The resulting motion blur is to be compensated by deconvolution in order to allow the detection and localisation of features on moving objects as well as recognition of shapes. To this end, a sufficient sharpening of edges and lines would be necessary. Ringing artifacts should be suppressed sufficiently in order to not create false detections. Finally, in order to be feasible together with subsequent processing tasks in real-time, deconvolution should not take more than about 50 ms of single-threaded computation on a contemporary standard CPU.

In addressing this problem, setups of different degree of generality were considered: firstly, blur by uniform linear motion; secondly, blur by non-uniform linear motion; and thirdly, a general 2D blur model. In the linear motion cases, it was assumed that the blur direction would be aligned with a coordinate axis of the imaging system by technical measures. In all cases, it was assumed that the imaging conditions would be sufficiently controllable to ensure that the blur would be known, so no blind deconvolution was considered.

## 2 Mathematical Models for Deconvolution

To describe deconvolution models that were investigated in order to accomplish the task outlined above, we start from the common spatially invariant blur model for grey-value images,

(1) |

where is the observed blurred image, is the unobservable sharp image, the (known) space-invariant point-spread function (PSF), and additive noise. By we denote the location in the image plane.

In the following, denotes the processed image that is to approximate . In iterative methods, we denote the -th iterate by . Some methods involve convolution with , the adjoint of . In our space-invariant setting, is just reflected at the origin, i.e. .

### 2.1 Wiener Filter

The Wiener filter Wiener-Book49 () filters the Fourier transform of to approximately invert the multiplication with that corresponds to the convolution with ; it reads

(2) |

with the filter parameter that dampens the amplification of those frequencies for which is zero or small. Note that , the complex conjugate of , is the Fourier transform of . Generally, the sharpening effect of the filter is the more pronounced, the smaller is chosen. However, smaller also implies stronger amplification of noise, such that larger must be used when strong noise is present. Using a Gaussian noise model, can be chosen dependent on the standard deviation of noise such as to minimise the mean square error of the approximation of by . As a single-step method, the Wiener filter is fast, but its applicability is limited since it can hardly cope with more severe noise, and like all linear methods tends to produce oscillatory over- and undershoots near contrasted structures, known as ringing artifacts.

### 2.2 Richardson-Lucy Deconvolution

One of the most time-proven and popular iterative deconvolution method is Richardson-Lucy (RL) deconvolution Lucy-AJ74 (); Richardson-JOSA72 ()

(3) |

The single parameter of this method is the number of iterations. Unlike the Wiener filter, RL requires and preserves positivity of grey-values, and is related to a Poisson noise model. The method features a semi-convergence behaviour: With increasing number of iterations, sharpness is improved, but at the same time noise is amplified and leads to divergence after an initial convergence phase.

### 2.3 Variational Models for Deconvolution

A larger class of iterative methods arises from variational models. In these, one aims at minimising energy functionals Bar-scs05 (); You-icip96 () that combine a so-called data term which penalises deviations from the blur model with a regulariser enforcing some smoothness constraint on the image. Such an energy functional can read as Bar-scs05 (); Welk-dagm05 ()

(4) |

where are non-decreasing differentiable functions. Both contributions are balanced by a positive weight parameter .

Besides the identity function , which corresponds to a quadratic error measure, a common choice for the penaliser function is the regularised error with a small , see e.g. Bar-scs05 (). The same type of function can be used for . The resulting total variation penaliser Rudin-PHYSD92 () is popular in image processing due to its favourable edge-preserving behaviour. In order to even enhance edges, one can use even penalisers that are non-convex w.r.t. , see Almeida-TIP10 (); Welk-TBW-scs05 (). Quadratic penalisation in the regulariser is rarely used nowadays in deblurring because of its blurring effect that directly counteracts the data term.

The objective function in Krishnan-nips09 (), motivated there from statistical considerations, is a discrete version of the energy functional

(5) |

which obviously amounts to an instance of (4) with the quadratic data penalty , and a non-convex power function as regulariser. With regard to the specific optimisation that is employed for minimisation in Krishnan-nips09 (), special values like are preferred. Also in Wang-SIIMS08 () an energy of this type is minimised but with total variation regularision, i.e. .

In Welk-tr10 (), the energy functional

(6) |

is introduced whose data term does not depend on but instead of the so-called information divergence

(7) |

A simpler version of the functional (6) can also be linked to the Richardson-Lucy method Snyder-TIP92 ().

#### Techniques for energy minimisation.

A gradient descent for (4) can be derived using the Euler-Lagrange framework. Discretising in time by an Euler forward scheme with time step size one arrives at the iteration

(8) |

whose update term combines a sharpening term (first summand) that acts to reduce the data term with a nonlinear diffusion term (second summand) with the diffusivity that enforces image regularity, compare Welk-dagm05 (). The functions are nonincreasing.

Unfortunately, gradient descent is a fairly slow procedure, and thus presently not a candidate for deconvolution under real-time or near-real-time conditions.

The half-quadratic optimisation technique used in Krishnan-nips09 (); Wang-SIIMS08 () is another approach to minimising an energy approach. The nonlinearity in the regulariser is decoupled from the minimisation of the data term by means of additional variables and a coupling term, weighted by an additional parameter . Following a continuation strategy, is successively increased during the computation. For details see Krishnan-nips09 (); Wang-SIIMS08 ().

For the energy functional (6) a positivity-preserving iterative minimisation scheme in the style of the RL method can be derived Elhayek-dagm11 (); Welk-tr10 (), which is called robust and regularised RL deconvolution (RRRL) and reads as

(9) |

where

(10) | ||||

(11) | ||||

(12) |

The numerical realisation of these methods will be considered in Section 4.1.

In theory, the different noise models underlying deconvolution models, such as Gaussian noise for the Wiener filter, Poisson noise for RL etc., render these models suitable for distinct application areas. In practical application, however, often not all sources of noise are known and controllable, such that it is unavoidable to apply deconvolution methods also to data that do not perfectly match their respective noise models. Indeed, robustness, as mentioned in the introduction, is all about models being capable of handling this kind of mismatch.

## 3 Evaluation of Deconvolution Quality

To assess the suitability of the before-mentioned approaches for our deconvolution task, we start by evaluating their restoration quality. Gradient descent is not considered further because of its slowness mentioned earlier. Apart from that, efficiency optimisation is still not the goal at this point. All tests in this section are therefore based on deconvolution implementations for general 2D blurs with Fourier convolutions.

We compare thus Wiener filter, Richardson-Lucy deconvolution (RL), the iterative methods by Wang et al. (WYYZ) Wang-SIIMS08 () and by Krishnan and Fergus (KF) Krishnan-nips09 (), and RRRL Elhayek-dagm11 (); Welk-tr10 ().

### 3.1 Synthetically Blurred Images

Our first test scenario is based on a version of the popular cameraman test image which has been synthetically blurred with a space-invariant “camera-shake”-like point-spread function of irregular shape, Fig. 1(a). No noise besides the quantisation noise is present in this test image.

Since the deconvolution methods are implemented with Fourier domain convolution, the blur in the test images has been carried out via the spatial domain in order to avoid what is known as an “inverse crime” Colton-Book92 (). The boundary condition in the convolution was chosen as constant continuation (along normals to the boundary) which contrasts to the periodic continuation implicitly involved in the Fourier convolution in the deblurring algorithms.

Table 1 collects signal-to-noise ratios

for blurred and deblurred images compared to the unperturbed cameraman image .

In Fig. 2, we test Wiener filter, KF and RRRL on three other synthetically blurred cameraman images to study the stability of results under additional noise and stronger blur. For signal-to-noise ratios see again Table 1. In Fig. 2(a–d), a fairly low amount of Gaussian noise is added to the test image. Subfig. (e–h) consider a spatially more extended point-spread function. A more drastic noise – impulse noise with density – is added to the first test image in Subfig. (i–l).

It is evident both visually and from the SNR figures that RRRL, KF, and WYYZ generally allow for a good restoration. The Wiener filter is sensitive to boundary artifacts for larger blurs, and to non-Gaussian noise. The robust data term of the RRRL model gives it also an advantage over the KF and WYYZ methods in settings with boundary artifacts and more severe noise.

###
3.2 Combined Wiener-RRRL Method (WR^{3}L)

It should be noted that in spite of its favourable properties RRRL still takes fairly many iterations (30…100, depending on noise level) to achieve an acceptable degree of sharpness. On the other hand, Wiener filtering, being a non-iterative method, provides a reasonable sharpness in one fast computation step but at the cost that the remaining noise and ringing artifacts are more pronounced.

This motivates us to test a combined approach in which the Wiener filter is used as a first step, followed by an RRRL iteration for which the Wiener result acts as initialisation. A caveat in doing so is that the Wiener filter output can contain zero or negative grey-values, which cannot be handled within the RRRL model. However, negative grey-values appear as part of artifacts anyway, so this can be countered simply by replacing all zero or negative grey-values to a small positive value in the input for RRRL.

Table 1 includes this method, WR^{3}L,
in the last column. Note that even with as few as 5 iterations
WR^{3}L performs in most cases comparable to about
30 iterations of pure RRRL. Visual evaluation of
WR^{3}L will be included in subsequent experiments.

### 3.3 Restoration Quality for Real-World Images

We turn now to real-world examples, taken under conditions similar to the industrial production context that our development is directed at. From here on, we have to rely on visual comparisons since a ground truth from which SNR could be computed is no longer available. Also, no exact knowledge on the noise distribution is available. On the other hand, the motion parameters determining the blur can be adjusted in these setups, such that the point-spread function is known. Moreover, the setting with objects being imaged before a uniform background largely removes boundary artifacts – in particular, periodic continuation is unproblematic in this case.

In Figs. 3 and 4 we present two real-world
test images with linear uniform motion blur.
Deblurring results using
Wiener filter, RL, KF, RRRL visually confirm the findings
from our synthetic blur experiments. The Wiener filter
allows for a visually favourable sharpening that can be equalled
by RL and RRRL only after about 30 iterations. However,
amplified noise and artifacts appear more prominent in the Wiener
filter result.
The combination WR^{3}L is studied in
Subfigures (f–h) of both figures. Visually, the difference between
the Wiener filter result, Fig. 3(b)/4(b),
and the result after additional 5 iterations of RRRL, Subfig. (f),
appears to be small. The exaggeration with 30 iterations
in Subfig. (g) and the difference image in Subfig. (h)
demonstrate that there is an improvement, though:
the RRRL
iterations keep the good initial sharpening of the Wiener filter,
while noise and artifacts are reduced.
Ringing artifacts and noise are slightly more prominent in the KF
result than in the RRRL or WR^{3}L results, which
underlines that robust data terms are indeed advantageous in real-world
data.

The effect of the combination WR^{3}L becomes even
more evident in Figure 5,
where the original image has been degraded by additional Gaussian
noise: Here, it is clear that the combination of Wiener filter
and RRRL combines the sharpness achieved by a single Wiener filter
step (but not by RRRL alone) with a visible noise reduction.

Summarising the observations of this section, the combination of Wiener filter with at least five subsequent iterations of RRRL provides a reasonable restoration quality. Our efficiency considerations will therefore focus on this method.

## 4 Efficient Implementation

In this section, we discuss aspects of the algorithmic realisation
of the deconvolution methods under consideration.
Wiener filter, RL, RRRL and the combined WR^{3}L
were implemented in three scenarios from general 2D blur down to
1D linear motion blur.

For WYYZ and KF only general 2D implementations were done to allow comparisons. Our strategies to derive more efficient 1D algorithms from the Wiener and RL filter family cannot be transferred straightforward to the WYYZ and KF algorithms due to the different structure of their iterations, in which the 2D regularisation is intertwined with the 2D Fourier iteration step.

While our focus is on single-threaded CPU computation, we did also multi-threaded CPU and GPU implementations of some variants, which will also shortly be introduced.

### 4.1 Numerics

The numerical realisation of all deconvolution methods discussed is mostly straightforward, specifications being necessary for the Fourier transform, convolution operations and the diffusion terms. In the case of RRRL, also the computation of the information divergence (7) deserves special attention.

#### Fourier transform.

For full control over implementation details, we used our own implementation of the Fast Fourier Transform (FFT) for power-of-two image dimensions, based on (Bronstein-Book79, , Par. 4.4.1.3), with adaptations to the real-valuedness of image data. The values of the complex exponential were precomputed once for all Fourier transforms during the program run.

#### Convolution.

Here we considered either realisations via the Fourier domain, or spatial-domain convolution by direct summation.

#### Diffusion terms.

These were discretised using standard central difference approximations.

#### Function evaluations.

In the RRRL iteration, the information divergence (7) is expensive to compute directly due to the logarithms. For this reason, values of in the range were stored in a lookup table. Values of were then calculated by , using a linear approximation for in . (In our test examples, the latter approximation did in fact not occur.)

In the KF(-S) algorithm the solution of a polynomial equation can be replaced by a lookup table. In Krishnan-nips09 (), speedups from about (for images) to (for large images) were obtained in this way. In our implementation, we stick to the slower analytic solution. However, by comparing with the WYYZ algorithm the possible speedup can be estimated.

### 4.2 Boundary Treatment

Since blur operations involve considerable transport of information across the image boundary, artifacts near the image boundary are an issue in deconvolution. These artifacts are especially strong when simple boundary conditions like zero-padding, constant or periodic continuation introduce massive model violations. Although advanced boundary treatment schemes are available that allow to reduce these artifacts considerably, see e.g. Chan-IJIST05 (), these are computationally expensive.

For performance reasons, we decide to use constant continuation for spatial convolution operations, and the natural periodic continuation for Fourier-based operations, and tolerate the resulting artifacts. Note that in the application setting of Figs. 3–5 where objects are photographed in front of uniform backgrounds, almost no artifacts are introduced anyway.

### 4.3 Algorithmic Optimisations

Concluding from the qualitative tests, it is desirable to perform Wiener filtering plus at least five iterations of RRRL for a practically useful deconvolution in the real-time environment. Concerning the image size, we restrict ourselves to powers of two in order to profit most from the Fast Fourier Transform (FFT). Depending on the exact algorithmic variant this restriction can be eased in application.

In our examples, an image size of is appropriate to include a suitable region of interest for object detection or localisation. For more limited applicability, also images of size may be considered.

We discuss now algorithmic optimisations that can be used to achieve the desired deconvolution in the so defined setting. We will distinguish herein between uniform linear motion and non-uniform linear motion, and compare both to a general 2D space-invariant blur situation.

#### General 2D setting.

It is clear that in the general 2D blur scenario, Wiener filtering requires a two-dimensional Fourier transform and inverse transform, which are implemented by FFT.

In the iterative algorithms (RL, RRRL), convolutions can either be computed by direct numerical integration in the spatial domain, or again via the Fourier domain. In the latter case, once more a 2D FFT is necessary. As the complexity of the spatial domain convolution is linear in the image size and PSF size, while the FFT implementation is log-linear in the image size only, spatial domain convolution may be superior to FFT for very small PSF but FFT will dominate for large convolution kernels.

#### Non-uniform linear motion.

In this case the blur is characterised by a point-spread function with 1D support. Assuming that it is aligned with a coordinate axis of the imaging device (say the vertical one), the Wiener filter needs to applied only in columns. Thus, a 1D FFT is sufficient, saving up to half the numerical cost of the Fourier transform.

Equally, only 1D convolutions are needed in the iterative method. If these are carried out in 1D only, the program logic is slightly simplified compared to the 2D case but the number of multiplications and additions in evaluating the integral will be comparable to that of a 2D implementation for equal PSF size. In contrast, a Fourier-based realisation will again profit from using 1D instead of 2D FFT.

#### Uniform linear motion.

The point-spread function in this case still is 1D, but it takes the special form of a box filter, i.e. it is constant throughout its support. (This applies exactly when the blur length is integer. In the case of a non-integer blur length, one has to allow for single pixels of smaller weight at one or both ends of the kernel.)

For the Wiener filter, this setting does not offer any specific advantage over the general 1D case. The same holds true for the convolutions in the iterative method when computed via the Fourier domain.

However, the spatial domain convolution can be made dramatically more efficient in this case by an efficient box filtering algorithm McDonnell-CGIP81 (). In the case of a vertical linear blur this algorithm works as follows: In each column, the convolution of the first pixel is computed by direct summation (complexity linear w.r.t. the kernel size). Then the sliding window is shifted one pixel at a time, such that one pixel enters the window while one pixel leaves it. So the sum inside the window is updated in constant time by adding one grey-value and subtracting another one. If non-integer blur length is admitted, the update step involves at most two additions and two subtractions. The overall complexity of the update part is therefore linear in the image size, and the total complexity of the algorithm with blur kernel size on an image of columns and rows amounts to .

### 4.4 Implementation and Technical Optimisation

All CPU programs were written in C, and compiled using gcc 4.6 with optimisation level O2. With O3 some run times were further reduced by a few percent while others even increased slightly. Also more specific compiler optimisation settings did not significantly improve performance. However, efficiency was improved by standard source code optimisation techniques like inlining, loop merging, blocking (for cache optimisation). Wherever possible, array access operations were avoided using auxiliary variables. One-dimensional arrays were used to store image data.

### 4.5 Parallelisation

While our focus is on single-threaded CPU computation, we consider two exemplary parallel implementations to assess the possible gains by parallelisation on multicore CPUs and GPUs.

#### Multi-threaded CPU implementation, 1D.

We implemented the version of RRRL for general 1D motion blur for multi-threaded computation on a multi-core CPU using the standard pthreads library. The Wiener filter was not parallelised in this setting because it contributes little to the overall run-time.

The sharpening term of the RRRL iteration nicely decomposes in this case in columns parallel to the point-spread function, and is easily parallelised in this way. In contrast, the smoothing term is still made up by 2D differential expressions, so the columns interact with each other in the derivative computation. Since the overall computational cost of the sharpening term equals several times that of the smoothing term, an almost balanced workload is achieved on CPUs with 4…8 cores if the regulariser is computed in the parent thread, while the sharpening term is distributed to the remaining cores in parallel threads.

To reduce the overhead of creating and terminating threads, the parallel worker threads for the sharpening term are started before the first RRRL iteration, and not terminated before all iterations are done. Mutexes are used to synchronise the updates between worker threads and parent thread in each iteration.

#### GPU implementation, 2D Fourier.

We also implemented the WR^{3}L algorithm for
general 2D point-spread functions using the CUDA 4.0 framework
for Nvidia GPUs. Both the Wiener filter and the convolutions
in the RRRL iteration are performed using CUDA’s built-in
Fast Fourier transforms. Efficiency of parallel access to
some data (namely, the point-spread function) was improved by
using texture memory.

## 5 Experiments

We measure the performance of deconvolution algorithms suitable for the three scenarios described in the previous section. For this, we used the gettimeofday() function since it states real-world run times within the context of the running system instead of pure process time, and easily allows measurements for any particular portion of an algorithm. The downside is that due to other activities (system processes etc.) run times will display considerable variation – we used a standard Linux system without any specific real-time scheduler. It is therefore necessary to consider statistics over many program runs. We report therefore averages, standard deviations and extremes from 100 subsequent program runs.

Not included in our measurements is the time for loading and storing images. The rationale for this is that in time-critical industrial applications, image data would anyway be transferred into the memory directly from the imaging device.

Also not included is the time for precomputing auxiliary data for the Fourier transform. This is based on the assumption that this can be done once for a large number of equally sized images to be processed in an application context.

### 5.1 Single-Threaded Wiener+RRRL

#### Comparison across 1D/2D blur scenarios.

We choose as our test case the image from Fig. 4 ( pixels) with the uniform linear motion blur kernel of length , for which all algorithms could equally be applied. We remark that for the algorithms chosen (box filter, Fourier convolution/Wiener filtering) the computational cost does not or only slightly depend on the size of the blur kernel. In all cases, we compute Wiener filtering plus five iterations of RRRL.

Run times were measured for single-threaded computation (thus using one core) on an AMD Phenom II X6 1100T running at 3.3 GHz. Statistics for the Wiener step, single RRRL iterations, total run time of the five RRRL iterations and overall time are given in Table 2. Note that the Fourier transform of the point-spread function is computed once in the Wiener filter step and re-used if necessary in the RRRL iterations.

From this table it is evident that the proposed filtering procedure can be carried out reliably in less than 50 ms on grey-value images with uniform linear motion blur. For the more general blur scenarios, the computational expense still prevents real-time performance in the single-threaded setting, although even here favourable run-times are achieved.

#### Different image sizes.

In Table 3 run times for the fastest algorithm (1D Wiener, RRRL with box filter) are shown for three different image sizes, all with the same blur kernel size. As can be seen, the measured run times for the RRRL iteration scale by a factor for each quadrupling of the image size, reflecting the essentially linear complexity. In contrast, averaged run times for the Wiener filter step (actually log-linear) multiply only by for each scale step.

This effect can be traced back to outliers with large execution times that occur in the Wiener filter step of some program runs and sometimes also in the first RRRL iterations. These outliers seem to be caused by cache effects and amount to an almost constant-time overhead on the average times, and thus the misleading impression of a sub-linear scaling of the Wiener filter step. Indeed, when for testing the Wiener filtering step is performed twice, the second step displays less outliers and thus a lower average. Similarly, the first RRRL iteration in each program run is slightly slower than the subsequent iterations (about 0.5 ms for , about 2 ms for ). The influence of the outliers is particularly pronounced for images, as indicated by the large standard deviations. It is evident that measurements for these short execution times are less reliable than for the larger images.

### 5.2 Comparison to WYYZ and KF

We test the KF and WYYZ algorithms on the same
test case as the WR^{3}L algorithms.
Results are shown in Table 4.
As these algorithms are implemented in the full 2D setting,
the proper WR^{3}L value for comparison comes from
the last row of Table 2.

The KF algorithm () with analytic solver is tested in the parameter regime recommended in Krishnan-nips09 () ( scaling in powers of from to , one inner iteration per level) which totals to 6 iterations. The resulting run-time is in good agreement with the original value of 0.7 s from Krishnan-nips09 () when compensating for the different CPU clocks and image dimensions.

For the WYYZ algorithm, we use the same parameter regime. Since WYYZ differs from KF just by having a cheap shrinkage step where the expensive polynomial equation is solved in KF, the so measured run times give a reliable lower bound to the run time that KF could reach with the lookup table instead of the polynomial solver. Note, however, that Wang-SIIMS08 () suggests finer scaling steps for and substantially more iterations per level.

It can be seen from these figures that WR^{3}L,
albeit not the fastest algorithm, is competitive in terms of
speed.

### 5.3 Parallel Implementations of Wiener+RRRL

#### Multicore CPU computation.

The gap between the run-times displayed in Table 2 for the general 1D setting for images of size and our goal of real-time computation is not very large. Therefore it is interesting to see whether this gap could be bridged by multithreaded computation on a recent multicore CPU.

We tested this on the Phenom X6 hexacore machine described in the previous subsection. In our parallelised implementation of the RRRL component the regularisation was computed in the main thread, and the sharpening terms distributed to five threads such that all six cores could be used. Run-time results are shown in the first row of Table 5. In our test setting with just five iterations, the effective speed-up factor for the RRRL computation is only about .

A closer look at the run-times of single iterations reveals that typically the first two to three iterations are slower than the following ones. In fact, the averages and standard deviations for the five iterations are , , , , and , respectively. It appears that the overhead caused by creating and managing threads chips away a lot of the possible gain in performance for such short computational tasks. Indeed, when running much more iterations, the average computation time per RRRL iteration stabilises in the range of . The workload is well balanced between cores, as indicated by the fact that top consistently shows CPU loads of about for the process.

In spite of the modest speed-up factor, the multicore computation
achieves the goal of performing WR^{3}L with 5 iterations
on pixels within .

#### GPU computation.

We tested our GPU implementation
of WR^{3}L for general 2D point-spread functions on an nVidia
GT-440 graphics card featuring 96 cores at a clock rate of
.
The net computation times of the Wiener filter and RRRL iterations, and the
entire WR^{3}L are shown in the second row of
Table 5.

Firstly, standard deviations of these figures are much lower than for the CPU computations. This can be attributed to the fact that there are almost no other processes in the system that exploit the computing capabilities of the graphics card, and could thus interfere with the computation.

Secondly, the time measurements show that GPU computation enables the general 2D deconvolution of -pixel images to be performed reliably under our real-time constraint (); even the computation for -pixel images appears feasible.

Real-time-capable GPU deconvolution for general 2D blurs has already been devised in Klosowski-spie11 (), compare also the non-blind deconvolution part in the framework of Hirsch-iccv11 (). Both papers use the KF algorithm Krishnan-nips09 () with the same parameter regime that we have used in Section 3. We juxtapose therefore our speed measurements with those reported in Klosowski-spie11 ().

The average total run-time from Table 5 amounts to about 9.16 megapixels per second of single-channel computation, i.e. about 58.9 pixels per core per clock cycles. Since our implementation is not applicable to arbitrary image sizes, the test case from Klosowski-spie11 () is the one where the image size optimally fits their algorithm. For this scenario, Klosowski-spie11 () reports about 13 megapixels per second (single-channel). The value refers to an nVidia GTX 260 graphics card with 192 cores (according to nVidia’s official specification; Klosowski-spie11 () states 216) and a clock, which means that about 54.5 pixels are processed per core per clock cycles. Given the neglection of other influence factors, this can only serve as a rough comparison, but it demonstrates that the computational efficiency of both approaches is in the same range.

With its slight advantage in restoration quality in a number of settings
demonstrated in Section 3, WR^{3}L lends itself
therefore as an attractive candidate also for general 2D deconvolution
under real-time conditions on the GPU.

Nevertheless, some words of care must be said. In tune with our decision to exclude loading and storing images from the time measurements, the transfer between main memory and graphics card memory is not contained in the mentioned run-time figures. However, including these data transfers does not substantially change the picture. In our example, these transfers total to about .

Let us also revisit our other exclusion, precomputation of auxiliary data for Fourier transforms. In our CPU implementation, inclusion of these computations would make the computations more expensive, but not dramatically so. In the GPU setting with CUDA’s built-in Fourier transform, the precomputation of Fourier transform plans is fairly expensive. We measured run-times of about for this step; however, it is open to some question whether this time is exaggerated by including some initialisation overhead. Of course, for the price of the expensive Fourier plan construction we get a degree of generality that is not with our specialised CPU implementation. Very likely, using a more flexible off-the-shelf Fourier transform package on the CPU would lead to a similar shift in balance between precomputation and actual image processing. At any rate, for the practical efficiency of the proposed GPU-based deconvolution it is crucial to ensure that precomputed data are retained and used for multiple images.

## 6 Summary and Outlook

We have demonstrated the design of an efficient and robust deconvolution algorithm for known space-invariant blur. By combining Wiener filtering as a first step with a small number of iterations of robust and regularised Richardson-Lucy deconvolution Welk-tr10 () a reasonable deconvolution quality is achieved at a fairly low computational expense. We improved this basic method by algorithmic optimisations for specific blur scenarios, in particular fast box filtering McDonnell-CGIP81 () for uniform linear motion blur. In this case, real-time performance was reached for moderate image sizes in single-threaded CPU computation. To our knowledge, there has been no comparable framework so far for CPU-based real-time image deconvolution, even if it is only in a specific setting.

Exemplary implementations demonstrated also that comparable real-time performance can also be achieved for general 1D blur by multi-threaded computation on a contemporary multi-core CPU, and for 2D blur using GPU computation.

Ongoing work is directed at further specific blur settings as well as a more systematic investigation of efficient parallel implementations. Also, the comparison with existing real-time-capable GPU-based deconvolution in terms of restoration quality and robustness deserves further consideration. We expect that by these efforts the applicability of deconvolution in automatisation, quality inspection and further application fields will be significantly improved.

## Acknowledgements

It is gratefully acknowledged that work on this project was funded by Standortagentur Tirol, Innsbruck, and done in cooperation with Datacon GmbH, Radfeld, and WESTCAM Projektmanagement GmbH, Mils.

## References

- (1) Almeida, M.S.C., Almeida, L.B.: Blind and semi-blind deblurring of natural images. IEEE Transactions on Image Processing 19(1), 36–52 (2010)
- (2) Bar, L., Sochen, N., Kiryati, N.: Image deblurring in the presence of salt-and-pepper noise. In: R. Kimmel, N. Sochen, J. Weickert (eds.) Scale Space and PDE Methods in Computer Vision, Lecture Notes in Computer Science, vol. 3459, pp. 107–118. Springer, Berlin (2005)
- (3) Bronstein, I.N., Semendyaev, K.A.: Taschenbuch der Mathematik. Teubner, Leipzig (1979)
- (4) Chan, T.F., Yip, A.M., Park, F.E.: Simultaneous total variation image inpainting and blind deconvolution. International Journal of Imaging Systems and Technology 15(1), 92–102 (2005)
- (5) Colton, D., Kress, R.: Inverse Acoustic and Electromagnetic Scattering Theory. Springer, Berlin (1992)
- (6) Dey, N., Blanc-Feraud, L., Zimmer, C., Roux, P., Kam, Z., Olivo-Marin, J.C., Zerubia, J.: Richardson-Lucy algorithm with total variation regularization for 3D confocal microscope deconvolution. Microscopy Research and Technique 69, 260–266 (2006)
- (7) Elhayek, A., Welk, M., Weickert, J.: Simultaneous interpolation and deconvolution model for the 3-D reconstruction of cell images. In: R. Mester, M. Felsberg (eds.) Pattern Recognition, Lecture Notes in Computer Science, vol. 6835, pp. 316–325. Springer, Berlin (2011)
- (8) Hirsch, M., Schuler, C.J., Harmeling, S., Schölkopf, B.: Fast removal of non-uniform camera shake. In: Proc. IEEE International Conference on Computer Vision, pp. 463–470. Barcelona (2011)
- (9) Huber, P.J.: Robust Statistics. Wiley, New York (1981)
- (10) Klosowski, J.T., Krishnan, S.: Real-time image deconvolution on the gpu. In: J.D. Owens, I.J. Lin, Y.J. Zhang, G.B. Beretta (eds.) Parallel Processing for Imaging Applications, Proceedings of SPIE, vol. 7872, pp. 1033–1041. SPIE Press, Bellingham (2011)
- (11) Krishnan, D., Fergus, R.: Fast image deconvolution using hyper-laplacian priors. In: Advances in Neural Information Processing Systems, pp. 1033–1041 (2009)
- (12) Lucy, L.B.: An iterative technique for the rectification of observed distributions. The Astronomical Journal 79(6), 745–754 (1974)
- (13) McDonnell, M.J.: Box-filtering techniques. Computer Graphics and Image Processing 17(1), 65–70 (1981)
- (14) Richardson, W.H.: Bayesian-based iterative method of image restoration. Journal of the Optical Society of America 62(1), 55–59 (1972)
- (15) Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)
- (16) Snyder, D., Schulz, T.J., O’Sullivan, J.A.: Deblurring subject to nonnegativity constraints. IEEE Transactions on Image Processing 40(5), 1143–1150 (1992)
- (17) van Cittert, P.H.: Zum Einfluß der Spaltbreite auf die Intensitätsverteilung in Spektrallinien. II. Zeitschrift für Physik 65, 298–308 (1933)
- (18) Wang, Y., Yang, J., Yin, W., Zhang, Y.: A new alternating minimization algorithm for total variation image reconstruction. SIAM Journal on Imaging Sciences 1(3), 248–272 (2008)
- (19) Welk, M.: Robust variational approaches to positivity-constrained image deconvolution. Tech. Rep. 261, Department of Mathematics, Saarland University, Saarbrücken, Germany (2010)
- (20) Welk, M., Theis, D., Brox, T., Weickert, J.: PDE-based deconvolution with forward-backward diffusivities and diffusion tensors. In: R. Kimmel, N. Sochen, J. Weickert (eds.) Scale Space and PDE Methods in Computer Vision, Lecture Notes in Computer Science, vol. 3459, pp. 585–597. Springer, Berlin (2005)
- (21) Welk, M., Theis, D., Weickert, J.: Variational deblurring of images with uncertain and spatially variant blurs. In: W. Kropatsch, R. Sablatnig, A. Hanbury (eds.) Pattern Recognition, Lecture Notes in Computer Science, vol. 3663, pp. 485–492. Springer, Berlin (2005)
- (22) Wiener, N.: Extrapolation, Interpolation and Smoothing of Stationary Time Series with Engineering Applications. MIT Press, Cambridge, MA (1949)
- (23) You, Y.L., Kaveh, M.: Anisotropic blind image restoration. In: Proc. 1996 IEEE International Conference on Image Processing, vol. 2, pp. 461–464. Lausanne, Switzerland (1996)