# Maximum Likelihood Mosaics

###### Abstract

The majority of the approaches to the automatic recovery of a panoramic image from a set of partial views are suboptimal in the sense that the input images are aligned, or registered, pair by pair, e.g., consecutive frames of a video clip. These approaches lead to propagation errors that may be very severe, particularly when dealing with videos that show the same region at disjoint time intervals. Although some authors have proposed a post-processing step to reduce the registration errors in these situations, there have not been attempts to compute the optimal solution, i.e., the registrations leading to the panorama that best matches the entire set of partial views. This is our goal. In this paper, we use a generative model for the partial views of the panorama and develop an algorithm to compute in an efficient way the Maximum Likelihood estimate of all the unknowns involved: the parameters describing the alignment of all the images and the panorama itself.

Image alignment/registration, mosaics, panoramic imaging, featureless methods, Maximum Likelihood.

Permission to publish this abstract separately is granted.

## I Introduction

Very good paper to read. Technically sound, very clear, and very good results. My only minor comment is that the statement that no attempt has been made to find the optimal solution is not true. There are many papers tackling this issue, e.g. work of Pollefeys, zisserman’s group at oxford, some work at INRIA by various groups, the work of Peleg et al., etc.

In this paper, we address the problem of recovering, in an automatic way, a panoramic image, or a mosaic, from a set of uncalibrated partial views, e.g., a set of video frames. Modern digital video systems demand efficient solutions for this problem, e.g., for image stabilization [5, 12] and content-based representations [1]. Other application fields include virtual reality and remote sensing. The key step to the success of the automatic mosaic building is the accurate registration, or alignment, of the input images.

### I-a Related work

Although some authors have approached the registration problem using classical signal processing techniques, such as Fourier transforms [15], or current image analysis tools, such as integral projections [10], the majority of the papers in the literature are mostly distinguished by either requiring a low-level pre-processing step (feature-based methods) or attempting to register the images directly from their intensity levels (featureless methods).

Feature-based methods, e.g., [6], align the images by first detecting and matching a set of pointwise features. Since reliable feature points must correspond to sharp intensity corners [16, 3], this first step is hard to accomplish in a fully automatic way when processing real videos, particularly when the images are noisy, have low texture, or exhibit a small overlap among them.

In opposition, featureless methods are optimal, in the sense that they estimate the registration parameters by minimizing the difference between the image intensities in a large region, thus leading to more robust solutions to the registration of a pair of views, e.g., [11, 13]. However, when building a panorama from a large set of images, practitioners usually register them sequentially, one at a time. This leads to propagation errors that may be become visually noticeable if non-consecutive images cover the same region of the panorama, which is common in applications such as seabed mapping. Although some authors proposed to post-process the registration parameters to deal with this problem [11, 7], there have not been attempts to generalize the highly successful featureless methods to the multi-frame case.

### I-B Proposed approach: featureless global estimation

The robustness of the featureless approaches to the registration of two views motivated us to develop a featureless method to align a larger set of frames. However, it is not obvious how the two-frame cost function, usually the sum of the image square differences [11, 13], should be generalized to the multi-frame case. We were able to derive the appropriate cost function, which is an original contribution of this paper, by including as unknown, jointly with the registration parameters, the panoramic image itself.

Our approach in this paper is then to formulate the automatic recovery of mosaics from a set partial views, as a classical parameter estimation problem. The input images are modelled as noisy observations of limited regions of the unknown panorama. Naturally, since the images are uncalibrated, the problem includes as unknowns the parameters describing the registration, or alignment, of the entire set of input images. We then use Maximum Likelihood (ML) estimation. To minimize the ML cost with respect to the large set of unknowns, we propose an efficient method. First, we derive the closed-form solution for the estimate of the panorama in terms of the other unknowns (the registration parameters). Then, we plug-in the estimate of the panorama into the ML cost, obtaining an error function that depends on the registration parameters alone. This error function is a weighted sum of the square differences between all possible pairs of input images. We derive a gradient-descent algorithm to minimize this cost.

### I-C Paper organization

The remaining of the paper is organized as follows. In section II, we formulate the registration of multiple images as a classical estimation problem. Section III deals with ML estimation for this problem, i.e., it introduces the Maximum Likelihood Mosaics (MLM) approach. In section IV, we develop MLM, using the simpler case of registering a pair of images. We contrast MLM with minimizing the registration error over a fixed window, as usually done in current featureless approaches. Section V generalizes MLM to the multi-frame registration. In section VI, we derive the gradient-descent algorithm to minimize the ML cost. Section VII describes experiments and section VIII concludes the paper. Preliminary versions of parts of this work are in [13, 14].

## Ii Problem Formulation

In this section, we develop a generative model for the partial views of an unknown panorama, and use ML to derive the estimation criterion that will allow us to recover the observed panorama, as well as the registration parameters, i.e., the viewing positions.

### Ii-a Generative model

We model each pixel of each image , as a noisy sample of the panorama . For simplicity, we consider the image domain to be the entire plane and, to take care of the limited field of view, we define a window as in the region observed in the images and in the regions outside the camera field of view. The observation model is then

(1) |

where denotes the noise, assumed i.i.d. zero-mean Gaussian, are the image coordinates , expressed in the coordinate system of the generic image , and are the corresponding coordinates of the panoramic image , expressed in its own coordinate system (which we will refer to as the reference coordinate system). Image models related to (1) have been used in the context of segmenting and tracking moving objects in video sequences [2, 8].

The reference coordinate system and the coordinate system of any of the images are related by a generic parametric mapping

(2) |

The parameter vector in (2) determines thus the mapping between each pixel of the panorama, with coordinates , expressed in the reference coordinate system, with the corresponding pixel of image , with coordinates . Common parameterizations include translation (2 degrees of freedom (dof)), rotation (1 dof), rigid motion (3 dof), translation+rotation+zoom (4 dof), affine (6 dof), and the projective, or homography (8 dof), see, e.g., [11, 9]. Although our derivations are intentionally left fully generic, in the experiments, we have used the affine mapping.

### Ii-B Estimation criterion

Given a set of images, , our goal is to recover all the unknowns involved: the panorama and the set of parameter vectors that define the viewing positions. We use ML. From the observation model (1), after simple manipulations, we express the symmetric of the log-likelihood function as

where is the number of pixels in each image and is the variance of the observation noise.

## Iii Maximum Likelihood Mosaics

To compute the ML estimate of all the unknowns, i.e., to carry out the minimization of the ML cost, given by the symmetric log-likelihood (II-B), with respect to (wrt) , we start by noticing that the estimate of the panorama can be expressed in closed-form as a function of the remaining unknowns.

We derive the expression for the ML estimate of the panorama by minimizing (II-B) wrt a generic pixel value . By making zero the derivative of (II-B) wrt , the estimate at pixel is easily obtained as a function of the set of unknown registration parameters, which we will compactly denote by :

(3) |

This expression shows that the estimate of the intensity of each pixel of is given by the average of the intensities of the corresponding pixels of all the input images that captured , i.e., all the images for which .

## Iv Two-frame Registration

### Iv-a Cost-function

### Iv-B Minimization algorithm

It is now clear that it is not possible to evaluate the error in (6) for every pair of values of and . This is the main problem of using a fixed window —registration is only possible when the overlapping region contains . This limitation has a particular impact on the behavior of iterative registration algorithms. In fact, to avoid an exhaustive search, e.g., block matching, the minimization of in (6) is usually performed by using gradient-based algorithms that iteratively optimize . Obviously, at every iteration of the algorithm, the overlapping region (which depends on the current estimate of ) must contain , since the error , as well as its gradient, depend on a sum over .

As illustrated by the first experiment of section VII, the minimum overlap requirement makes hard the automatic registration of arbitrary images. In fact, by specifying a priori a fixed window , we can not cope with all possible situations. If one specifies a small , it may fit into the true overlapping region, but the estimation error will be large due to the smooth minimum of . On the other hand, if one specifies a large , it is not possible to register images which have a small overlap. Our goal here is to develop a method to perform the registration in the situations when the overlap between the images is not known a priori.

Instead of using a fixed window , we propose an adaptive window , defined as the largest region for which it is possible to evaluate the error as defined in (6). In our iterative optimization, the estimate is computed by refining a previous estimate , i.e., . The update is estimated by minimizing the registration error over the adaptive window ,

(4) |

where is as defined in (6). The adaptive window , whose size and shape depend on the current estimate of the motion parameter vector, is the overlapping region between the image and the image registered according to , .

To compute the update , we develop an adaptive-window-based Gauss-Newton method. Similar methods have been used to minimize (6), i.e., to register images using a fixed window, e.g., [11]. In this method, is approximated by its first-order truncated Taylor series expansion, . Using this approximation in (4), and making zero the gradient of the cost function, we get as the solution of the linear system

(5) |

where we omit the dependency of on and for compactness. From the definition of in (6), we see that in (5) is computed from the image gradient, . The initial guess for is such that is the identity mapping, which corresponds to initializing the algorithm with zero displacement between the images, thus the initial window is the entire image region.

The Gauss-Newton method just described assumes the motion is small. To cope with large displacements, we use a multiresolution scheme. In such scheme, the iterative estimation algorithm is first used in a lower resolution versions of the input images, until a certain stopping criterium is reached . The resulting parameter estimates are then used as initial guesses for the parameters in the the next (higher) resolution and the process is repeated until the original images are used.

Among the number of valid stopping criteria, we combine the two most obvious: i) the maximum number of iterations; and ii) the minimum value of the norm of the update vector . Using only ii) is not adequate in the low resolution levels, where it is only necessary to make a coarse estimation of the parameters. In these levels, convergence may be slow and the overall performance of the algorithm is not affected if we simply perform a fixed number of iterations.

### Iv-C Impact of the window

The motion of the brightness pattern between two images and is described by the parametric mapping that maps each pixel of , with coordinates , into the corresponding pixel of . Featureless approaches to image registration estimate the global motion parameter vector by minimizing the error

(6) |

where the sum is over a fixed, pre-specified, rectangular window . When the overlap between the images is large, the window is simply chosen as a large rectangle in the interior of the image(s). However, when the overlap is small, it is difficult to select a priori an appropriate window , due to two reasons. First, since it is not known beforehand where the overlapping region is, its hard to choose a location for the window . Second, imposing a priori a small window, leads to less accurate estimates of because not only the minimum of in (6) becomes less sharp but also the local minima phenomena become more severe.

To illustrate the impact of the size of the window, we represent in Fig. 1 the typical evolution of in (6), as a function of a single motion parameter , for several sizes of . Naturally, as anticipated above, the larger is , the smaller is the domain in which can be evaluated. The several local minima and the smoothness of the minimum of at the true value in the top plots, obtained with relatively small windows, contrast with the single sharp minimum of the bottom-right plot, obtained with the largest window (note that the vertical scale is different from plot to plot).

## V Multi-frame Registration

### V-a Estimate of the registration parameters

Replacing the ML estimate of the panorama, given by (3), in the symmetric log-likelihood (II-B), we express this ML cost as a function of the unknown registration parameters alone. After algebraic manipulations, we get:

where is the error between the co-registered images and ,

(8) |

and is a weight that counts the number of images that have captured the pixel of the panorama, according to the registration parameters in , i.e.,

(9) |

By discarding from (V-A) the constant terms, i.e., the terms that do not depend on the unknown registration parameters , we conclude that the ML estimate for the problem of global multi-frame registration, is equivalent to the following minimization:

(10) |

For simplicity, when deriving (10) from (V-A), the sums were interchanged and the spatial region of summation was re-defined to take care of the windows in (V-A), i.e., in (10) is the region where the images and overlap,

(11) |

## Vi MLM Algorithm

Our algorithm to the minimization of the ML cost (10) uses an iterative scheme inspired in the common approaches to the two-frame problem [11, 13]. In each step, the algorithm updates a current estimate that we denote by .

### Vi-a Iterative minimization of the ML cost

Instead of updating the entire set of parameters in a single step, which would be computationally complex, we propose a coordinatewise minimization: we update each vector at a time, keeping fixed the remaining registration parameters . The update is , where is obtained from (10), after discarding the terms that do not depend on :

(12) |

To obtain a closed-form solution for the update , we approximate the error by its first-order Taylor series expansion,

From the definition of in (8), the gradient in the Taylor series expansion is easily computed in terms of the spatial gradient of image . Furthermore, that gradient does not depend on , thus we will denote it more compactly by ,

(13) | |||||

(14) |

### Vi-B Interpretation in terms of current algorithms

Since the iterations in standard featureless two-frame alignment algorithms [11, 13] also lead to a system like (15), we now interpret our solution (15,16,17) in terms of those approaches. Define as the difference between image and the previous estimate of the panorama, obtained with the registration parameters ,

(18) |

Since the gradient of this error wrt is equal to the one defined in (13), we can re-write expressions (16,17) in terms of ,

(19) | |||||

(20) |

Expressions (19,20) are equal to the ones that arise from aligning the previous estimate of the panorama with image , by using standard featureless methods, see e.g., [11, 13] or [3]. We thus conclude that our global approach lead to an algorithm that refines the estimate of the registration parameters of each image by using the methodology developed to register a single pair of images.

### Vi-C Convergence—initialization and multiresolution

Our algorithm starts by aligning the images sequentially, using the standard two-frame approach [11, 13]. Then, we compute an initial estimate of the panorama by using (3). After this, we cyclically refine the registrations parameters of each image. The stopping criterion may either be the error below a small threshold or reaching a maximum number of iterations.

Since the truncated Taylor series is a good approximation only when the vector is close to its initial value , estimating the update from (15,16,17) leads to the convergence to the globally optimal ML estimate, only when the initial estimate is close enough to it. However, in practice, e.g., in the first experiment described below, it is common that the initial estimate of the panorama is very rough, due to the propagation of (two-frame based) registration errors. To cope with these situations, we use a coarse-to-fine approach similar to the one proposed in [4, 11]: the parameters are first estimated in the coarsest resolution level, then used as an initialization to the next finer level, until the full image resolution is attained. As illustrated in the following section, this multi-resolution approach succeeds in correcting large miss-registrations.

## Vii Experiments

We describe two experiments. The first experiment compares our global approach with the current sequential registration methods. In the second experiment, we illustrate with automatic mosaic building in a seabed mapping context.

### Vii-a Adaptive window versus fixed window

To illustrate how our method performs better than the usual fixed window method, we synthesized input images by cropping a real photography and adding noise. This corresponds to a simple translational motion model, which suffices to show the advantage of using our adaptive window method.

In Fig. 2, the overlap between input images is large. The left image of Fig. 2 shows the failure of the algorithm with a fixed window of size 64. Our algorithm and the one with a fixed window of size 128 both lead to good results, see the middle and right images of Fig. 2. Note that, although these two images are visually indistinguishable, the estimate of the global motion provided by our algorithm is more accurate because it minimizes the error over the largest possible window.

In Fig. 3, the overlap between input images is small, thus it is impossible to use a fixed window of a large size. When using a fixed window of size 64, the usual algorithm fails, see the left image of Fig. 3. The right image of Fig. 3 shows that our algorithm succeeds in this challenging situation.

Finally, Fig. 4 shows a mosaic obtained by pairwise registering, sequentially, a set of input images.

### Vii-B MLM versus sequential alignment

To have an exact knowledge of the ground truth, we “synthesized” the input images by cropping a real photo and adding noise. In Fig. 5, we represent the evolution of the standard two-frame featureless sequential alignment (e.g., [11, 13]) of those images. Note that the fourth image is miss-aligned and how that error propagates to the alignment of the remaining images. The (highly incorrect) panorama this way obtained, see the bottom right image of Fig. 5, was then used as the initialization for the global method we propose in this paper. After few iterations, our algorithm converged to the panoramic image shown in Fig. 6, which is visually indistinguishable from the ground truth image.

### Vii-C Underwater mosaic for seabed mapping

As a final example, we illustrate our method with automatic mosaic construction from video images, captured by an underwater camera in the sea. Although underwater images are particularly difficult to align, due to the absence of salient features, the mosaic recovered by our algorithm is visually correct, see Fig. 8.

As a final example, we use images captured by an underwater camera in the sea. Fig. 7 shows four of those images. The low texture and the almost total absence of salient feature points make these images particularly challenging. Note also that although the overlapping region between images is not very small, its shape is not rectangular. In this situation, the traditional fixed window method would use a small rectangular window inside the overlapping region, thus failing to use all the information available.

## Viii Conclusion

We proposed a new method to build a panoramic image from a set of partial views. Rather than composing the input images in an incremental way, our approach seeks the global solution to the estimation problem, i.e., it computes the panorama that best matches all the partial observations. To minimize the global cost, we derived an efficient gradient descent algorithm that generalizes the current most robust two-frame featureless registration approaches.

## References

- [1] P. Aguiar, R. Jasinschi, J. Moura, and C. Pluempitiwiriyawej, “Content-based image sequence representation,” in Digital Video Processing, T. Reed, Ed. CRC Press, 2004.
- [2] P. Aguiar and J. Moura, “Detecting and solving template ambiguities in motion segmentation,” in IEEE ICIP, 1997.
- [3] ——, “Image motion estimation – convergence and error analysis,” in IEEE ICIP, Greece, 2001.
- [4] J. Bergen, P. Anandan, K. Hanna, and R. Hingorani, “Hierarchical model-based motion estimation,” in European Conf. on Computer Vision, Santa Margherita Ligure, Italy, 1992.
- [5] F. Dufaux and J. Konrad, “Efficient, robust, and fast global motion estimation for video coding,” IEEE T-IP, 2000.
- [6] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
- [7] D. Hasler, L. Sbaiz, S. Ayer, and M. Vetterli, “From local to global pparameter estimation in panoramic photographic reconstruction,” in IEEE ICIP, Kobe, Japan, 1999.
- [8] N. Jojic and B. Frey, “Learning flexible sprites in video layers,” in IEEE Int. Conf. on CVPR, Hawaii, 2001.
- [9] D. Kim and K. Hong, “Fast global registration for image mosaicing,” in IEEE ICIP, Barcelona, Spain, 2003.
- [10] J. Lee and J. Ra, “Block motion estimation based on selective integral projections,” in IEEE ICIP, Rochester, USA, 2002.
- [11] S. Mann and R. Piccard, “Video orbits of the projective group: a simple approach to featureless estimation of parameters,” IEEE Trans. on Image Processing, 1997.
- [12] N. Petrovic, N. Jojic, and T. Huang, “Hierarchical video clustering,” in IEEE MMSP, Siena, Italy, 2004.
- [13] B. Pires and P. Aguiar, “Registration of images with small overlap,” in Proc. of the IEEE Multimedia Signal Processing Workshop, Siena, Italy, 2004.
- [14] ——, “Featureless global alignment of multiple images,” January 2005, submitted to IEEE Int. Conf. on Image Processing.
- [15] B. Reddy and B. Chattery, “An FFT-based technique for translation, rotation, and scale-invariant image registration,” IEEE Trans. on Image Processing, 1996.
- [16] J. Shi and C. Tomasi, “Good features to track,” in IEEE Int. Conf. on Computer Vision and Pattern Recognition, 1994.