Iterative Grassmannian Optimization for Robust Image Alignment

Abstract

Robust high-dimensional data processing has witnessed exciting development in recent years. Theoretical results have shown that it is possible, using convex programming, to decompose data into a low-rank component plus a sparse outlier component. This problem is also known as Robust PCA, and it has found application in many areas of computer vision. In image and video processing and face recognition, the opportunity to process massive image databases is emerging as people upload photo and video data online in unprecedented volumes. However, data quality and consistency are not controlled in any way, and the massiveness of the data poses a serious computational challenge. In this paper we present t-GRASTA, or “Transformed GRASTA (Grassmannian Robust Adaptive Subspace Tracking Algorithm)”. t-GRASTA iteratively performs incremental gradient descent constrained to the Grassmann manifold of subspaces in order to simultaneously estimate three components of a decomposition of a collection of images: a low-rank subspace, a sparse part of occlusions and foreground objects, and a transformation, such as a rotation or translation, of each image. We show that t-GRASTA is faster than state-of-the-art algorithms, has half their memory requirement, and can achieve alignment for face images as well as jittered camera surveillance images.

1 Introduction

With the explosion of image and video capture, both for surveillance and personal enjoyment, and the ease of putting these data online, we are seeing photo databases grow at unprecedented rates. On record we know that in July 2010, Facebook had 100 million photo uploads per day [?], and Instagram had a database of 400 million photos as of the end of 2011, with 60 uploads per second [?]; since then, both of these databases have certainly grown immensely. In 2010, there were at least an estimated 10,000 surveillance cameras in the city of Chicago, and in 2002 an estimated 500,000 in London [?].

These enormous collections pose both an opportunity and a challenge for image processing and face recognition: The opportunity is that with so much data, it should be possible to assist users in tagging photos, searching the image database, and detecting unusual activity or anomalies. The challenge is that the data are not controlled in any way so as to ensure data quality and consistency across photos, and the massiveness of the data poses a serious computational challenge.

In video surveillance, many recently proposed algorithms model the foreground and background separation problem as one of “Robust PCA”: decomposing the scene as the sum of a low-rank matrix of background, which represents the global appearance and illumination of the scene, and a sparse matrix of moving foreground objects [?]. These popular algorithms and models work very well for a stationary camera. However, in the case of camera jitter, the background is no longer low-rank, and this is problematic for Robust PCA methods [?]. Robustly and efficiently detecting moving objects from an unstable camera is a challenging problem, since we need to accurately estimate both the background and the transformation of each frame. Figure 1 shows that for a video sequence generated by a simulated unstable camera, GRASTA [?] (Grassmannian Robust Adaptive Subspace Tracking Algorithm) fails to do the separation, but the approach we propose here, t-GRASTA, can successfully separate the background and moving objects despite camera jitter.

Figure 1: Video background and foreground separation by t-GRASTA despite camera jitter. 1st row: misaligned video frames from simulated camera jitter; 2nd row: images aligned by t-GRASTA; 3rd row: background recovered by t-GRASTA; 4th row: foreground separated by t-GRASTA; 5th row: background recovered by GRASTA; 6th row: foreground separated by GRASTA.

Further recent work has extended the Robust PCA model to a “Transformed Low-Rank + Sparse” model for face images with occlusions that have come under transformations such as translations and rotations [?]. Without the transformations, this can be posed as a convex optimization problem, and therefore convex programming methods can be used to solve it. In RASL [?] (Robust Alignment by Sparse and Low-Rank decomposition), the authors posed the problem with transformations as well; though it is no longer convex, it can be linearized at each iteration, and the resulting scheme provably reaches a local minimum.

Though the convex programming methods used in [?] are polynomial in the size of the problem, that complexity can still be too demanding for very large databases of images. We propose Transformed GRASTA, or t-GRASTA for short, to tackle this optimization with an incremental or online optimization technique. The benefit of this approach is three-fold: First, it improves the speed of image alignment in both batch and online modes, as we show in Section 3. Second, the memory requirement is small, which makes alignment for very large databases realistic, since t-GRASTA only needs to maintain low-rank subspaces throughout the alignment process. Finally, the proposed online version of t-GRASTA allows alignment and occlusion removal to be performed on images as they are uploaded to the database, which is especially useful in video processing scenarios.

1.1 Robust Image Alignment

The problem of robust image alignment arises regularly in real data, as large illumination variations and gross pixel corruptions or partial occlusions often occur, such as sunglasses or a scarf on a human subject. The classic batch image alignment approaches, such as congealing [?] or least-squares congealing [?], cannot simultaneously handle such severe conditions, causing the alignment task to fail.

With the breakthrough of convex relaxation theory applied to decomposing matrices into a sum of low-rank and sparse matrices [?], the recently proposed algorithm “Robust Alignment by Sparse and Low-rank decomposition,” or RASL [?], poses the robust image alignment problem as a transformed version of Robust PCA: the transformed batch of images can be decomposed as the sum of a low-rank matrix of recovered aligned images and a sparse matrix of errors. RASL seeks the optimal domain transformations while trying to minimize the rank of the matrix of the vectorized and stacked aligned images and while keeping the gross errors sparse. While the rank minimization and $\ell_0$-norm minimization can be relaxed to their convex surrogates (the nuclear norm and the $\ell_1$ norm, respectively), the relaxed problem is still highly non-linear due to the complicated domain transformation:

$$\min_{A, E, \tau} \; \|A\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad D \circ \tau = A + E$$

Here, $D = [\operatorname{vec}(I_1) \mid \cdots \mid \operatorname{vec}(I_n)] \in \mathbb{R}^{m \times n}$ represents the data ($m$ pixels per each of $n$ images), $A$ is the low-rank component, $E$ is the sparse additive component, and $\tau = \{\tau_1, \ldots, \tau_n\}$ are the transformations. RASL proposes to tackle this difficult optimization problem by iteratively and locally linearizing the non-linear image transformation, $D \circ (\tau + \Delta\tau) \approx D \circ \tau + \sum_{i=1}^{n} J_i \Delta\tau_i e_i^T$, where $J_i$ is the Jacobian of image $i$ with respect to transformation $\tau_i$ and $e_i$ is the $i$-th standard basis vector; then in each iteration the linearized problem is convex. The authors have shown that RASL works well for batch aligning linearly correlated images despite large illumination variations and occlusions.
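To make the linearization concrete, the following sketch approximates the Jacobian of a warped, vectorized image numerically; the `warp` helper, the parameter names, and the finite-difference scheme are our illustrative assumptions rather than RASL's implementation.

```python
import numpy as np

def numerical_jacobian(warp, image, tau, eps=1e-4):
    """Finite-difference approximation of the Jacobian of vec(I o tau)
    with respect to the transformation parameters tau. `warp(image, tau)`
    is an assumed helper returning the transformed image; any affine or
    Euclidean resampler could play this role."""
    base = warp(image, tau).ravel()           # vec(I o tau), length m
    J = np.zeros((base.size, tau.size))
    for k in range(tau.size):
        tau_pert = tau.copy()
        tau_pert[k] += eps                    # perturb one parameter
        J[:, k] = (warp(image, tau_pert).ravel() - base) / eps
    return J

# Local linearization: vec(I o (tau + dtau)) ~ vec(I o tau) + J @ dtau
```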

In order to improve the scalability of robust image alignment for massive image datasets, [?] proposes an efficient ALM-based (Augmented Lagrange Multiplier-based) iterative convex optimization algorithm, ORIA (Online Robust Image Alignment), for online alignment of the input images. Though this approach can scale to large image datasets, it requires the subspace of the aligned images as prior knowledge, and for this it uses RASL to train the initial aligned subspace. Once the input images can no longer be well aligned by the current subspace, the authors use a heuristic method to update the basis. In contrast, with t-GRASTA we include the subspace in the cost function and update the subspace using a gradient geodesic step on the Grassmannian, as in [?]. We discuss this in more detail in the next section.

1.2 Online Robust Subspace Learning

Subspace learning has been an area important to signal processing for a few decades. There are many applications in which one must track signal and noise subspaces, from computer vision to communications and radar, and a survey of the related work can be found in [?].

The GROUSE algorithm, or “Grassmannian Rank-One Update Subspace Estimation,” is an online subspace estimation algorithm that can track changing subspaces in the presence of Gaussian noise and missing entries [?]. GROUSE was developed as an online variant of low-rank matrix completion algorithms. It uses incremental gradient methods, which have been receiving extensive attention in the optimization community [?]. However, GROUSE is not robust to gross outliers; the follow-up algorithm GRASTA [?] can estimate a changing low-rank subspace as well as identify and subtract outliers. Still problematic is that, as we showed in Figure 1, even GRASTA cannot handle camera jitter. Our algorithm includes the estimation of transformations in order to align frames first before separating foreground and background.

2 Robust Image Alignment via Iterative Online Subspace Learning

2.1 Model

Batch mode

In order to robustly align a set of linearly correlated images despite sparse outliers, we consider the following matrix factorization model, in which the matrix $U$ has orthonormal columns that span the low-dimensional subspace of the well-aligned images:

$$\min_{U, W, E, \tau} \; \|E\|_1 \quad \text{s.t.} \quad D \circ \tau = U W + E, \;\; U^T U = I$$

We have replaced the variable $A$ with the product of two smaller matrices $U W$, where the orthonormal columns of $U \in \mathbb{R}^{m \times d}$ span the low-rank subspace of the images. The set of all $d$-dimensional subspaces of $\mathbb{R}^m$ is called the Grassmannian, which is a compact Riemannian manifold denoted by $\mathcal{G}(d, m)$. In this optimization model, $U$ is constrained to the Grassmannian $\mathcal{G}(d, m)$. Though this problem cannot be directly solved [?] due to the nonlinearity of the image transformation, if the misalignments are not too large, then by locally linearly approximating the image transformation $D \circ \tau$ we obtain an iterative model that works well as a practical approach:

$$\min_{U^k, W, E, \Delta\tau^k} \; \|E\|_1 \quad \text{s.t.} \quad D \circ \tau^k + \sum_{i=1}^{n} J_i^k \Delta\tau_i^k e_i^T = U^k W + E$$

At algorithm iteration $k$, $\tau^k = \{\tau_1^k, \ldots, \tau_n^k\}$ are the current estimated transformations, $J_i^k$ is the Jacobian of the $i$-th image with respect to the transformation $\tau_i^k$, and $e_i$ denotes the $i$-th standard basis vector for $\mathbb{R}^n$. Note that at different iterations the subspace may have a different dimension, i.e. $U^k$ is constrained to a different Grassmannian $\mathcal{G}(d_k, m)$.

At each iteration $k$ of the iterative model, we view this optimization problem as a subspace learning problem. That is, our goal is to robustly estimate the low-dimensional subspace $U^k$ which best represents the locally transformed images despite the sparse outliers $E$. In order to solve this subspace learning problem efficiently with regard to both computation and memory, we propose to learn $U^k$ at each iteration via the online robust subspace learning approach [?].

Online mode

In order to perform online video processing tasks, for example video stabilization, it is desirable to design an efficient approach that can handle image misalignment frame by frame. As in the previous discussion regarding batch mode processing, for each video frame $I_t$, we may model the minimization problem as follows:

$$\min_{w, e, \tau_t} \; \|e\|_1 \quad \text{s.t.} \quad I_t \circ \tau_t = U w + e$$

Note that with the constraint in the above minimization problem, we suppose that for each frame the transformed image $I_t \circ \tau_t$ is well aligned to the low-rank subspace $U$. However, due to the nonlinear geometric transform $\tau_t$, directly exploiting online subspace learning techniques [?] is not possible.

Here we approach this as a manifold learning problem, supposing that the low-dimensional image subspace under nonlinear transformations forms a nonlinear manifold. We propose to learn the manifold approximately using a union of subspaces model $\{U_l\}$, $l = 1, \ldots, L$. The basic idea is illustrated in Figure 2, and the locally linearized model for the nonlinear problem is as follows:

$$\min_{w, e, \Delta\tau_t} \; \|e\|_1 \quad \text{s.t.} \quad I_t \circ \tau_t + J_t \Delta\tau_t = U_l w + e$$

Figure 2:  The illustration of iteratively approximating the nonlinear image manifold using a union of subspaces.

Intuitively, from Figure 2, it is reasonable to expect that the initial misaligned image sequence should be high rank; then, after iteratively approximating the nonlinear transform with a locally linear approximation, the rank of the new subspaces $U_l$, $l = 1, \ldots, L$, should decrease as the images become more and more aligned. Then for each misaligned image $I_t$ and its unknown transformation $\tau_t$, we iteratively update the union of subspaces $\{U_l\}$ and estimate the transformation. Details of the online mode of t-GRASTA are discussed in Section 2.4.

The use of a union of subspaces $\{U_l\}$, $l = 1, \ldots, L$, to approximate the nonlinear manifold is a crucial innovation for this fully online model. Though we use the symbol $U$ in both the batch mode and the online mode, it has two different interpretations. For batch mode, $U^k$ is the aligned subspace learned at iteration $k$; for online mode, $U_l$, $l = 1, \ldots, L$, is a collection of subspaces used to approximate the nonlinear transform, and these are updated iteratively for each video frame.

2.2 ADMM Solver for the Locally Linearized Problem

Whether operating in batch mode or online mode, the key problem is how to quantify the subspace error robustly for the locally linearized problem. Considering batch mode, at iteration $k$, given the $i$-th image $I_i$, its estimated transformation $\tau_i^k$, the Jacobian $J_i^k$, and the current estimate of the subspace $U^k$, we use the $\ell_1$ norm as follows:

$$\mathcal{F}(U^k; i) = \min_{w, \Delta\tau_i} \; \big\| I_i \circ \tau_i^k + J_i^k \Delta\tau_i - U^k w \big\|_1$$

With $\tau_i^k$ known (or estimated, but fixed), this minimization problem is a variation of the least absolute deviations problem, which can be solved efficiently by ADMM (Alternating Direction Method of Multipliers) [?]. We rewrite the right-hand side as an equivalent constrained problem by introducing a sparse outlier vector $e$:

$$\min_{w, \Delta\tau_i, e} \; \|e\|_1 \quad \text{s.t.} \quad I_i \circ \tau_i^k + J_i^k \Delta\tau_i = U^k w + e$$

The augmented Lagrangian of this problem is

$$\mathcal{L}(w, \Delta\tau_i, e, \lambda) = \|e\|_1 + \lambda^T h(w, \Delta\tau_i, e) + \frac{\rho}{2} \, \|h(w, \Delta\tau_i, e)\|_2^2$$

where $h(w, \Delta\tau_i, e) = U^k w + e - I_i \circ \tau_i^k - J_i^k \Delta\tau_i$, and $\lambda$ is the Lagrange multiplier or dual vector.

Given the current estimated subspace $U^k$, the transformation parameter $\tau_i^k$, and the Jacobian matrix $J_i^k$ with respect to the $i$-th image, the optimal $(w^*, \Delta\tau_i^*, e^*)$ can be computed by the ADMM approach, alternately minimizing the augmented Lagrangian over each block of variables and then updating the dual vector:

$$\begin{aligned}
w^{j+1} &= (U^k)^T \big( I_i \circ \tau_i^k + J_i^k \Delta\tau_i^j - e^j - \lambda^j / \rho^j \big) \\
\Delta\tau_i^{j+1} &= (J_i^k)^{\dagger} \big( U^k w^{j+1} + e^j - I_i \circ \tau_i^k + \lambda^j / \rho^j \big) \\
e^{j+1} &= \mathcal{S}_{1/\rho^j} \big( I_i \circ \tau_i^k + J_i^k \Delta\tau_i^{j+1} - U^k w^{j+1} - \lambda^j / \rho^j \big) \\
\lambda^{j+1} &= \lambda^j + \rho^j \, h(w^{j+1}, \Delta\tau_i^{j+1}, e^{j+1})
\end{aligned}$$

where $\mathcal{S}_{1/\rho}(\cdot)$ is the elementwise soft thresholding operator [?], and $\rho^j$ is the ADMM penalty constant, enforced to be a monotonically increasing positive sequence. The iteration indeed converges to the optimal solution of the problem [?]. We summarize this ADMM solver as Algorithm ? in Section 2.4.
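For illustration, here is a minimal ADMM sketch for the linearized problem above, written to match the block updates just described; the fixed penalty `rho`, the Gauss-Seidel update order, and the stopping rule are simplifying assumptions of ours rather than the exact steps of Algorithm ?.

```python
import numpy as np

def soft_threshold(x, thresh):
    """Elementwise soft-thresholding (shrinkage) operator S_thresh."""
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def admm_linearized(U, v, J, rho=1.0, n_iter=100, tol=1e-7):
    """Hedged ADMM sketch for
        min_{w, dtau, e} ||e||_1  s.t.  v + J @ dtau = U @ w + e,
    where v = vec(I o tau) and U has orthonormal columns."""
    m, d = U.shape
    w, dtau = np.zeros(d), np.zeros(J.shape[1])
    e, lam = np.zeros(m), np.zeros(m)          # outliers and dual vector
    J_pinv = np.linalg.pinv(J)
    for _ in range(n_iter):
        w = U.T @ (v + J @ dtau - e - lam / rho)
        dtau = J_pinv @ (U @ w + e - v + lam / rho)
        e = soft_threshold(v + J @ dtau - U @ w - lam / rho, 1.0 / rho)
        h = U @ w + e - (v + J @ dtau)         # primal residual
        lam = lam + rho * h                    # dual ascent step
        if np.linalg.norm(h) < tol * max(1.0, np.linalg.norm(v)):
            break
    return w, dtau, e, lam
```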

2.3 Subspace Update

Whether identifying the best $U^k$ in the batch mode or estimating the union of subspaces $U_l$, $l = 1, \ldots, L$, in the online mode, optimizing the orthonormal matrix $U$ along the geodesic of the Grassmannian is our key technique. For clarity of exposition in this section, we drop the superscript $k$ and subscript $l$ from $U$, as the core gradient step along the geodesic of the Grassmannian is the same for both batch mode and online mode. We seek a sequence of estimates of $U$ that converges to the true subspace. We now face the choice of an effective subspace loss function. Regarding $U$ as the variable, the $\ell_1$ loss function is not differentiable everywhere. Therefore, we choose to instead use the augmented Lagrangian $\mathcal{L}$ as the subspace loss function once we have estimated the remaining variables by ADMM as in the previous section [?].

In order to take a gradient step along the geodesic of the Grassmannian, according to [?], we first need to derive the gradient formula of the real-valued loss function $\mathcal{L}(U)$. The gradient can be determined from the derivative of $\mathcal{L}$ with respect to the components of $U$:

$$\frac{\partial \mathcal{L}}{\partial U} = \big( \lambda + \rho \, h(w, \Delta\tau_i, e) \big) \, w^T$$

Then the gradient is $\nabla \mathcal{L} = (I - U U^T) \, \partial \mathcal{L} / \partial U$ [?]. From Step ? of Algorithm ?, we can write $\nabla \mathcal{L} = \Gamma w^T$ with $\Gamma = (I - U U^T)(\lambda + \rho \, h)$ (see the definition of $h$ in Alg. ?). It is easy to verify that $\nabla \mathcal{L}$ is rank one, since $\Gamma$ is a vector and $w$ is a weight vector. The following derivation of the geodesic gradient step is similar to GROUSE [?] and GRASTA [?]. We rewrite the important steps of the derivation here for completeness.

The sole non-zero singular value is $\sigma = \|\Gamma\| \, \|w\|$, and the corresponding left and right singular vectors are $\Gamma / \|\Gamma\|$ and $w / \|w\|$ respectively. Then we can write the SVD of the gradient explicitly by adding an orthonormal set orthogonal to $\Gamma$ as left singular vectors and an orthonormal set orthogonal to $w$ as right singular vectors; in compact form,

$$\nabla \mathcal{L} = \sigma \, \frac{\Gamma}{\|\Gamma\|} \, \frac{w^T}{\|w\|}.$$

Finally, following Equation (2.65) in [?], a geodesic gradient step of length $\eta$ in the direction $-\nabla \mathcal{L}$ is given by

$$U(\eta) = U + \Big( \big( \cos(\eta\sigma) - 1 \big) \, U \frac{w}{\|w\|} - \sin(\eta\sigma) \, \frac{\Gamma}{\|\Gamma\|} \Big) \frac{w^T}{\|w\|}.$$
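Because the gradient is rank one, this geodesic step costs only a few matrix-vector products. The sketch below implements the update above; the variable names are ours, and the projected direction `gamma` is assumed to be computed from the ADMM outputs as in the derivation.

```python
import numpy as np

def geodesic_step(U, w, gamma, eta):
    """Hedged sketch of a step of length eta along the Grassmannian
    geodesic in the direction of the negative rank-one gradient
    grad = gamma @ w.T, with gamma orthogonal to the span of U."""
    w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(gamma)
    sigma = g_norm * w_norm                  # sole nonzero singular value
    if sigma == 0.0:
        return U                             # zero gradient: no update
    step = ((np.cos(eta * sigma) - 1.0) * (U @ w) / w_norm
            - np.sin(eta * sigma) * gamma / g_norm)
    return U + np.outer(step, w / w_norm)
```

Note that the update touches $U$ only through $Uw$ and a rank-one correction, so its cost is $O(md)$ per image.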

2.4 Algorithms

Batch Mode

From the discussion in Sections 2.2 and 2.3, given the batch of unaligned images $\{I_i\}$, their estimated transformations $\{\tau_i^k\}$, and their Jacobians $\{J_i^k\}$ at iteration $k$, we can robustly identify the subspace $U^k$ by incrementally updating it along the geodesic of the Grassmannian. As the subspace estimate converges, the estimated weights and transformation increment for each locally aligned image also approach their optimal values. Once the subspace is accurately learned, we update the estimated transformation of each image via $\tau_i^{k+1} = \tau_i^k + \Delta\tau_i^k$. Then in the next iteration, the new subspace $U^{k+1}$ can be learned in the same way, and the algorithm iterates until we reach the stopping criterion, e.g. when the transformation increments fall below a tolerance or we reach the maximum number of iterations.

We summarize our algorithms as follows. Algorithm ? is the batch image alignment approach via iterative online robust subspace learning. For Step ?, there are many ways to pick the step-size: for example, one may consider the diminishing and constant step-sizes adopted in GROUSE [?], or the multi-level adaptive step-size used for fast convergence in GRASTA [?].

Algorithm ? is the ADMM solver for the locally linearized problem. In our extensive experiments, with the ADMM penalty parameter and the convergence tolerance set appropriately, Algorithm ? has always converged within a small number of iterations.
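Putting the pieces together, a hedged outline of the batch mode might look as follows; it reuses the `numerical_jacobian`, `admm_linearized`, and `geodesic_step` sketches above, the `warp` helper remains an assumption, and the pass counts and fixed step-size stand in for the step-size rules cited above.

```python
import numpy as np

def t_grasta_batch(images, taus, U, warp, n_outer=30, n_inner=3,
                   eta=0.01, rho=1.0):
    """Hedged outline of batch-mode t-GRASTA: re-linearize around the
    current transformations, refine the subspace U by incremental
    geodesic steps, then update each transformation."""
    for _ in range(n_outer):
        # Re-linearize all images around the current transformations.
        vs = [warp(img, tau).ravel() for img, tau in zip(images, taus)]
        Js = [numerical_jacobian(warp, img, tau)
              for img, tau in zip(images, taus)]
        # Learn the subspace for this linearization incrementally.
        for _ in range(n_inner):
            for i in np.random.permutation(len(images)):
                w, dtau, e, lam = admm_linearized(U, vs[i], Js[i], rho=rho)
                h = U @ w + e - (vs[i] + Js[i] @ dtau)
                gamma = lam + rho * h
                gamma = gamma - U @ (U.T @ gamma)   # project off span(U)
                U = geodesic_step(U, w, gamma, eta)
        # Update each transformation with its estimated increment.
        for i in range(len(images)):
            w, dtau, e, lam = admm_linearized(U, vs[i], Js[i], rho=rho)
            taus[i] = taus[i] + dtau
    return U, taus
```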

Online Mode

In Section 2.1, we proposed to tackle the difficult nonlinear online subspace learning problem by iteratively learning online a union of subspaces $\{U_l\}$, $l = 1, \ldots, L$. For a sequence of video frames $\{I_t\}$, the union of subspaces is updated iteratively as illustrated in Figure 3.

Specifically, at the $t$-th frame $I_t$, for the locally approximated subspace $U_1$ at the first iteration, given the initial roughly estimated transformation $\tau_t^0$, the ADMM solver Algorithm ? gives us the locally estimated $(w^*, \Delta\tau^*, e^*)$, and the updated subspace $U_1$ is obtained by taking a gradient step along the geodesic of the Grassmannian as discussed in Section 2.3. The transformation for the next iteration is updated by $\tau_t^1 = \tau_t^0 + \Delta\tau^*$. Then for the next locally approximated subspace $U_2$, we likewise estimate the variables by ADMM and update the subspace along the geodesic of the Grassmannian. Repeating this process, we update each subsequent $U_l$ in the same way, obtaining a refined transformation $\tau_t^l$ at each level. After completing the update for all $L$ subspaces, the union of subspaces will be used for approximating the nonlinear transform of the next video frame $I_{t+1}$.

We summarize the above statements as Algorithm ?, and we call this approach the fully online mode of t-GRASTA.
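A hedged per-frame sketch of this fully online mode follows, again reusing the helpers above; the initialization of the $L$ subspaces and the step-size schedule are simplified assumptions.

```python
import numpy as np

def t_grasta_online_frame(frame, tau0, Us, warp, eta=0.01, rho=1.0):
    """Hedged sketch of fully online t-GRASTA for one frame: each of
    the L locally linearized subspaces in Us is visited in turn,
    yielding a refined transformation and an updated subspace."""
    tau = tau0
    for l in range(len(Us)):
        v = warp(frame, tau).ravel()
        J = numerical_jacobian(warp, frame, tau)
        w, dtau, e, lam = admm_linearized(Us[l], v, J, rho=rho)
        h = Us[l] @ w + e - (v + J @ dtau)
        gamma = lam + rho * h
        gamma = gamma - Us[l] @ (Us[l].T @ gamma)    # project off span
        Us[l] = geodesic_step(Us[l], w, gamma, eta)  # update U_l
        tau = tau + dtau                             # refine alignment
    background = Us[-1] @ w        # low-rank part of the aligned frame
    return tau, background, e      # e carries the sparse foreground
```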

Figure 3:  The diagram of the fully online mode of t-GRASTA.

Discussion of Online Image Alignment

If the subspace $U$ of the well-aligned images is known a priori, for example if $U$ is trained by Algorithm ? on a “well selected” dataset of one category, we can simply use $U$ to align the rest of the unaligned images of the same category. Here “well selected” means that the training dataset should cover enough of the global appearance of the object, such as different illuminations, to be represented by the low-dimensional subspace structure. By category, we mean a particular object of interest or a particular background scene in the video surveillance data.

For massive image processing tasks, it is easy to collect such a good training dataset by simply randomly sampling a small fraction of the whole image set. Once $U$ is learned from the training set, we can use a variation of Algorithm ? to align each unaligned image without updating the subspace, under the assumption that the remaining images also lie in the trained subspace. We call this variation, Algorithm ?, the trained online mode; a sketch follows.
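In a hedged sketch, the trained online mode is the per-frame loop above with the Grassmannian update switched off; the number of refinement passes `n_refine` is an assumed parameter.

```python
def align_with_trained_subspace(frame, tau0, U, warp, n_refine=5, rho=1.0):
    """Hedged sketch of the trained online mode: the subspace U stays
    fixed, and each image is aligned by a few re-linearized ADMM solves,
    reusing the numerical_jacobian and admm_linearized sketches above."""
    tau = tau0
    for _ in range(n_refine):
        v = warp(frame, tau).ravel()
        J = numerical_jacobian(warp, frame, tau)
        w, dtau, e, _ = admm_linearized(U, v, J, rho=rho)
        tau = tau + dtau               # refine the alignment only
    return tau, U @ w, e               # transform, low-rank part, outliers
```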

However, we note that for a very large streaming dataset, such as is typical in real-time video processing, the trained online mode may be less well-suited, as the subspace of the streaming video data may change over time. For this scenario, our fully online mode for t-GRASTA can gradually adapt to the changing subspace and then accurately estimate each transformation.

2.5 Discussion of Memory Usage

We compare the memory usage of our fully online mode of t-GRASTA to that of RASL. RASL requires storage of the low-rank matrix $A$, the sparse matrix $E$, a Lagrange multiplier matrix $Y$, the data $D$, and the transformed data $D \circ \tau$, each of which requires storage of size $mn$. To compare fairly to t-GRASTA, which assumes a $d$-dimensional model, we suppose RASL uses a thin singular value decomposition of rank $d$, which requires $(m + n + 1)d$ memory elements. Finally, RASL needs additional storage proportional to the number of transformation parameters $|\tau|$ for the Jacobians and the increments $\Delta\tau$, but we will assume $|\tau|$ is a small constant independent of dimension and ignore it. Therefore RASL's total memory usage is on the order of $5mn + (m + n + 1)d$.

t-GRASTA must also store the Jacobian, the current image, and the transformed image, which together use memory proportional to $m$. In addition, t-GRASTA needs to store the union of subspaces $\{U_l\}$, i.e. $L$ matrices of size $m \times d$, and the working vectors $w$, $e$, $\lambda$, and the residual, for a few more multiples of $m$ memory elements. Thus t-GRASTA's memory total is on the order of $Lmd$ plus lower-order terms.

For a problem size of 100 images, each with 100×100 pixels, t-GRASTA uses 66.1% of the memory of RASL. For 10000 mega-pixel images, t-GRASTA uses 50.1% of the memory of RASL. The savings remain about half throughout mid-range to large problem sizes.

3 Performance Evaluation

In this section, we conduct comprehensive experiments on a variety of alignment tasks to verify the efficiency and superiority of our algorithm. We first demonstrate the ability of the proposed approach to cope with occlusion and illumination variation during the alignment process. We then further demonstrate the robustness and generality of our approach by testing it on handwritten digits and on face images taken from the Labeled Faces in the Wild database [?]. Finally, we apply our approach to dealing with video jitter and to solving the background and foreground separation problem.

3.1 Occlusion and illumination variation

We first test our approach on the ‘dummy’ dataset described in [?]. Here, we want to verify the ability of our approach to effectively align the images despite occlusion and illumination variation. The dataset contains 100 images of a dummy head taken under varying illumination and with artificially generated occlusions created by adding a square patch at a random location of each image. Figure 4 shows 10 misaligned images of the dummy. We align these images by Algorithm ? (the batch mode of t-GRASTA). The images are aligned to a fixed canonical frame and the subspace dimension is set to 5. Here and in the rest of our experiments, for simplicity we set the step-size of Algorithm ? to a fixed constant in every iteration. The last three rows of Figure 4 show the results of alignment, from which we can see that our approach succeeds at aligning the misaligned images while removing the occlusions at the same time.

Figure 4:  The first row shows the original misaligned images with occlusions and illumination variation; the second row shows the images aligned by t-GRASTA; the third row shows the recovered aligned images without occlusion; and the bottom row is the occlusion removed by our approach.
Figure 5: (a) Average of 16 misaligned subjects randomly selected from the LFW database; (b) average of each subject aligned by t-GRASTA; (c) initial images of John Ashcroft (marked by red boxes in (a) and (b)); (d) images aligned by t-GRASTA.

3.2 Robustness

In order to further demonstrate the robustness of our approach, we apply it to more realistic images taken from the Labeled Faces in the Wild database [?]. LFW contains more severely misaligned images, for it also includes remarkable variations in pose and expression aside from illumination and occlusion, as can be seen in Figure 5(c). We chose 16 subjects from LFW, each of them with 35 images. Each image is aligned to a canonical frame using transformations from the group of affine transformations, as in [?]; these are translations, rotations, and scale transformations. For each subject, we set the subspace dimension $d = 15$ and use Algorithm ? to align each image. In this example, we demonstrate the robustness of our approach by comparing the average face of each subject before and after alignment, shown in Figure 5(a)-(b). We can see that the average faces after alignment are much clearer than those before alignment. Figure 5(c)-(d) provides more detailed information, showing the unaligned and aligned images of John Ashcroft (marked by red boxes in Figure 5(a)-(b)).

3.3 Generality

The previous experiments have demonstrated the effectiveness and robustness of t-GRASTA. Here we wish to show the generality of t-GRASTA by applying it to aligning a different type of image: handwritten digits taken from the MNIST database. For this experiment, we again use Algorithm ? to align 100 images of a handwritten “3” to a canonical frame. We use a Euclidean transformation and set the dimension of the subspace to 5.

Figure 6 shows that t-GRASTA can successfully align the misaligned digits and learn the low-dimensional subspace, even though the original digits have significant variation. We can see that the outliers separated by t-GRASTA are generated by variations in the digits that are not consistent with the global appearance. The outliers (d) would be even sparser if the subspace representation in (c) were to capture more of this variation; if desired, we could achieve this tradeoff by increasing the dimension of the subspace.

Figure 6:  (a) 100 misaligned digits; (b) digits aligned by t-GRASTA; (c) subspace representation of corresponding digits; (d) outliers.

3.4 Video Jitter

In this section, we apply t-GRASTA to separation problems made difficult by video jitter. Here we apply both the fully online mode (Algorithm ?) and the trained online mode (Algorithm ?) to different datasets. We show the superiority of t-GRASTA regarding both the speed and the memory requirements of the algorithms.

Hall

Here we apply t-GRASTA to the task of separating moving objects from static background in video footage recorded by an unstable camera. We note that in [?], the authors simulate a virtual panning camera to show that GRASTA can quickly track sudden changes in the background subspace caused by a moving camera. Their low-rank subspace tracking model is well-defined, as the camera after panning is still stationary, and thus the recorded video frames are accurately pixelwise aligned. However, for an unstable camera, the recorded frames are no longer aligned; the background cannot be well represented by a low-rank subspace unless the jittered frames are first aligned. In order to show that t-GRASTA can tackle this separation task, we consider a highly jittered video sequence generated by a simulated unstable camera. To simulate the unstable camera, we randomly translate the original well-aligned video frames along the x- and y-axes and rotate them in the plane.

In this experiment, we compare t-GRASTA with RASL and GRASTA. We use the first 200 frames of the “Hall” dataset1. We first perturb each frame artificially to simulate camera jitter: the rotation of each frame is random and uniformly distributed within a fixed range, and the x- and y-translations are uniformly distributed with maximum magnitudes of 20 pixels each.
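For concreteness, a hedged version of this jitter simulation is sketched below using standard SciPy resamplers; the maximum rotation angle and the boundary handling are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def jitter(frame, max_shift=20, max_angle=3.0, rng=None):
    """Hedged sketch of the camera-jitter simulation: each frame is
    randomly rotated in the plane and translated along the x- and
    y-axes. max_angle (degrees) is an assumed value."""
    rng = rng or np.random.default_rng()
    angle = rng.uniform(-max_angle, max_angle)
    dx, dy = rng.uniform(-max_shift, max_shift, size=2)
    out = ndimage.rotate(frame, angle, reshape=False, mode='nearest')
    return ndimage.shift(out, (dy, dx), mode='nearest')
```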

For comparison with RASL, unlike [?], we simply let RASL run its original batch model without forcing it into an online algorithm framework. The task we give to RASL and t-GRASTA is to align each frame to a canonical frame, again using affine transformations. The dimension of the subspace in t-GRASTA is set to 10. We first randomly select 30 of the 200 frames to train the subspace by Algorithm ? and then align the rest using the trained online mode. The visual comparison between RASL and t-GRASTA is shown in Figure 7. Table ? gives the numerical comparison of RASL and t-GRASTA, for which we ran each algorithm 10 times to gather statistics. From Table ? and Figure 7 we can see that the two algorithms achieve a very similar effect, but t-GRASTA runs much faster than RASL: on a PC with an Intel P9300 2.27 GHz CPU and 2 GB of RAM, the average time for aligning a newly arrived frame is 1.1 seconds, while RASL needs more than 800 seconds to align the total batch of images, or 4 seconds per frame. Moreover, our approach is also superior to RASL regarding memory efficiency. These advantages become more dramatic as the size of the image database increases.

Statistics of errors in two pixels, $p_1 = (x_1, y_1)$ and $p_2 = (x_2, y_2)$, selected from the original video frames and traced through the jitter simulation process to the RASL and t-GRASTA output frames. Max error and mean error are calculated as the distances from the estimated $p_1$ and $p_2$ to their statistical centers. Std is calculated as the standard deviation of each of the four coordinate values across all frames.
Method                 Max error   Mean error   X1 std   Y1 std   X2 std   Y2 std
Initial misalignment   11.24       5.07         3.35     3.01     3.34     4.17
RASL                   2.96        1.73         0.56     0.71     0.90     1.54
t-GRASTA               6.62        0.84         0.48     1.11     0.57     0.74

Figure 7: Comparison between t-GRASTA and RASL. (a) Average of initial misaligned images; (b) average of images aligned by t-GRASTA; (c) average of background recovered by t-GRASTA; (d) average of images aligned by RASL; (e) average of background recovered by RASL.
Figure 8: Video background and foreground separation with jittered video. 1st row: 8 misaligned video frames randomly selected from artificially perturbed images; 2nd row: images aligned by t-GRASTA; 3rd row: background recovered by t-GRASTA; 4th row: foreground separated by t-GRASTA; 5th row: background recovered by GRASTA; 6th row: foreground separated by GRASTA.

In order to compare with GRASTA, we use 200 perturbed images to recover the background and separate the moving objects with both algorithms; Figure 8 illustrates the comparison. For both GRASTA and t-GRASTA, we set the subspace rank to 10 and randomly select 30 images to train the subspace first. For t-GRASTA, we use the affine transformation group. From Figure 8, we can see that our approach successfully separates the foreground and the background and simultaneously aligns the perturbed images, but GRASTA fails to learn a proper subspace, and thus its separation of background and foreground is poor. Although GRASTA has been demonstrated to successfully track a dynamic subspace, e.g. from a panning camera, the dynamics of an unstable camera are too fast and unpredictable for the GRASTA subspace tracking model to succeed in this context without pre-alignment of the video frames.

Gore

In this example, we show the capability of t-GRASTA for video stabilization applied to the “Gore” dataset described in [?]. In [?], the original face images are obtained by a face detector, and the jitters are caused by the inherent imprecision of the detector. In contrast, for t-GRASTA, we simply crop the face from each image using a fixed rectangle with the same size and position for all frames. So in our case, the jitters are caused by the mismatch between the motion and pose variation of the target and the fixed position of the cropping rectangle.

Figure 9:  The first row shows the original misaligned images; the second row shows the images aligned by t-GRASTA; the third row shows the recovered aligned images without outliers; and the bottom row shows the outliers removed by our approach.

For this experiment, the dimension of the subspace is set to 10, and we again choose the affine transformation group. We first use Algorithm ? to train an initial subspace using 20 images randomly selected from the whole set of 140 images. We then use the fully online mode to align the rest of the images. Figure 9 shows the results. t-GRASTA did well on this dataset with better speed than RASL: on a PC with an Intel P9300 2.27 GHz CPU and 2 GB of RAM, t-GRASTA aligned these images at 5 frames per second. This is 5 times faster than RASL and 3 times faster than ORIA as described in [?].

Although t-GRASTA was not designed as a face detector, the experimental results suggest that t-GRASTA could be adapted into a face detector, or more generally a target tracker, if the pose variation of the target is limited to a certain range. In this case, we can further improve the efficiency of t-GRASTA by choosing a tight frame for the canonical image.

Sidewalk

In the last experiment, we use misaligned frames caused by real camera jitter to test t-GRASTA. Here we align all 1200 frames of the “Sidewalk” dataset2 to canonical frames, again using affine transformations and subspace dimension 5. We use the first 20 frames to train the initial subspace using the batch mode Algorithm ?, and then use the fully online mode to align the rest of the frames. Aligning all 1200 frames is a heavy task for RASL: on our PC with an Intel P9300 2.27 GHz CPU and 2 GB of RAM, it was necessary to divide the dataset into four parts, each containing 300 frames, and run RASL separately on each sub-dataset. The total time needed by RASL was around 1000 seconds, or 1.2 frames per second, while t-GRASTA achieved more than 4 frames per second without partitioning the data.

Compared to the trained online mode, the fully online mode can track changes of the subspace over time. This is an important asset of the fully online mode, especially when it comes to large streaming datasets containing considerable variations. We find that we usually need no more than 20 frames for the fully online mode to adapt to changes of the subspace, such as illumination changes or a dynamic background caused by motion of the scene. Moreover, if the changes are slow, i.e. natural illumination changing with daylight or the camera moving slowly, then t-GRASTA needs no extra frames to track such changes; it incorporates the new information with each iteration during the slowly changing process.

Figure 10: Video background and foreground separation with jittered video. 1st row: 8 original misaligned video frames caused by video jitter; 2nd row: images aligned by t-GRASTA; 3rd row: background recovered by t-GRASTA; 4th row: foreground separated by t-GRASTA.

4 Conclusions and Future Work

4.1 Conclusions

In this paper we have presented an iterative Grassmannian optimization approach to simultaneously identify an optimal set of image domain transformations for image alignment and the low-rank subspace matching the aligned images. These are such that the vectorization of each transformed image can be decomposed as the sum of a low-rank part, the recovered aligned image, and a sparse part of errors. This approach can be regarded as an extension of both GRASTA and RASL: we extend GRASTA to handle transformations, and we extend RASL to the incremental gradient optimization framework. Our approach is faster than RASL and more robust to misalignment than GRASTA. We can learn the low-rank subspace from misaligned images effectively and computationally efficiently, which is very practical for computer vision applications.

4.2 Future Work

Though this work presents an approach for robust image alignment that is more computationally efficient than the state-of-the-art, a foremost remaining problem is how to scale the proposed approach to very large streaming datasets such as are typical in real-time video processing. The fully online t-GRASTA algorithm presented here is a first step towards a truly large-scale real-time algorithm, but several practical implementation questions remain, including online parameter selection and error performance cross-validation. Another question of interest regards the estimation of the subspace dimension $d$ for the subspace update. Though we fix the rank in this paper, estimating $d$ and switching between Grassmannians is a very interesting future direction.

While preparing the conference version of this work [?], we noticed an interesting alignment approach proposed in [?]. Though our approach and that of [?] are both obtained via optimization over a manifold, they perform alignment in very different scenarios. For example, the approach in [?] focuses on semantically meaningful videos or signals, and so it can successfully align videos of the same object from different views; t-GRASTA manipulates a set of misaligned images, or video from an unstable camera, to robustly identify the low-rank subspace, and then aligns these images according to the subspace. An intriguing future direction would be to merge these two approaches.

A final direction of future work is toward applications which require more aggressive background tracking than is possible by a GRASTA-type algorithm. For example, if a camera is following an object around different parts of a single scene, even though the background may be quickly varying from frame to frame, the camera will get multiple shots of different pieces of the background. Therefore, it may be possible to still build a model for the entire background scene using low-dimensional modeling. Incorporating camera movement parameters and a dynamical model into GRASTA would be a natural way to solve this problem, merging classical adaptive filtering algorithms with modern manifold optimization.

5 Acknowledgements

The work of Jun He is supported by NSFC (61203273) and by the Collegiate Natural Science Fund of Jiangsu Province (11KJB510009). Laura Balzano would like to acknowledge 3M for generously supporting her Ph.D. studies.

Footnotes

  1. Find these along with the videos at http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html.
  2. Find it along with other datasets containing misaligned frames caused by real video jitters at http://wordpress-jodoin.dmi.usherb.ca/dataset.