Dense Non-rigid Structure-from-Motion Made Easy – A Spatial-Temporal Smoothness based Solution

# Dense Non-rigid Structure-from-Motion Made Easy – A Spatial-Temporal Smoothness based Solution

###### Abstract

This paper proposes a simple spatial-temporal smoothness based method for solving dense non-rigid structure-from-motion (NRSfM). First, we revisit the temporal smoothness and demonstrate that it can be extended to the dense case directly. Second, we propose to exploit the spatial smoothness by resorting to the Laplacian of the 3D non-rigid shape. Third, to handle real world noise and outliers in measurements, we robustify the data term by using the $L_1$ norm. In this way, our method robustly exploits both spatial and temporal smoothness and makes dense non-rigid reconstruction easy. Our method is very easy to implement, as it only involves solving a series of least squares problems. Experimental results on both synthetic and real image dense NRSfM tasks show that the proposed method outperforms state-of-the-art dense non-rigid reconstruction methods.

Yuchao Dai, Huizhong Deng, Mingyi He
Research School of Engineering, Australian National University, Australia
School of Electronics and Information, Northwestern Polytechnical University, China

This work is supported in part by the Australian Research Council Grant (DE140100180) and National Natural Science Foundation of China (61420106007, 61671387). Yuchao Dai (daiyuchao@gmail.com) is the corresponding author.

Index Terms—  Non-rigid structure-from-motion, dense reconstruction, spatial-temporal smoothness

## 1 Introduction

Non-rigid structure-from-motion (NRSfM) aims at simultaneously recovering the camera motion and non-rigid structure from 2D images captured by a monocular camera. It is central to many computer vision applications (3D reconstruction, motion capture, human-computer interaction, etc.) and has received considerable attention in recent years. A great number of methods have been established, and most of the existing methods can be roughly categorized as sparse methods and dense methods [1][2][3][4][5][6][7].

NRSfM is in essence under-determined (a 3D point must be estimated from a single 2D measurement); therefore, extra regularization is needed to constrain the problem. For sparse NRSfM, various priors/constraints have been enforced, such as shape basis [1], trajectory basis [8], shape-trajectory basis [9][7] and smoothness [10]. In sparse reconstruction, the feature points are geometrically apart from each other, thus spatial regularization cannot be enforced. By contrast, dense NRSfM aims at achieving 3D non-rigid reconstruction for each pixel in a video sequence, where spatial constraints have been widely exploited to regularize the problem [11][12][13]. Garg et al. [14] proposed to enforce both the total variation constraint and the nuclear norm induced low-rank constraint on the 3D non-rigid shape. This results in a complex convex optimization, and a GPU is needed to speed up the implementation. Furthermore, they only applied the method to complete and noise-free datasets, thus its robustness remains questionable. In Russell et al.'s work [13], segmentation is performed on both object-level and part-level, then piece-wise reconstruction is applied by assuming locally rigid pieces. In [15], motion segmentation is paired with rank constrained 2D track completion to deal with occlusions, then nuclear norm minimization is used to recover the 3D shape. Yu et al. [16] proposed to utilize the temporal smoothness in both camera motion and 3D deformation, where a template 3D shape is available. Ranftl et al. [17] investigated the relative scale in dynamic scenes. All these constraints are based on motion and semantic segmentation, and are thus computationally complex.

In this paper, we look for a simple and elegant convex optimization for dense NRSfM that can be efficiently implemented on a CPU. We argue that the inherent spatial and temporal smoothness constraints can be well exploited to regularize the dense non-rigid reconstruction problem. First, we revisit the temporal smoothness in sparse reconstruction and demonstrate that it can be employed in the dense case directly. Second, to exploit the spatial smoothness in dense reconstruction, we resort to the Laplacian of the 3D non-rigid shape, which captures the local smoothness and is mathematically simple. Finally, to handle inevitable noise and outliers in real world image measurements, we robustify the data term by using the $L_1$ norm rather than the commonly used $L_2$ norm. In this way, our method robustly exploits the spatial-temporal smoothness in dense non-rigid reconstruction. Our method is very easy to implement, as it only involves solving a series of least squares problems. In Fig. 1, we demonstrate the contribution of each component. With the introduction of temporal smoothness, spatial smoothness and the robust cost function, the dense 3D non-rigid reconstruction is gradually improved.

## 2 Prerequisites

Dense NRSfM takes a 2D video obtained by a monocular camera as input, with $F$ image frames each containing $P$ pixels. In this paper, we assume the per-pixel feature tracks have been extracted, say by optical flow or dense matching. Thus, the input to our system is a feature track matrix $\mathbf{W}$, which stacks the following per-frame matrices: $\mathbf{W}_i = [\mathbf{x}_{i1}, \cdots, \mathbf{x}_{iP}]$, where $\mathbf{x}_{ij}$ denotes the $j$-th feature point captured in the $i$-th image frame. Assuming an orthographic camera model and that the camera has been centralized at the center of the object, we have $\mathbf{W}_i = \mathbf{R}_i\mathbf{S}_i$, where $\mathbf{R}_i$ is a $2 \times 3$ matrix that represents the first two rows of the rotation matrix of the $i$-th frame, and $\mathbf{S}_i$ is a $3 \times P$ matrix containing the 3D positions of every point in the $i$-th non-rigid shape. Stacking all the feature tracks for all frames gives:

$$\mathbf{W} = \mathbf{R}\mathbf{S}, \tag{1}$$

where the block-diagonal $\mathbf{R}$ and the stacked $\mathbf{S}$ are of dimension $2F \times 3F$ and $3F \times P$, respectively. Dense NRSfM aims at simultaneously recovering both the camera motion $\mathbf{R}$ and the non-rigid shape $\mathbf{S}$ from the feature track matrix $\mathbf{W}$. The problem is inherently under-determined, as the number of variables to estimate (at least the $3FP$ coordinates of $\mathbf{S}$) exceeds the number of measurements ($2FP$). Therefore, extra constraints are needed to regularize the problem.
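The stacked orthographic model of Eq. (1) can be illustrated with a small NumPy sketch. This is only a toy under stated assumptions: the sizes `F` and `P`, the random rotations, and the helper names `random_rotation` and `stack_rotations` are made up for illustration and are not part of the paper.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random proper 3x3 rotation matrix via a QR decomposition."""
    Q, T = np.linalg.qr(rng.standard_normal((3, 3)))
    Q = Q * np.sign(np.diag(T))      # fix the sign ambiguity of QR
    if np.linalg.det(Q) < 0:         # enforce det = +1
        Q[:, 0] = -Q[:, 0]
    return Q

def stack_rotations(rotations):
    """Put the first two rows of every 3x3 rotation on the block diagonal."""
    F = len(rotations)
    R = np.zeros((2 * F, 3 * F))
    for i, Ri in enumerate(rotations):
        R[2 * i:2 * i + 2, 3 * i:3 * i + 3] = Ri[:2, :]
    return R

rng = np.random.default_rng(0)
F, P = 4, 50                          # hypothetical number of frames and points
rotations = [random_rotation(rng) for _ in range(F)]
R = stack_rotations(rotations)        # 2F x 3F
S = rng.standard_normal((3 * F, P))   # 3F x P, one 3 x P shape per frame
W = R @ S                             # 2F x P feature track matrix

print(W.shape, R.shape, S.shape)
```

Each pair of rows of `W` is the projection of one frame, so `W` carries `2 * F * P` numbers while `S` alone has `3 * F * P` unknowns.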

Under dense NRSfM, we solve for the camera rotation $\mathbf{R}$ by utilizing the low-rank structure of $\mathbf{W}$. Even though we have to deal with tens of thousands of points, the rotation estimation method in [10] can still handle it, as its computational complexity is independent of the number of points and depends only on the model complexity.

## 3 Formulation and Solution

In this paper, we propose to exploit the generic and generally available smoothness from temporal direction and spatial direction. By jointly enforcing the spatial and temporal constraints, we are able to achieve dense non-rigid reconstruction in an easy and elegant way.

### 3.1 Temporal Smoothness Revisited

First, we revisit the temporal smoothness, which has been widely used in sparse NRSfM [10]. We argue that this simple strategy can efficiently achieve performance comparable with complex convex optimization or ADMM based methods.

By introducing smooth deformation regularization [18][19][20], we can formulate the non-rigid shape recovery problem as minimizing a data term evaluated on the image measurements and a regularization term based on temporal smoothness, thus reaching the following optimization:

$$\min_{\mathbf{S}} \frac{1}{2}\|\mathbf{W}-\mathbf{R}\mathbf{S}\|_F^2 + \frac{1}{2}\lambda\|\mathbf{H}\mathbf{S}\|_F^2, \tag{2}$$

where the first term measures the reprojection error evaluated on the image plane while the second term enforces the temporal smoothness constraint. Different smoothing operators $\mathbf{H}$ can be applied to characterize various kinds of smoothness in the temporal direction, e.g. first order smoothness as in Eq.-(3), second order smoothness, etc.

$$\mathbf{H}_{ij} = \begin{cases} 1, & j = i,\ i = 1, \cdots, 3(F-1), \\ -1, & j = i+3,\ i = 1, \cdots, 3(F-1), \\ 0, & \text{otherwise}. \end{cases} \tag{3}$$

The resultant optimization problem in Eq.-(2) admits an analytical (closed-form) solution,

$$\mathbf{S}_{\mathrm{smooth}} = (\mathbf{R}^T\mathbf{R} + \lambda\mathbf{H}^T\mathbf{H})^{\dagger}\mathbf{R}^T\mathbf{W}. \tag{4}$$
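For completeness, the closed form follows from setting the gradient of Eq.-(2) with respect to $\mathbf{S}$ to zero. The short derivation below is a reader's sketch that uses only the symbols of Eq.-(2):

```latex
\begin{aligned}
\nabla_{\mathbf{S}}\Big[\tfrac{1}{2}\|\mathbf{W}-\mathbf{R}\mathbf{S}\|_F^2
      +\tfrac{1}{2}\lambda\|\mathbf{H}\mathbf{S}\|_F^2\Big]
  &= -\mathbf{R}^T(\mathbf{W}-\mathbf{R}\mathbf{S})
     +\lambda\,\mathbf{H}^T\mathbf{H}\,\mathbf{S} = \mathbf{0} \\
\Longrightarrow\quad
(\mathbf{R}^T\mathbf{R}+\lambda\,\mathbf{H}^T\mathbf{H})\,\mathbf{S}
  &= \mathbf{R}^T\mathbf{W}.
\end{aligned}
```

Whenever the matrix on the left-hand side is invertible, the pseudo-inverse in the closed form coincides with the ordinary inverse.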

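The rotation matrix $\mathbf{R}$ is of full row rank, thus $\mathbf{R}^T\mathbf{R}$ is of rank $2F$ generally. The smoothness matrix $\mathbf{H}$ is rank deficient too, thus $\mathbf{H}^T\mathbf{H}$ is of rank $3(F-1)$ (for first order smoothness). In the general case, $\mathbf{R}^T\mathbf{R} + \lambda\mathbf{H}^T\mathbf{H}$ is a full rank matrix, thus invertible.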

The 3D non-rigid shape generated by this solution depends on the choice of the trade-off parameter $\lambda$, which trades off between 2D reprojection error and temporal smoothness. When $\lambda$ approaches 0, the solution approaches $\mathbf{S}_{\mathrm{PI}} = \mathbf{R}^{\dagger}\mathbf{W}$, i.e. the pseudo-inverse solution. When $\lambda$ is large enough, the solution approaches a rigid shape, which minimizes the combination of $\|\mathbf{W}-\mathbf{R}\mathbf{S}\|_F^2$ and $\|\mathbf{H}\mathbf{S}\|_F^2$. When $\lambda$ approaches $\infty$, the solution approaches a trivial solution [10].
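A minimal NumPy sketch of the first-order operator of Eq.-(3) and the closed-form solution of Eq.-(4) is given below. The toy data (a rigid shape seen by cameras rotating about the y-axis), the sizes, and the helper names `first_order_H`, `temporal_smooth_shape` and `camera_about_y` are assumptions made for illustration only. A rigid shape satisfies $\mathbf{H}\mathbf{S}=\mathbf{0}$, so on this toy example the temporally regularized solution recovers it, while the per-frame pseudo-inverse loses the depth component.

```python
import numpy as np

def first_order_H(F):
    """First-order temporal difference operator of Eq. (3): 3(F-1) x 3F."""
    n = 3 * (F - 1)
    H = np.zeros((n, 3 * F))
    idx = np.arange(n)
    H[idx, idx] = 1.0         # entry (i, i)   = +1
    H[idx, idx + 3] = -1.0    # entry (i, i+3) = -1
    return H

def temporal_smooth_shape(W, R, lam):
    """Closed-form minimiser of Eq. (2), written as in Eq. (4)."""
    F = R.shape[0] // 2
    H = first_order_H(F)
    M = R.T @ R + lam * (H.T @ H)
    return np.linalg.pinv(M) @ (R.T @ W)

def camera_about_y(theta):
    """First two rows of a rotation by theta about the y-axis (toy camera)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0]])

# Toy data (assumed): the same 3 x P shape observed in F frames.
rng = np.random.default_rng(0)
F, P = 6, 40
R = np.zeros((2 * F, 3 * F))
for i in range(F):
    R[2 * i:2 * i + 2, 3 * i:3 * i + 3] = camera_about_y(0.3 * i)
S_true = np.tile(rng.standard_normal((3, P)), (F, 1))   # rigid: H @ S_true = 0
W = R @ S_true

S_hat = temporal_smooth_shape(W, R, lam=1.0)
S_pi = np.linalg.pinv(R) @ W                             # pseudo-inverse solution
print(np.abs(S_hat - S_true).max(), np.abs(S_pi - S_true).max())
```

For real sequences the shape deforms over time, so the choice of `lam` matters; the rigid case is used here only because its correct answer is known.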

Connection: The smoothness constrained solution and the pseudo-inverse solution are connected as:

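$$\mathbf{S}_{\mathrm{smooth}} = (\mathbf{R}^T\mathbf{R} + \lambda\mathbf{H}^T\mathbf{H})^{\dagger}\mathbf{R}^T\mathbf{W} = (\mathbf{R}^T\mathbf{R} + \lambda\mathbf{H}^T\mathbf{H})^{\dagger}\mathbf{S}_{\mathrm{PI}}. \tag{5}$$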

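Therefore, $\mathbf{S}_{\mathrm{smooth}}$ is a linear transformation of $\mathbf{S}_{\mathrm{PI}}$. As proved in [10], the pseudo-inverse solution is a degenerate case where the non-rigid shape at each frame lies on a plane. $\mathbf{S}_{\mathrm{smooth}}$ can be viewed as a per-frame weighted version of $\mathbf{S}_{\mathrm{PI}}$.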
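The second equality of Eq.-(5) relies on the pseudo-inverse solution being equal to $\mathbf{R}^T\mathbf{W}$. A brief justification, assuming (as in Section 2) that each frame's block of $\mathbf{R}$ consists of the first two rows of a rotation matrix and that $\mathbf{R}$ is block-diagonal:

```latex
\mathbf{R}_i\mathbf{R}_i^T = \mathbf{I}_2
\;\Longrightarrow\;
\mathbf{R}\mathbf{R}^T = \mathbf{I}_{2F}
\;\Longrightarrow\;
\mathbf{R}^{\dagger} = \mathbf{R}^T(\mathbf{R}\mathbf{R}^T)^{-1} = \mathbf{R}^T
\;\Longrightarrow\;
\mathbf{S}_{\mathrm{PI}} = \mathbf{R}^{\dagger}\mathbf{W} = \mathbf{R}^T\mathbf{W}.
```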

### 3.2 Spatial Smoothness Simplified

The temporal smoothness constrains the dense non-rigid reconstruction along the temporal dimension, i.e., it enforces the smoothness of each 3D trajectory. However, it cannot regularize the 3D shape within each frame. Garg et al. [14] proposed to use the total variation to encourage spatial smoothness while maintaining sharp boundaries. The complexity of the resultant optimization prohibits its real world application to large scale 3D reconstruction.

To efficiently and effectively utilize the smoothness along the spatial dimension, we propose a simple filtering mechanism, namely the Laplacian filter, which enforces spatial smoothness locally in the 3D shape space. In Fig. 2, we illustrate different 2D filters for enforcing spatial smoothness. The Laplacian filter enforces a locally linear/planar model, which provides an easy way to encourage second order smoothness. As all linear filtering can be equivalently expressed as matrix multiplication, for the recovered non-rigid shape $\mathbf{S}$, the filtering output is defined as:

$$\mathbf{A}\,\mathrm{vec}(\mathbf{S}), \tag{6}$$

where $\mathbf{A}$ is a matrix containing all the filtering operations; each row of $\mathbf{A}$ defines a spatial filter evaluated at the position of one entry of $\mathrm{vec}(\mathbf{S})$.
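As a rough illustration of such a filtering matrix, the sketch below builds a 4-neighbour graph Laplacian `L` on a regular pixel grid and applies it to every row of a shape matrix. The regular grid, the 4-neighbourhood and the function name `grid_laplacian` are assumptions of this sketch; the experiments in Section 4 take the neighbourhoods from a quad mesh instead. With column-major vectorization, applying `L` to each row of $\mathbf{S}$ corresponds to $\mathbf{A} = \mathbf{L} \otimes \mathbf{I}$ in the filtering expression above.

```python
import numpy as np

def grid_laplacian(rows, cols):
    """4-neighbour graph Laplacian (P x P, P = rows*cols) on a pixel grid.

    Row p computes degree(p) * value(p) minus the sum over the neighbours of p.
    """
    P = rows * cols
    L = np.zeros((P, P))
    for r in range(rows):
        for c in range(cols):
            p = r * cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    L[p, p] += 1.0
                    L[p, rr * cols + cc] -= 1.0
    return L

rows, cols = 4, 5
L = grid_laplacian(rows, cols)

# One frame of a planar surface z = 0.5 x - 0.2 y + 1 sampled on the grid.
yy, xx = np.mgrid[0:rows, 0:cols]
x, y = xx.ravel().astype(float), yy.ravel().astype(float)
S_plane = np.vstack([x, y, 0.5 * x - 0.2 * y + 1.0])          # 3 x P
interior = ((xx > 0) & (xx < cols - 1) & (yy > 0) & (yy < rows - 1)).ravel()

resp_plane = S_plane @ L.T            # filter response for every coordinate row
S_bump = S_plane.copy()
p_bump = 1 * cols + 2                 # interior pixel at (row 1, col 2)
S_bump[2, p_bump] += 1.0              # lift one point off the plane
resp_bump = S_bump @ L.T

# Equivalent explicit matrix acting on the column-major vec(S).
A = np.kron(L, np.eye(S_plane.shape[0]))
print(np.abs(resp_plane[:, interior]).max(), resp_bump[2, p_bump])
```

The planar surface gives a zero response at interior pixels, which matches the locally linear/planar model described above, while the lifted point is penalized. For tens of thousands of points a dense `L` is impractical, and a sparse representation would be needed.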

Spatial smoothness is effective in smoothing 3D reconstruction. However, the spatial smoothness itself is not sufficient to recover the correct shape. Without temporal constraint, the result will be close to the pseudo-inverse case, which lies in a plane. By putting spatial smoothness and temporal smoothness together, we are able to achieve reliable 3D reconstruction even from noisy 2D inputs.

### 3.3 Optimization Robustified

Noise and outliers are inevitable in real world measurements, and dense NRSfM methods must handle them robustly. Most of the existing methods apply the $L_2$ norm on the data term, and thus cannot handle noise and outliers well. We propose to replace the $L_2$ norm with the $L_1$ norm, thus increasing the robustness of the data term $\|\mathbf{W}-\mathbf{R}\mathbf{S}\|_1$.

To deal with the convex $L_1$ norm efficiently, we propose to use iteratively reweighted least squares (IRLS), where we solve a least squares problem in each iteration. Figure 1 illustrates the performance of the $L_1$-norm on data with outliers. It shows that our $L_1$-norm relaxation gives better performance on data with outliers.
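The IRLS idea can be sketched on a generic $L_1$ regression problem. The example below is not the paper's solver: the line-fitting data, the damping constant `eps`, the iteration count and the function name `irls_l1` are all assumptions chosen to keep the illustration small.

```python
import numpy as np

def irls_l1(A, b, n_iter=50, eps=1e-6):
    """Approximate argmin_x ||A x - b||_1 by iteratively reweighted least squares."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]           # plain least squares start
    for _ in range(n_iter):
        r = A @ x - b
        # Row weights 1/sqrt(|r_i|): the weighted squared residual
        # r_i^2 / |r_i_old| mimics |r_i| around the current iterate.
        w = 1.0 / np.sqrt(np.maximum(np.abs(r), eps))
        x = np.linalg.lstsq(A * w[:, None], b * w, rcond=None)[0]
    return x

# Toy line fit y = 2 t + 1 with 10% gross outliers (assumed data).
t = np.linspace(0.0, 1.0, 50)
A = np.column_stack([np.ones_like(t), t])
x_true = np.array([1.0, 2.0])
b = A @ x_true
b[::10] += 10.0                                        # corrupt every 10th sample

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]            # L2 fit, pulled by outliers
x_l1 = irls_l1(A, b)                                   # IRLS fit
print(np.linalg.norm(x_ls - x_true), np.linalg.norm(x_l1 - x_true))
```

Each pass solves only a weighted least squares problem, which is the property the method relies on; samples with large residuals receive small weights and therefore have little influence on the next solve.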

### 3.4 Spatial-Temporal smoothness constraint

By enforcing the spatial-temporal smoothness constraint and applying the robust $L_1$ norm on the data term, we reach:

$$\min_{\mathbf{S}} \|\mathbf{W}-\mathbf{R}\mathbf{S}\|_1 + \lambda_1\|\mathbf{H}\mathbf{S}\|_F^2 + \lambda_2\|\mathbf{A}\,\mathrm{vec}(\mathbf{S})\|_F^2, \tag{7}$$

where $\lambda_1$ and $\lambda_2$ are the trade-off parameters. The three terms are the “data term”, “temporal smoothness term” and “spatial smoothness term”, respectively. Under the IRLS formulation, we solve the following least squares problem in each iteration:

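$$\min_{\mathbf{S}} \|\mathbf{E}(\mathbf{W}-\mathbf{R}\mathbf{S})\|_F^2 + \lambda_1\|\mathbf{H}\mathbf{S}\|_F^2 + \lambda_2\|\mathbf{A}\,\mathrm{vec}(\mathbf{S})\|_F^2. \tag{8}$$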

A closed-form solution can be derived by using the first order condition. However, the computational complexity is high due to the filtering matrix $\mathbf{A}$. Instead, we propose to solve the least squares problem with gradient descent, where the gradient is derived as:

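$$g(\mathbf{S}) = 2\mathbf{R}^T\mathbf{E}^T\mathbf{E}\mathbf{R}\mathbf{S} - 2\mathbf{R}^T\mathbf{E}^T\mathbf{E}\mathbf{W} + 2\lambda_1\mathbf{H}^T\mathbf{H}\mathbf{S} + 2\lambda_2\,\mathrm{ivec}\big((\mathbf{A}^T\mathbf{A})\,\mathrm{vec}(\mathbf{S})\big), \tag{9}$$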

where $\mathrm{ivec}(\cdot)$ denotes the inverse operator of vectorization, which transforms a vector into a matrix of proper dimension.
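The gradient step can be sketched in NumPy as follows. Several choices here are assumptions rather than details from the paper: the weight matrix `E` is taken as a fixed diagonal matrix, the spatial filter is a 1D chain Laplacian `L` applied as `S @ L.T` (equivalent to $\mathbf{A} = \mathbf{L} \otimes \mathbf{I}$ under column-major vectorization), the data are random, and the step size is the reciprocal of an upper bound on the Lipschitz constant of the gradient. The data part of the gradient is written in the factored form $2\mathbf{R}^T\mathbf{E}^T\mathbf{E}(\mathbf{R}\mathbf{S}-\mathbf{W})$, which is what differentiating the weighted least squares objective gives.

```python
import numpy as np

def cost(S, W, R, E, H, L, lam1, lam2):
    """Weighted least squares objective with temporal and spatial smoothness."""
    data = np.sum((E @ (W - R @ S)) ** 2)
    temporal = np.sum((H @ S) ** 2)
    spatial = np.sum((S @ L.T) ** 2)
    return data + lam1 * temporal + lam2 * spatial

def gradient(S, W, R, E, H, L, lam1, lam2):
    """Gradient of `cost` with respect to S."""
    EtE = E.T @ E
    return (2 * R.T @ EtE @ (R @ S - W)
            + 2 * lam1 * H.T @ H @ S
            + 2 * lam2 * S @ (L.T @ L))

def gradient_descent(S0, W, R, E, H, L, lam1, lam2, n_iter=200):
    # Step 1/Lc, where Lc upper-bounds the Lipschitz constant of the gradient.
    Lc = 2 * (np.linalg.norm(E @ R, 2) ** 2
              + lam1 * np.linalg.norm(H, 2) ** 2
              + lam2 * np.linalg.norm(L, 2) ** 2)
    S = S0.copy()
    for _ in range(n_iter):
        S = S - gradient(S, W, R, E, H, L, lam1, lam2) / Lc
    return S

# Toy problem (all sizes and values are assumed).
rng = np.random.default_rng(0)
F, P = 3, 6
R = np.zeros((2 * F, 3 * F))
for i in range(F):
    c, s = np.cos(0.4 * i), np.sin(0.4 * i)
    R[2 * i:2 * i + 2, 3 * i:3 * i + 3] = [[c, 0.0, s], [0.0, 1.0, 0.0]]
H = np.eye(3 * (F - 1), 3 * F) - np.eye(3 * (F - 1), 3 * F, k=3)
Adj = np.eye(P, k=1) + np.eye(P, k=-1)          # chain neighbourhood
L = np.diag(Adj.sum(axis=1)) - Adj
E = np.diag(rng.uniform(0.5, 1.5, size=2 * F))  # stand-in IRLS weights
W = rng.standard_normal((2 * F, P))
S0 = np.zeros((3 * F, P))
lam1, lam2 = 1.0, 0.1

S_gd = gradient_descent(S0, W, R, E, H, L, lam1, lam2)
print(cost(S0, W, R, E, H, L, lam1, lam2), cost(S_gd, W, R, E, H, L, lam1, lam2))
```

In a full IRLS loop the weights `E` would be recomputed from the current residual between gradient descent runs; here they are kept fixed to isolate the inner least squares problem.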

## 4 Experimental Results

Setting up: To evaluate our method against existing state-of-the-art dense NRSfM methods, we used the 4 dense synthetic sequences and 3 real videos from [14]. Each sequence contains a 2D correspondence matrix and a quad mesh for neighborhood assignment. These sequences have over 20,000 trajectories forming dense surfaces, which makes the problem much more challenging than the sparse scenarios.

We first enforced the temporal smoothness constraint to obtain an initial 3D non-rigid reconstruction. Then our method runs iteratively to optimize the cost function with spatial-temporal constraints. The trade-off parameters are set as , .

On synthetic face sequences, the results of our method are shown in Fig. 3. We overlay the ground truth shape in red and our 3D reconstruction in blue. These figures show that our method can reconstruct the 3D object quite accurately. Table 1 shows the quantitative evaluation of our method along with various other methods, including Trajectory Basis (PTA) [8], Metric Projection [4] and the Variational method [14]. As shown in the table, our method achieves competitive performance with the state-of-the-art methods. It is worth noting that our method is easy to implement, as it only involves a series of least squares problems.

For dense sequences obtained from real videos, the input 2D video tracks and results obtained by our method are shown in Fig. 4. As shown in the figures, our method outputs reasonable results on the Face and Back sequences, while on the challenging Heart sequence that has both large deformations and small rotation, our result seems to be too flat. This emphasizes the importance of a correct rotation matrix.

Dense input with noise: As stated in previous sections, assuming a smooth 3D surface, our spatial smoothness constraint encourages local smoothness, hence increasing the accuracy and resolution. To evaluate the performance of our method under noise, we added Gaussian noise to the 2D input images, with the standard deviation determined by a noise ratio ranging from 0.01 to 0.05. Each noise setting is repeated 5 times to obtain statistical results.

Figure 5 shows the performance of our method under different noise ratios on 4 synthetic sequences. It shows that even at large noise ratios, the 3D error of our method is still kept at a low level.

Dense input with outliers: To evaluate the capability of our method in dealing with outliers, we performed experiments with the following settings: a certain fraction of the tracked 2D points in the video are set at random positions. The outlier ratio is set to 2%, 4%, 6%, 8% and 10%, respectively. We compute the final 3D error by averaging 5 trials, in order to get a statistically accurate result.
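One possible way to generate such corrupted input is sketched below. The details are assumptions: outlier positions are drawn uniformly within the value range of the track matrix, the track matrix itself is random, and the function name `inject_outliers` is hypothetical; the paper does not specify how the random positions are sampled.

```python
import numpy as np

def inject_outliers(W, ratio, rng):
    """Replace a fraction `ratio` of the F*P tracked 2D points by random positions."""
    F, P = W.shape[0] // 2, W.shape[1]
    n_out = int(round(ratio * F * P))
    picks = rng.choice(F * P, size=n_out, replace=False)   # distinct (frame, point) pairs
    frames, points = np.divmod(picks, P)
    W_out = W.copy()
    lo, hi = W.min(), W.max()                              # assumed sampling range
    W_out[2 * frames, points] = rng.uniform(lo, hi, size=n_out)      # x coordinates
    W_out[2 * frames + 1, points] = rng.uniform(lo, hi, size=n_out)  # y coordinates
    mask = np.zeros((F, P), dtype=bool)
    mask[frames, points] = True
    return W_out, mask

rng = np.random.default_rng(0)
F, P = 10, 100
W = rng.standard_normal((2 * F, P))       # stand-in for a real track matrix
W_out, mask = inject_outliers(W, ratio=0.04, rng=rng)
print(mask.sum(), "of", F * P, "points corrupted")
```

The returned `mask` marks which points were replaced, which is convenient when checking how a robust data term treats them.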

Figure 5 illustrates the performance of our method under different outlier ratios. As the outlier ratio increases, the 3D error increases slightly, staying below 0.1 for all synthetic sequences. The error curves are almost linear, which demonstrates the robustness of our method.

## 5 Conclusions

In this paper, we propose a unified framework for dense non-rigid 3D reconstruction, which utilizes both spatial and temporal smoothness to regularize the under-constrained problem. Furthermore, the cost function has been robustified to deal with real world noise and outliers. Our method achieves competitive performance with state-of-the-art dense NRSfM methods. The implementation of our method only involves solving a series of least squares problems, thus making dense NRSfM easy.

## References

• [1] Christoph Bregler, Aaron Hertzmann, and Henning Biermann, “Recovering non-rigid 3D shape from image streams,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000, pp. 690–696.
• [2] Jing Xiao, Jinxiang Chai, and Takeo Kanade, “A closed-form solution to non-rigid shape and motion recovery,” in Proc. European Conf. Computer Vision, 2004, vol. 3024, pp. 573–587.
• [3] Lorenzo Torresani, Aaron Hertzmann, and Chris Bregler, “Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 5, pp. 878–892, 2008.
• [4] Marco Paladini, Alessio Del Bue, João Xavier, Lourdes Agapito, Marko Stosic, and Marija Dodig, “Optimal metric projections for deformable and articulated structure-from-motion,” Int. J. Comput. Vision, vol. 96, no. 2, pp. 252–276, Jan. 2012.
• [5] Yuchao Dai, Hongdong Li, and Mingyi He, “A simple prior-free method for non-rigid structure-from-motion factorization,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 2018–2025.
• [6] Minsik Lee, Jungchan Cho, Chong-Ho Choi, and Songhwai Oh, “Procrustean normal distribution for non-rigid structure from motion,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2013, pp. 1280–1287.
• [7] Tomas Simon, Jack Valmadre, Iain Matthews, and Yaser Sheikh, “Separable spatiotemporal priors for convex reconstruction of time-varying 3D point clouds,” in Proc. European Conf. Computer Vision, 2014, pp. 204–219.
• [8] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade, “Trajectory space: A dual representation for nonrigid structure from motion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 7, pp. 1442–1456, July 2011.
• [9] P.F.U. Gotardo and A.M. Martinez, “Non-rigid structure from motion with complementary rank-3 spaces,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011, pp. 3065–3072.
• [10] Yuchao Dai, Hongdong Li, and Mingyi He, “A simple prior-free method for non-rigid structure-from-motion factorization,” International Journal of Computer Vision, vol. 107, no. 2, pp. 101–122, 2014.
• [11] C. Russell, J. Fayad, and L. Agapito, “Dense non-rigid structure from motion,” in International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2012, pp. 509–516.
• [12] Ravi Garg, Anastasios Roussos, and Lourdes Agapito, “A variational approach to video registration with subspace constraints,” International Journal of Computer Vision, pp. 1–29, 2013.
• [13] Chris Russell, Rui Yu, and Lourdes Agapito, “Video pop-up: Monocular 3d reconstruction of dynamic scenes,” in European Conference on Computer Vision. Springer, 2014, pp. 583–598.
• [14] R. Garg, A. Roussos, and L. Agapito, “Dense variational reconstruction of non-rigid surfaces from monocular video,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2013, pp. 1272–1279.
• [15] Katerina Fragkiadaki, Marta Salas, Pablo Arbelaez, and Jitendra Malik, “Grouping-based low-rank trajectory completion and 3d reconstruction,” in Advances in Neural Information Processing Systems 27, pp. 55–63. 2014.
• [16] Rui Yu, Chris Russell, Neill D. F. Campbell, and Lourdes Agapito, “Direct, dense, and deformable: Template-based non-rigid 3d reconstruction from rgb video,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.
• [17] Rene Ranftl, Vibhav Vineet, Qifeng Chen, and Vladlen Koltun, “Dense monocular depth estimation in complex dynamic scenes,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
• [18] Henrik Aanæs and Fredrik Kahl, “Estimation of deformable structure and motion,” in ECCV Workshop on Vision and Modelling of Dynamic Scenes, 2002, pp. 1–4.
• [19] S. I. Olsen and A. Bartoli, “Implicit non-rigid structure-from-motion with priors,” J. Math. Imaging Vis., vol. 31, no. 2-3, pp. 233–244, July 2008.
• [20] J. Valmadre and S. Lucey, “General trajectory prior for non-rigid reconstruction,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 1394–1401.