Fast Low-Rank Bayesian Matrix Completion with Hierarchical Gaussian Prior Models
The problem of low-rank matrix completion is considered in this paper. To exploit the underlying low-rank structure of the data matrix, we propose a hierarchical Gaussian prior model, where columns of the low-rank matrix are assumed to follow a Gaussian distribution with zero mean and a common precision matrix, and a Wishart distribution is specified as a hyperprior over the precision matrix. We show that such a hierarchical Gaussian prior has the potential to encourage a low-rank solution. Based on the proposed hierarchical prior model, we develop a variational Bayesian matrix completion method which embeds the generalized approximate message passing (GAMP) technique to circumvent cumbersome matrix inverse operations. Simulation results show that our proposed method demonstrates superiority over several state-of-the-art matrix completion methods.
Matrix completion, low-rank Bayesian learning, generalized approximate message passing.
The problem of recovering a partially observed matrix, which is referred to as matrix completion, arises in a variety of applications, including recommender systems [1, 2, 3], genotype prediction [4, 5], image classification [6, 7], network traffic prediction , and image imputation . Low-rank matrix completion, which is motivated by the fact that many real-world data lie in an intrinsically low-dimensional subspace, has attracted much attention over the past few years. Mathematically, a canonical form of the low-rank matrix completion problem can be presented as
where is an unknown low-rank matrix, is a binary matrix that indicates which entries of are observed, denotes the Hadamard product, and is the observed matrix. It has been shown that the low-rank matrix can be exactly recovered from (1) under some mild conditions . Nevertheless, minimizing the rank of a matrix is an NP-hard problem, and no polynomial-time algorithm is known. To overcome this difficulty, alternative low-rank-promoting functionals were proposed. Among them, the most popular is the nuclear norm, which is defined as the sum of the singular values of a matrix. Replacing the rank function with the nuclear norm yields the following convex optimization problem
It was proved that the nuclear norm is the tightest convex envelope of the matrix rank, and theoretical recovery guarantees for (2) under both the noiseless and noisy cases were provided in [10, 11, 12, 13]. To solve (2), a number of computationally efficient methods were developed. A well-known method is the singular value thresholding method which was proposed in . Another efficient method was proposed in , in which an augmented Lagrange multiplier technique was employed. Apart from convex relaxation, non-convex surrogate functions, such as the log-determinant function, were also introduced to replace the rank function [16, 17, 18, 19]. Non-convex methods usually claim better recovery performance, since non-convex surrogate functions behave more like the rank function than the nuclear norm does. It is noted that for both convex and non-convex methods, one needs to meticulously select regularization parameters to properly control the tradeoff between the matrix rank and the data fitting error when noise is involved. However, due to the lack of knowledge of the noise variance and the rank, it is usually difficult to determine appropriate regularization parameters.
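As an illustration of this line of work, singular value thresholding admits a compact implementation. The sketch below is a minimal version for a problem of the form (2); the variable names, parameter values, and iteration scheme are our own illustrative choices rather than those of the cited papers:

```python
import numpy as np

def svt_complete(Y, Omega, tau=2.0, step=1.2, n_iters=300):
    """Minimal singular value thresholding (SVT) sketch for nuclear-norm
    matrix completion.  Y: observed matrix (zeros at missing entries),
    Omega: binary mask of observed entries."""
    Z = np.zeros_like(Y)
    X = Z
    for _ in range(n_iters):
        # Shrink the singular values of the current iterate by tau.
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        X = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
        # Gradient step enforcing agreement on the observed entries only.
        Z = Z + step * Omega * (Y - X)
    return X
```

On small synthetic problems this recovers the observed entries closely while keeping the iterate low-rank; the step size must stay below 2 for convergence.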
Another important class of low-rank matrix completion methods is Bayesian methods [20, 21, 22, 23, 24], which model the problem in a Bayesian framework and have the ability to achieve an automatic balance between the low-rankness and the fitting error. Specifically, in , the low-rank matrix is expressed as a product of two factor matrices, i.e. , and the matrix completion problem is translated into searching for these two factor matrices and . To encourage a low-rank solution, sparsity-promoting priors  are placed on the columns of the two factor matrices, which aims to promote structured-sparse factor matrices with only a few non-zero columns, and in turn leads to a low-rank matrix . Nevertheless, this Bayesian method updates the factor matrices in a row-by-row fashion and needs to perform a number of matrix inverse operations at each iteration. To address this issue, a bilinear generalized approximate message passing (GAMP) method was developed to learn the two factor matrices and [22, 23], without involving any matrix inverse operations. This method, however, cannot automatically determine the matrix rank and needs to try out all possible values of the rank. Recently, a new Bayesian prior model was proposed in , in which columns of the low-rank matrix follow a zero-mean Gaussian distribution with an unknown deterministic covariance matrix that can be estimated via Type II maximum likelihood. It was shown that maximizing the marginal likelihood function yields a low-rank covariance matrix, which implies that the prior model has the ability to promote a low-rank solution. A major drawback of this method is that it requires computing the inverse of an matrix at each iteration, and thus has a cubic complexity in the problem size. This high computational cost prohibits its application to many practical problems.
In this paper, we develop a new Bayesian method for low-rank matrix completion. To exploit the underlying low-rank structure of the data matrix, a low-rank promoting hierarchical Gaussian prior model is proposed. Specifically, columns of the low-rank matrix are assumed to be mutually independent and follow a common Gaussian distribution with zero mean and a precision matrix. The precision matrix is treated as a random parameter, with a Wishart distribution specified as a hyperprior over it. We show that such a hierarchical Gaussian prior model has the potential to encourage a low-rank solution. The GAMP technique is employed and embedded in the variational Bayesian (VB) inference, which results in an efficient VB-GAMP algorithm for matrix completion. Note that due to the non-factorizable form of the prior distribution, the GAMP technique cannot be directly used. To address this issue, we construct a carefully devised surrogate problem whose posterior distribution is exactly the one required for VB inference. Meanwhile, the surrogate problem has factorizable prior and noise distributions such that the GAMP can be directly applied to obtain an approximate posterior distribution. Such a trick helps achieve a substantial computational complexity reduction, and makes it possible to successfully apply the proposed method to solve large-scale matrix completion problems.
The rest of the paper is organized as follows. In Section 2, we introduce a hierarchical Gaussian prior model for low-rank matrix completion. Based on this hierarchical model, a variational Bayesian method is developed in Section 3. In Section 4, a VB-GAMP method is proposed to reduce the computational complexity of the proposed algorithm. Simulation results are provided in Section 5, followed by concluding remarks in Section 6.
2 Bayesian Modeling
In the presence of noise, the canonical form of the matrix completion problem can be formulated as
where denotes the additive noise, and is a binary matrix that indicates which entries are observed. Without loss of generality, we assume . As indicated earlier, minimizing the rank of a matrix is an NP-hard problem. In this paper, we consider modeling the matrix completion problem within a Bayesian framework.
We assume entries of are independent and identically distributed (i.i.d.) random variables following a Gaussian distribution with zero mean and variance . To learn , a Gamma hyperprior is placed over , i.e.
where is the Gamma function. The parameters and are set to small values, e.g. , which makes the Gamma distribution a non-informative prior.
To promote the low-rankness of , we propose a two-layer hierarchical Gaussian prior model (see Fig. 1). Specifically, in the first layer, the columns of are assumed mutually independent and follow a common Gaussian distribution:
where denotes the th column of , and is the precision matrix. The second layer specifies a Wishart distribution as a hyperprior over the precision matrix :
where and denote the degrees of freedom and the scale matrix of the Wishart distribution, respectively. Note that the constraint for the standard Wishart distribution can be relaxed to if an improper prior is allowed, e.g. . In Bayesian inference, improper prior distributions can often be used provided that the corresponding posterior distribution can be correctly normalized .
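As a concrete illustration, the two-layer prior can be sampled directly: draw a precision matrix from the Wishart hyperprior, then draw the columns from the resulting zero-mean Gaussian. The sketch below assumes an integer number of degrees of freedom at least as large as the matrix dimension (so the outer-product construction of a Wishart draw applies); all names are ours:

```python
import numpy as np

def sample_hierarchical_prior(m, L, df, scale, rng=None):
    """Draw an m x L matrix from the two-layer prior:
    precision W ~ Wishart(df, scale), then columns x_l ~ N(0, W^{-1}).
    Uses the construction W = A A^T with the df columns of A drawn
    i.i.d. from N(0, scale); valid for integer df >= m."""
    rng = np.random.default_rng(rng)
    C = np.linalg.cholesky(scale)
    A = C @ rng.standard_normal((m, df))
    W = A @ A.T                       # precision matrix draw
    Sigma = np.linalg.inv(W)          # common column covariance
    Cs = np.linalg.cholesky(Sigma)
    return Cs @ rng.standard_normal((m, L))
```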
The Gaussian-inverse Wishart prior has the potential to encourage a low-rank solution. To illustrate this low-rankness promoting property, we integrate out the precision matrix and obtain the marginal distribution of as (details of the derivation can be found in Appendix 7)
From (7), we have
If we choose , and let be a small positive value, the log-marginal distribution becomes
where denotes the th eigenvalue of . Clearly, in this case, the prior encourages a low-rank solution . This is because maximizing the prior distribution is equivalent to minimizing with respect to . It is well known that the log-sum function is an effective sparsity-promoting functional which encourages a sparse solution of [28, 29, 30]. As a result, the resulting matrix has a low-rank structure.
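The low-rank-promoting effect of the log-sum functional can be checked numerically: for matrices of equal Frobenius norm, the penalty is markedly smaller for a low-rank matrix, since most eigenvalues sit near zero where the logarithm is strongly negative. A small sketch (function name and the value of the small constant are ours):

```python
import numpy as np

def log_sum_eig(X, eps=1e-3):
    """Log-sum penalty sum_i log(lambda_i + eps), where lambda_i are the
    eigenvalues of X X^T; the low-rank-promoting functional discussed
    above."""
    lam = np.linalg.eigvalsh(X @ X.T)
    return np.sum(np.log(np.maximum(lam, 0.0) + eps))
```

Comparing a full-rank Gaussian matrix against a rank-2 matrix scaled to the same Frobenius norm, the rank-2 matrix attains a much smaller penalty, consistent with the argument above.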
In addition to , the parameter can alternatively be devised to exploit additional prior knowledge about . For example, in some applications such as image inpainting, there is a spatial correlation among neighboring coefficients of . To capture the smoothness between neighboring coefficients, can be set as
where is a second-order difference operator with its th entry given by
Another choice of to promote a smooth solution is the Laplacian matrix , i.e.
where is the adjacency matrix of a graph with its entries given by
, referred to as the degree matrix, is a diagonal matrix with , and is a small positive value to ensure to be full rank.
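For concreteness, the two smoothness-promoting choices can be constructed as follows. The boundary handling of the difference operator and the chain-graph adjacency are illustrative assumptions on our part, since the exact entries in (11) and (13) depend on the application:

```python
import numpy as np

def second_order_diff(m):
    """Second-order difference operator: (Bx)_i = -x_{i-1} + 2x_i - x_{i+1}
    in the interior, a common discrete smoothness penalty.  The exact
    boundary treatment in the paper may differ."""
    B = 2.0 * np.eye(m)
    idx = np.arange(m - 1)
    B[idx, idx + 1] = -1.0
    B[idx + 1, idx] = -1.0
    return B

def chain_laplacian(m, eps=0.01):
    """Graph Laplacian L = D - A of a chain graph, plus eps*I to keep it
    full rank (the value of eps is illustrative)."""
    A = np.zeros((m, m))
    idx = np.arange(m - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    D = np.diag(A.sum(axis=1))
    return D - A + eps * np.eye(m)
```

Both matrices are symmetric; the eps term shifts the Laplacian's zero eigenvalue (the constant vector) away from zero so the scale matrix stays invertible.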
For a positive-definite matrix , the following equality holds
for any .
See Appendix 8.
From Lemma 1, we have
where is the th eigenvalue associated with . We see that maximizing the prior distribution is equivalent to minimizing with respect to . As discussed earlier, this log-sum functional is a sparsity-promoting functional which encourages a sparse solution . As a result, the matrix has a low rank. Since is full rank, this implies that has a low-rank structure. On the other hand, notice that is the first-order approximation of . Therefore minimizing will reduce the value of . Clearly, for defined in (10) and (12), a smoother solution results in a smaller value of . Therefore when is chosen to be (10) or (12), the resulting prior distribution has the potential to encourage a low-rank and smooth solution.
Remarks: Our proposed hierarchical Gaussian prior model can be considered as a generalization of the prior model in . Notice that in , the precision matrix in the prior model is assumed to be a deterministic parameter, whereas it is treated as a random variable and assigned a Wishart prior distribution in our model. This generalization offers more flexibility in modeling the underlying latent matrix. As discussed earlier, the parameter can be devised to capture additional prior knowledge about the latent matrix, and such a careful choice of can help substantially improve the recovery performance, as corroborated by our experimental results.
3 Variational Bayesian Inference
3.1 Review of the Variational Bayesian Methodology
Before proceeding, we firstly provide a brief review of the variational Bayesian (VB) methodology. In a probabilistic model, let and denote the observed data and the hidden variables, respectively. It is straightforward to show that the marginal probability of the observed data can be decomposed into two terms 
where is any probability density function, is the Kullback-Leibler divergence  between and . Since , it follows that is a rigorous lower bound for . Moreover, notice that the left hand side of (16) is independent of . Therefore maximizing is equivalent to minimizing , and thus the posterior distribution can be approximated by through maximizing .
The significance of the above transformation is that it circumvents the difficulty of computing the posterior probability when it is computationally intractable. For a suitable choice of the distribution , the quantity may be more amenable to computation. Specifically, we could assume some specific parameterized functional form for and then maximize with respect to the parameters of the distribution. A particular form of that has been widely used with great success is the form factorized over the component variables in , i.e. . We can therefore compute the posterior distribution approximation by finding a of the factorized form that maximizes the lower bound . The maximization can be conducted in an alternating fashion for each latent variable, which leads to
where denotes the expectation with respect to the distributions for all . By taking the logarithm on both sides of (19), it can be equivalently written as
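The alternating mean-field updates just reviewed can be illustrated on the classic toy problem of inferring the mean and precision of univariate Gaussian data. The sketch below assumes flat/Jeffreys hyperpriors for simplicity, and all names are ours:

```python
import numpy as np

def vb_gaussian(x, n_iters=50):
    """Mean-field VB for x_i ~ N(mu, 1/tau) with q(mu, tau) = q(mu) q(tau),
    updated alternately via log q_j = E_{-j}[log p] + const.
    Jeffreys prior p(tau) ~ 1/tau and a flat prior on mu are assumed."""
    n, xbar = len(x), np.mean(x)
    E_tau = 1.0
    m, v = xbar, 1.0
    for _ in range(n_iters):
        # q(mu) = N(m, v) with v = 1 / (n * E[tau])
        m = xbar
        v = 1.0 / (n * E_tau)
        # q(tau) = Gamma(a, b); E[sum (x_i - mu)^2] adds n*v for q(mu)'s spread
        a = 0.5 * n
        b = 0.5 * (np.sum((x - m) ** 2) + n * v)
        E_tau = a / b
    return m, v, E_tau
```

With enough data the variational posterior concentrates near the true mean and precision, mirroring the KL-minimization argument above.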
3.2 Proposed Algorithm
We now proceed to perform variational Bayesian inference for the proposed hierarchical model. Let denote all hidden variables. Our objective is to find the posterior distribution . Since is usually computationally intractable, we, following the idea of , approximate as which has a factorized form over the hidden variables , i.e.
As mentioned in the previous subsection, the maximization of can be conducted in an alternating fashion for each latent variable, which leads to (details of the derivation can be found in )
where denotes the expectation with respect to (w.r.t.) the distributions . Details of this Bayesian inference scheme are provided next.
1). Update of : The calculation of can be decomposed into a set of independent tasks, with each task computing the posterior distribution approximation for each column of , i.e. . We have
where denotes the th column of and , with being the th column of . From (22), it can be seen that follows a Gaussian distribution
with and given as
We see that to calculate , we need to perform an inverse operation of an matrix which involves a computational complexity of .
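In code, the column-wise Gaussian update takes the following form under one consistent reading of (24)–(25): the posterior precision combines the observed-entry likelihood term with the expected prior precision, and the mean weights the observed data accordingly. Names and signatures are ours:

```python
import numpy as np

def column_posterior(y_l, o_l, E_tau, E_W):
    """Variational Gaussian posterior for one column x_l (illustrative
    form consistent with the model above).  y_l: observed column,
    o_l: binary observation indicator, E_tau: expected noise precision,
    E_W: expected prior precision matrix.
      Sigma_l = (E_tau * diag(o_l) + E_W)^{-1}
      mu_l    = E_tau * Sigma_l @ (o_l * y_l)
    The matrix inverse makes each column update cost O(m^3)."""
    Sigma = np.linalg.inv(E_tau * np.diag(o_l) + E_W)
    mu = E_tau * Sigma @ (o_l * y_l)
    return mu, Sigma
```

When the noise precision dominates the prior, the posterior mean of fully observed columns tracks the data closely, as expected.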
2). Update of : The approximate posterior can be obtained as
From (26), it can be seen that follows a Wishart distribution, i.e.
3). Update of : The variational optimization of yields
where and denote the th entry of and , respectively, is an index set consisting of indices of those observed entries, and is the cardinality of the set , in which denotes the th entry of .
It is easy to verify that follows a Gamma distribution
with the parameters and given respectively by
Some of the expectations and moments used during the update are summarized as
where denotes the th diagonal entry of .
For clarity, we summarize our algorithm as follows.
It can be easily checked that the computational complexity of our proposed method is dominated by the update of the posterior distribution , which requires computing an matrix inverse times and therefore has a computational complexity scaling as . This makes the application of our proposed method to large data sets impractical. To address this issue, in the following, we develop a computationally efficient algorithm which obtains an approximation of by resorting to the generalized approximate message passing (GAMP) technique .
GAMP is a low-complexity Bayesian iterative technique recently developed in [36, 35] for obtaining approximate marginal posteriors. Note that the GAMP algorithm requires that both the prior distribution and the noise distribution have factorized forms . Nevertheless, in our model, the prior distribution has a non-factorizable form, in which case the GAMP technique cannot be directly applied. To address this issue, we first construct a surrogate problem which aims to recover from linear measurements :
where is obtained by performing a singular value decomposition of , is a unitary matrix and is a diagonal matrix with its diagonal elements equal to the singular values of , and denotes the additive Gaussian noise with zero mean and covariance matrix . We assume that entries of are mutually independent and follow the following distribution:
where , , and denote the th entry of , , and , respectively, is a constant, , and are known parameters. It is noted that although is an improper prior distribution, it can often be used provided the corresponding posterior distribution can be correctly normalized . Considering the surrogate problem (37), the posterior distribution of can be calculated as
Taking the logarithm of , we have
where is a diagonal matrix with its th diagonal entry equal to . Clearly, follows a Gaussian distribution with its mean and covariance matrix given by
Comparing (24)–(25) with (41)–(42), we can readily verify that when , , (i.e. ), and , is exactly the desired posterior distribution . Meanwhile, notice that for the surrogate problem (37), both the prior distribution and the noise distribution are factorizable. Hence the GAMP algorithm can be directly applied to (37) to find an approximation of the posterior distribution . By setting , , , , an approximation of in (23) can be efficiently obtained. We now proceed to derive the GAMP algorithm for the surrogate problem (37).
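The equivalence between the surrogate linear model and the original column posterior can be verified numerically. The sketch below stacks the observed-entry rows on top of a square root of the expected precision matrix, so the prior enters through factorized pseudo-measurements; we use a symmetric square root from an eigendecomposition in place of the paper's SVD-based factor, which is an assumption on our part:

```python
import numpy as np

def surrogate_posterior(y_l, o_l, E_tau, E_W):
    """Posterior for the surrogate model z = [diag(o_l); W^{1/2}] x + noise.
    The extra rows carry pseudo-observations 0 with unit precision and
    encode the Gaussian prior through a factorized likelihood, which is
    what makes GAMP applicable."""
    lam, U = np.linalg.eigh(E_W)
    W_half = U @ np.diag(np.sqrt(np.maximum(lam, 0.0))) @ U.T
    m = len(y_l)
    A = np.vstack([np.diag(o_l), W_half])           # stacked measurement matrix
    z = np.concatenate([o_l * y_l, np.zeros(m)])
    prec = np.concatenate([E_tau * np.ones(m), np.ones(m)])
    P = A.T @ (prec[:, None] * A)                   # posterior precision
    Sigma = np.linalg.inv(P)
    mu = Sigma @ (A.T @ (prec * z))
    return mu, Sigma
```

Expanding the stacked products gives posterior precision E_tau*diag(o) + W and mean E_tau*Sigma @ (o*y), i.e. exactly the column posterior of the original model.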
4.1 Solving (37) via GAMP
GAMP was developed in a message passing-based framework. By using central-limit-theorem approximations, message passing between variable nodes and factor nodes can be greatly simplified, and the loopy belief propagation on the underlying factor graph can be efficiently performed. As noted in , the central-limit-theorem approximations become exact in the large-system limit under an i.i.d. zero-mean sub-Gaussian measurement matrix.
Firstly, GAMP approximates the true marginal posterior distribution by
where and are quantities iteratively updated during the iterative process of the GAMP algorithm. Here, we have dropped their explicit dependence on the iteration number for simplicity. For the case , substituting the prior distribution (38) into (43), it can be easily verified that the approximate posterior follows a Gaussian distribution with its mean and variance given respectively as
Another approximation is made to the noiseless output , where denotes the th row of . GAMP approximates the true marginal posterior by
where and are quantities iteratively updated during the iterative process of the GAMP algorithm. Again, here we dropped their explicit dependence on the iteration number . Under the additive white Gaussian noise assumption, we have , where denotes the th diagonal element of . Thus also follows a Gaussian distribution with its mean and variance given by
With the above approximations, we can now define the following two scalar functions: and that are used in the GAMP algorithm. The input scalar function is simply defined as the posterior mean , i.e.
The scaled partial derivative of with respect to is the posterior variance , i.e.
The output scalar function is related to the posterior mean as follows
The partial derivative of is related to the posterior variance in the following way
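For the Gaussian-prior/AWGN case considered here, the scalar functions reduce to closed forms: the input functions are the mean and variance of a Gaussian prior combined with a Gaussian pseudo-measurement, and the output functions follow from the posterior of the noiseless output under additive Gaussian noise. The sketch below gives one standard parameterization; the function names and signatures are ours:

```python
import numpy as np

def gamp_input(r, tau_r, mu0, v0):
    """GAMP input step for a Gaussian prior N(mu0, v0): posterior mean and
    variance of x given pseudo-measurement r with variance tau_r."""
    v = 1.0 / (1.0 / v0 + 1.0 / tau_r)       # posterior variance
    m = v * (mu0 / v0 + r / tau_r)           # posterior mean
    return m, v

def gamp_output(p, tau_p, y, sigma2):
    """GAMP output step under AWGN: with z ~ N(p, tau_p) and y = z + noise
    of variance sigma2, returns g_out = (E[z|y] - p)/tau_p and the
    (negated) derivative (1 - var[z|y]/tau_p)/tau_p."""
    var = 1.0 / (1.0 / tau_p + 1.0 / sigma2)
    mean = var * (p / tau_p + y / sigma2)
    g = (mean - p) / tau_p
    dg = (1.0 - var / tau_p) / tau_p
    return g, dg
```

As sanity checks, a very precise pseudo-measurement pins the input posterior to r, and an output measurement that agrees with the prediction p produces no correction.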
Given the above definitions of and , the GAMP algorithm tailored to the considered problem (37) can now be summarized as follows (details of the derivation of the GAMP algorithm can be found in ), in which denotes the th entry of .
We have now derived an efficient algorithm to obtain an approximate posterior distribution of for (37). Specifically, the true marginal posterior distribution of is approximated by a Gaussian distribution with its mean and variance given by (44)–(45) or (46)–(47), depending on the value of . The joint posterior distribution can be approximated as a product of approximate marginal posterior distributions:
As indicated earlier, by setting , , , and , the posterior distribution obtained via the GAMP algorithm can be used to approximate in (23).
We see that to approximate by using the GAMP, we first need to perform a singular value decomposition (SVD) of , which has a computational complexity of . The GAMP algorithm used to approximate involves very simple matrix-vector multiplications which has a computational complexity scaling as . Therefore the overall computational complexity for updating is of order . In contrast, using (24)–(25) to update requires a computational complexity of . Thus the GAMP technique can help achieve a significant reduction in the computational complexity as compared with a direct calculation of .
For clarity, the VB-GAMP algorithm for matrix completion is summarized as
The proposed method proceeds in a double-loop manner: the outer loop calculates the variational posterior distributions and , and the inner loop computes an approximation of . It is noted that there is no need to wait until GAMP converges. Experimental results show that GAMP provides a reliable approximation of even if only a few iterations are performed. In our experiments, only one iteration is used to implement GAMP.
In this section, we carry out experiments to illustrate the performance of our proposed GAMP-assisted Bayesian matrix completion method with hierarchical Gaussian priors (referred to as BMC-GP-GAMP). Throughout our experiments, the parameters used in our method are set to be and . Here we choose a small in order to encourage a low-rank precision matrix. We compare our method with several state-of-the-art methods, namely, the variational sparse Bayesian learning method (also referred to as VSBL) , which models the low-rankness of the matrix as the structural sparsity of its two factor matrices, the bilinear GAMP-based matrix completion method (also referred to as BiGAMP-MC) , which implements the VSBL using bilinear GAMP, the inexact version of the augmented Lagrange multiplier based matrix completion method (also referred to as ALM-MC) , and a low-rank matrix fitting method (also referred to as LMaFit) , which iteratively minimizes the fitting error and estimates the rank of the matrix. It should be noted that both VSBL and LMaFit require an over-estimated rank to be specified. Note that we did not include  in our experiments due to its prohibitive computational complexity when the matrix dimension is large. Codes of our proposed algorithm along with other competing algorithms are available at http://www.junfang-uestc.net/codes/LRMC.rar, in which the codes of the competing algorithms are obtained from their respective websites.
5.1 Synthetic data
We first examine the effectiveness of our proposed method on synthetic data. We generate the test rank- matrix of size by multiplying by the transpose of , i.e. . All the entries of and are sampled from a normal distribution. We consider the scenarios where () and () entries of are observed. Here denotes the sampling ratio. The success rates as well as the run times of the respective algorithms as a function of the rank of , i.e. , are plotted in Fig. 2. Results are averaged over 25 independent trials. A trial is considered successful if the relative error is smaller than , i.e. , where denotes the estimated matrix. For our proposed method, the matrix parameter is set to . The pre-defined overestimated rank for VSBL and LMaFit is set to be twice the true rank. For the case , VSBL and BiGAMP-MC present the same recovery performance with their curves overlapping each other. From Fig. 2, we can see that
Our proposed method presents the best performance for both sampling ratios. Meanwhile, it has a moderate computational complexity. When the sampling ratio is set to 0.5, our proposed method has a run time similar to that of the ALM-MC method, while providing a clear performance improvement over ALM-MC.
The LMaFit method is the most computationally efficient. But its performance is not as good as our proposed method.
The proposed method outperforms the other two Bayesian methods, namely, the VSBL and the BiGAMP-MC, by a big margin in terms of both recovery accuracy and computational complexity. Since the BiGAMP cannot automatically determine the matrix rank, it needs to try all possible values of the rank, which makes running the BiGAMP-MC time-consuming.
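The synthetic setup above can be reproduced with a few lines: a rank-r product of Gaussian factors, a Bernoulli observation mask, and the relative-error success criterion. The tolerance value is our own placeholder, since the exact threshold is elided in the text:

```python
import numpy as np

def make_test_problem(m, n, r, rho, rng=None):
    """Generate a rank-r test matrix X = A B^T with i.i.d. Gaussian
    factors and a binary mask observing a fraction rho of the entries."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    X = A @ B.T
    O = (rng.random((m, n)) < rho).astype(float)
    return X, O

def is_success(X_hat, X, tol=1e-3):
    """A trial succeeds if the relative error falls below tol
    (tol is an assumed placeholder value)."""
    return np.linalg.norm(X_hat - X) / np.linalg.norm(X) < tol
```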
5.2 Gene data
We carry out experiments on gene data for genotype estimation. The dataset  is a matrix of size provided by the Wellcome Trust Case Control Consortium (WTCCC) and contains the genetic information from chromosome 22. The dataset, which is referred to as “Chr22”, has been shown in  to be approximately low-rank. We randomly select or of the entries of the dataset as observations, and recover the remaining entries using low-rank matrix completion methods. Again, for our proposed method, the matrix parameter is set to . The pre-defined ranks used for VSBL and LMaFit are both set to . Following , we use a metric termed the “allelic-imputation error rate” to evaluate the performance of the respective methods. The error rate is defined as
where and denote the true and the estimated matrices, respectively, the operation returns a matrix with each entry of rounded to its nearest integer, counts the number of non-zero entries of , and denotes the number of unobserved entries. We report the average error rates of the respective algorithms in Table 1. From Table 1, we see that all methods, except the LMaFit method, present similar results, and the proposed method slightly outperforms the other methods when entries are observed. Despite its superior performance on synthetic data, the LMaFit method incurs large estimation errors on this dataset.
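One reading of the allelic-imputation error rate is the fraction of unobserved entries whose rounded estimate disagrees with the true genotype; the sketch below implements that reading (function and argument names are ours):

```python
import numpy as np

def allelic_error_rate(X_hat, X_true, O):
    """Fraction of unobserved entries (O == 0) whose rounded estimate
    differs from the true value."""
    miss = (O == 0)
    wrong = np.round(X_hat[miss]) != X_true[miss]
    return np.count_nonzero(wrong) / np.count_nonzero(miss)
```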
5.3 Collaborative Filtering
In this experiment, we study the performance of the respective methods on the task of collaborative filtering. We use the MovieLens 100k dataset.
where is a set containing the indices of those unobserved available ratings, and and denote the maximum and minimum ratings, respectively. The NMAE results are shown in Table 2, from which we see that the proposed method achieves the most accurate rating prediction when the number of observed ratings is small.
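The NMAE metric can be computed as below; the rating bounds default to the MovieLens 1-5 scale, which is an assumption on our part, and the held-out set is passed as a boolean mask:

```python
import numpy as np

def nmae(X_hat, X_true, test_mask, r_max=5.0, r_min=1.0):
    """Normalized mean absolute error over held-out ratings:
    mean |x_hat - x| on the test set, divided by the rating range."""
    diff = np.abs(X_hat[test_mask] - X_true[test_mask])
    return diff.mean() / (r_max - r_min)
```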
5.4 Image Inpainting
Lastly, we evaluate the performance of different methods on image inpainting. The objective of image inpainting is to complete an image with missing pixels. We conduct experiments on the benchmark images Butterfly and Lena, which are of size and , respectively. In our experiments, we examine the performance of our proposed method under different choices of . As usual, we can set . Such a choice of is referred to as BMC-GP-GAMP-I. We can also set according to (10) and (12), which are respectively referred to as BMC-GP-GAMP-II and BMC-GP-GAMP-III. The parameters and in (12) are set to and , respectively. As discussed earlier in our paper, the latter two choices exploit both the low-rankness and the smoothness of the signal. For the Butterfly image, we consider cases where and of pixels in the image are observed. For the Lena image, we consider cases where and of pixels are observed. We report the peak signal-to-noise ratio (PSNR) as well as the structural similarity (SSIM) index of each algorithm in Table 3. The original image with missing pixels and the images reconstructed by the respective algorithms are shown in Figs. 3, 4, 5, and 6. From Table 3, we see that with a common choice of , our proposed method, BMC-GP-GAMP-I, outperforms the other methods in most cases. When is more carefully devised, our proposed method, i.e. BMC-GP-GAMP-II and BMC-GP-GAMP-III, surpasses the other methods by a substantial margin in terms of both the PSNR and SSIM metrics. This result indicates that a careful choice of that captures both the low-rank structure and the smoothness of the latent matrix can help substantially improve the recovery performance. From the reconstructed images, we also see that our proposed method, especially BMC-GP-GAMP-II and BMC-GP-GAMP-III, provides the best visual quality among all these methods.
The problem of low-rank matrix completion was studied in this paper. A hierarchical Gaussian prior model was proposed to promote the low-rank structure of the underlying matrix, in which columns of the low-rank matrix are assumed to be mutually independent and follow a common Gaussian distribution with zero mean and a precision matrix. The precision matrix is treated as a random parameter, with a Wishart distribution specified as a hyperprior over it. Based on this hierarchical prior model, we developed a variational Bayesian method for matrix completion. To avoid cumbersome matrix inverse operations, the GAMP technique was used and embedded in the variational Bayesian inference, which resulted in an efficient VB-GAMP algorithm. Empirical results on synthetic and real datasets show that our proposed method offers competitive performance for matrix completion, and meanwhile achieves a significant reduction in computational complexity.
We provide a detailed derivation of (7). We have