Bayesian methods for low-rank matrix estimation: short survey and theoretical study
Abstract
This is the corrected version of a paper that was published as
P. Alquier, Bayesian methods for low-rank matrix estimation: short survey and theoretical study, Algorithmic Learning Theory, 2013, S. Jain, R. Munos, F. Stephan and T. Zeugmann Eds., Springer Lecture Notes in Artificial Intelligence n. 8139, pp. 309–323.
Since then, a mistake was found in the proofs. We fixed the mistake at the price of a minor change in the final result.¹ ²
¹ We thank Arnak Dalalyan (ENSAE) who found the mistake.
² At the time of the original publication, the author was a lecturer at UCD Dublin (School of Mathematical Sciences).
The problem of low-rank matrix estimation recently received a lot of attention due to challenging applications. A lot of work has been done on rank-penalized methods [1] and convex relaxation [2], both on the theoretical and applied sides. However, only a few papers have considered Bayesian estimation. In this paper, we review the different types of priors considered on matrices to favour low rank. We also prove that the obtained Bayesian estimators, under suitable assumptions, enjoy the same optimality properties as the ones based on penalization.
Keywords: Bayesian inference, collaborative filtering, reduced-rank regression, matrix completion, PAC-Bayesian bounds, oracle inequalities.
Pierre Alquier
1 Introduction
The problem of low-rank matrix estimation recently received a lot of attention, due to challenging high-dimensional applications provided by recommender systems, see e.g. the NetFlix challenge [3]. Depending on the application, several different models are studied: matrix completion [2], reduced-rank regression [4], trace regression, e.g. [5], quantum tomography, e.g. [6], etc.
In all the above-mentioned papers, the authors considered estimators obtained by minimizing a criterion that is the sum of two terms: a measure of the quality of data fitting, and a penalization term that is added to avoid overfitting. This term is usually the rank of the matrix, as in [1], or, for computational reasons, the nuclear norm of the matrix, as in [2] (the nuclear norm is the sum of the absolute values of the singular values; it can be seen as a matrix equivalent of the ℓ₁ norm of vectors). However, it is to be noted that only a few papers considered Bayesian methods: we mention [7] for a first study of reduced-rank regression in Bayesian econometrics, and more recently [8, 9, 10] for matrix completion and reduced-rank regression (a more exhaustive bibliography is given below).
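As an aside, the two penalties can be compared numerically. The following sketch is illustrative and not from the paper; it simply computes both quantities from a singular value decomposition.

```python
import numpy as np

# Illustrative sketch: the rank penalty of [1] counts nonzero singular
# values, while the nuclear norm of [2] sums them -- the matrix analogue
# of the l0 / l1 contrast for vectors.
rng = np.random.default_rng(0)
# A rank-2 matrix, built as a product of thin factors.
M = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))

s = np.linalg.svd(M, compute_uv=False)   # singular values, nonnegative
rank = int(np.sum(s > 1e-10))            # rank penalty
nuclear_norm = float(s.sum())            # nuclear (trace) norm
```

Minimizing a fit criterion plus `nuclear_norm` is a convex problem, whereas the `rank` penalty is not, which is the computational motivation behind [2].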
The objective of this paper is twofold: first, in Section 2 we provide a short survey of the priors that have been effectively used in various problems of low-rank estimation. We focus on two models, matrix completion and reduced-rank regression, but all the priors can be used in any model involving low-rank matrix estimation.
Then, in Section 3 we prove a theoretical result on the Bayesian estimator in the context of reduced-rank regression. It should be noted that, for some appropriate choice of the hyperparameters, the rate of convergence is the same as for penalized methods, up to logarithmic terms. The theoretical study in the context of matrix completion will be the object of a future work.
2 Model and priors
In this section, we briefly introduce two models: reduced-rank regression (Subsection 2.1) and matrix completion (Subsection 2.2). We then review the priors used in these models (Subsection 2.3).
2.1 Reduced rank regression
In the matrix regression model, we observe two matrices and with
where is a deterministic matrix, is a deterministic matrix and is a random matrix with . The objective is to estimate the parameter matrix . This model is sometimes referred to as multivariate linear regression, matrix regression or multi-task learning. In many applications, it makes sense to assume that the matrix has low rank, i.e. . In this case, the model is known as reduced-rank regression, and was studied as early as [11, 12]. We refer the reader to the monograph [4] for a complete introduction.
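As an illustration, the reduced-rank regression model can be simulated as follows. The dimension names n, p, q and the rank r are illustrative choices, not the paper's notation (the original symbols were lost above).

```python
import numpy as np

# Minimal simulation of reduced-rank regression: Y = X B + E, where the
# coefficient matrix B has low rank. All dimension names are illustrative.
rng = np.random.default_rng(1)
n, p, q, r = 50, 10, 8, 2

X = rng.normal(size=(n, p))                             # deterministic design
B = rng.normal(size=(p, r)) @ rng.normal(size=(r, q))   # rank(B) <= r
E = rng.normal(scale=0.1, size=(n, q))                  # centred noise
Y = X @ B + E
```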
Depending on the application the authors have in mind, additional assumptions on the distribution of the noise matrix are used:

the entries of are i.i.d., and the probability distribution of is bounded, sub-Gaussian or Gaussian. In this case, note that the likelihood of any matrix is given by
where we let denote the Frobenius norm, .

as a generalization of the latter case, it is often assumed in econometrics papers that the rows of are i.i.d. for some variance-covariance matrix .
In order to estimate , we have to specify a prior on and, depending on the assumptions on , a prior on or on . Note however that in most theoretical papers, it is assumed that is known, or can be upper-bounded, as in [1]. This assumption is clearly a limitation, but it makes sense in some applications: see e.g. [6] for quantum tomography (which can be seen as a special case of reduced-rank regression).
2.2 Matrix completion
In the problem of matrix completion, one observes entries of an matrix for in a given set of indices . Here again, the noise matrix satisfies and the objective is to recover under the assumption that . Note that under the assumption that the are i.i.d. , the likelihood is given by
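A hedged sketch of this observation scheme (the index set, matrix sizes and noise level below are all illustrative choices):

```python
import numpy as np

# Matrix completion setup: only entries with indices in a set Omega are
# observed, here with additive Gaussian noise. Unobserved entries are NaN.
rng = np.random.default_rng(2)
m1, m2, r = 20, 15, 2
M = rng.normal(size=(m1, r)) @ rng.normal(size=(r, m2))  # low-rank truth

mask = rng.random((m1, m2)) < 0.3                        # Omega, ~30% observed
noise = rng.normal(scale=0.05, size=(m1, m2))
Y_obs = np.where(mask, M + noise, np.nan)
```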
In [2], this problem is studied without noise (i.e. ); the general case is studied among others in [14, 15, 16].
Note that recently, some authors studied the trace regression model, which includes linear regression, reduced-rank regression and matrix completion as special cases: see [17, 18, 5, 19]. To our knowledge, this model has not been considered from a Bayesian perspective until now, so we will mainly focus on reduced-rank regression and matrix completion in this paper. However, all the priors defined for reduced-rank regression can also be used in the more general trace regression setting.
2.3 Priors on (approximately) low-rank matrices
It appears that some econometrics models can actually be seen as special cases of reduced-rank regression. Some of them were studied from a Bayesian perspective as early as the seventies; to our knowledge, these works contain the first Bayesian studies related to reduced-rank regression.
The first systematic treatment of the reduced rank model from a Bayesian perspective was carried out in [7]. The idea of this paper is to write the matrix parameter as for two matrices and respectively and , and to give a prior on and rather than on . Note that the rank of is in any case smaller than . So, choosing imposes a low-rank structure on the matrix .
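The rank constraint induced by this factorization is easy to check numerically. In this sketch, the sizes and the inner dimension k are illustrative values, not those of [7]:

```python
import numpy as np

# The factorization device of [7]: writing the parameter as a product of
# two thin matrices caps its rank at the inner dimension k, whatever
# values the factors take.
rng = np.random.default_rng(3)
p, q, k = 8, 6, 2

M = rng.normal(size=(p, k))
N = rng.normal(size=(q, k))
B = M @ N.T                       # rank(B) <= k by construction
```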
The prior in [7] is given by
where is a Gaussian shrinkage on all the entries of the matrices:
for some parameter . Then, is an -dimensional inverse-Wishart distribution with degrees of freedom and matrix parameter , :
Remark that this prior is particularly convenient, as it is then possible to give explicit forms for the marginal posteriors. This allows an implementation of the Gibbs algorithm to sample from the posterior. As the formulas are a bit cumbersome, we do not provide them here; the interested reader can find them in [7].
The weak point in this approach is that the question of the choice of the reduced rank is not addressed. It is possible to estimate and for any possible and to use Bayes factors for model selection, as in [26]. Numerical approximation and assessment of convergence for this method are provided by [27].
A more recent approach consists in fixing a large , as , and then calibrating the prior so that it naturally favours matrices with rank smaller than (or matrices really close to such matrices). To our knowledge, the first attempt in this direction is [8]. Note that this paper was actually about matrix completion rather than reduced-rank regression, but once again, all the priors in this subsection can be used in both settings. Here again, we write , and
In other words, if we write and , then the and are independent and respectively and , where is the identity matrix of size . In order to understand the idea behind this prior, assume for a moment that and are large for and very small for . Then, for , and have entries close to , and so . So, the matrix
a matrix that has rank at most . In practice, the choice of the ’s and ’s is the main difficulty of this approach. Based on a heuristic, the authors proposed an estimation of these quantities that seems to perform well in practice. Remark that the authors assume that the are independent and that the parameter is not modelled in the prior (but is still estimated from the data). They finally propose a variational Bayes approach to approximate the posterior.
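The shrinkage mechanism can be illustrated as follows. This is a hedged sketch of the column-wise variance idea of [8], not their actual procedure; the variance values, sizes and detection threshold are illustrative.

```python
import numpy as np

# Column-wise shrinkage: each column of the two factors gets its own
# prior variance. Driving most of these variances to nearly zero makes
# the product matrix approximately low-rank.
rng = np.random.default_rng(4)
p, q, K = 30, 20, 10
col_vars = np.array([1.0, 1.0] + [1e-6] * (K - 2))  # only 2 "active" columns

U = rng.normal(size=(p, K)) * np.sqrt(col_vars)
V = rng.normal(size=(q, K)) * np.sqrt(col_vars)
B = U @ V.T

s = np.linalg.svd(B, compute_uv=False)
effective_rank = int(np.sum(s > 1e-3 * s[0]))       # numerically visible rank
```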
Very similar priors were used by [9] and in the PMF method (Probabilistic Matrix Factorization) of [10]. However, improved versions were proposed in [28, 29, 30, 31]: the authors proposed a full Bayesian treatment of the problem by putting priors on the hyperparameters. We describe more precisely the prior in [28]: the and are independent and respectively and , and then: , , and finally . Here again, the hyperparameters , and are to be specified. The priors in [29, 30] are quite similar, and we give more details about the one in [30] in Section 3. In [10, 28, 29, 30], the authors simulate from the posterior thanks to the Gibbs sampler (the posterior conditional distributions are explicitly provided e.g. in [28]). Alternatively, [9] uses a stochastic gradient descent to approximate the MAP (maximum a posteriori).
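A hedged sketch of the hierarchical idea: the shrinkage variances themselves receive a hyperprior, here Gamma-distributed per-column precisions. The hyperparameter values a0, b0 and the sizes are illustrative, not those of [28].

```python
import numpy as np

# One draw from a hierarchical low-rank prior: sample per-column
# precisions from a Gamma hyperprior, then sample the two factors with
# the corresponding variances, and form their product.
rng = np.random.default_rng(5)
p, q, K = 15, 10, 5
a0, b0 = 1.0, 1.0

gamma_k = rng.gamma(shape=a0, scale=1.0 / b0, size=K)   # column precisions
U = rng.normal(size=(p, K)) / np.sqrt(gamma_k)
V = rng.normal(size=(q, K)) / np.sqrt(gamma_k)
B = U @ V.T                                             # one prior draw
```

In a full Bayesian treatment these precisions are updated from the data (e.g. within a Gibbs sweep), so the effective rank is inferred rather than fixed in advance.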
Some papers proposed a kernelized version of the reduced-rank regression and matrix completion models. Let denote the th row of and the th row of . Then, leads to . We can replace this relation by
for some RKHS kernel . In [32], the authors propose a Bayesian formulation of this model: is seen as a Gaussian process on with expectation zero and covariance function related to the kernel . The same idea is refined in [33] and applied successfully to very large datasets, including the NetFlix challenge dataset, thanks to two algorithms: the Gibbs sampler, and the EM algorithm to approximate the MAP.
Finally, we want to mention the nice theoretical works [34, 35]: in these papers, the authors study the asymptotic performance of Bayesian estimators in the reduced-rank regression model under a general prior that has a compactly supported and infinitely differentiable density. Clearly, the aforementioned priors do not fit the compact support assumption. The question whether algorithmically tractable priors fit this assumption is, to our knowledge, still open. In Section 3, we propose a non-asymptotic analysis of the prior of [30].
3 Theoretical analysis
In this section, we provide a theoretical analysis of the Bayesian estimators obtained by using the idea of hierarchical priors of [28, 29, 30, 31]. More precisely, we use exactly the prior of [30] and provide a theoretical result on the performance of the estimator in the reduced-rank regression model.
Several approaches are available to study the performance of Bayesian estimators: the asymptotic approach based on Bernstein–von Mises type theorems, see Chapter 10 in [36], and a non-asymptotic approach based on PAC-Bayesian inequalities. PAC-Bayesian inequalities were introduced for classification by [37, 38], but tighter bounds and extensions to regression estimation can be found in [39, 40, 41, 42, 43]. In all approaches, the variance of the noise is assumed to be known, or at least upper-bounded by a given constant, so we use this framework here. To our knowledge, this is the first application of PAC-Bayesian bounds to a matrix estimation problem.
3.1 Theorem
Following [30] we write where is , is , and then
for some diagonal matrix
the are i.i.d. and :
where
We will make one of the following assumptions on the noise:

Assumption (A1): the entries of are i.i.d. , and we know an upper bound for .

Assumption (A2): the entries of are i.i.d. according to any distribution supported by the compact interval with a density w.r.t. the Lebesgue measure and , and we know an upper bound .
Note that (A1) and (A2) are special cases of the assumption in [41]; the interested reader can replace these assumptions by the more technical condition given in [41]. We define
where is the probability distribution given by
Note that in the case where the entries of are i.i.d. , this is the Bayesian posterior, , when , and so is the expectation under the posterior. However, for theoretical reasons, we have to consider a slightly smaller to prove our theoretical results.
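The construction can be illustrated on a one-dimensional toy problem: a pseudo-posterior with density proportional to exp(−λ × empirical risk) times the prior, where the "Bayesian" value of λ recovers the usual posterior under Gaussian noise and the theorem takes λ slightly smaller. All numerical values below are illustrative.

```python
import numpy as np

# Toy one-dimensional sketch of the pseudo-posterior: density proportional
# to exp(-lambda * empirical risk) times the prior. With Gaussian noise,
# lambda = n / (2 sigma^2) recovers the usual Bayesian posterior; the
# theory requires a slightly smaller lambda.
rng = np.random.default_rng(6)
sigma, n = 1.0, 40
theta_true = 0.7
y = theta_true + sigma * rng.normal(size=n)

grid = np.linspace(-3.0, 3.0, 2001)
risk = ((y[:, None] - grid[None, :]) ** 2).mean(axis=0)  # empirical L2 risk
prior = np.exp(-grid ** 2 / 2.0)                         # N(0, 1) prior density

lam = 0.9 * n / (2.0 * sigma ** 2)                       # slightly deflated
w = np.exp(-lam * (risk - risk.min())) * prior           # unnormalized density
posterior_mean = float((grid * w).sum() / w.sum())
```

With `lam` deflated by a constant factor, the pseudo-posterior mean stays close to the ordinary posterior mean, which is the behaviour the PAC-Bayesian analysis exploits.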
Theorem 1
Assume that either (A1) or (A2) is satisfied. Let us put and in the prior . For we have, for any ,
Remark 1
Note that when all the entries of satisfy for some , . Moreover, let us assume that and that we can write with and and . Assume that the noise is Gaussian. Taking for example we get
where we remind that . When , we can see that we recover the same upper bound as in [1], up to a logarithmic term. This rate (without the logarithmic term) is known to be optimal, see [1] remark (ii) p. 1293 and [17]. However, the presence of the terms and can lead to suboptimal rates in less classical asymptotics where would grow with the sample size . In the case of linear regression, a way to avoid these terms is to use heavy-tailed priors as in [41, 42], or compactly supported priors as in [44]. However, it is not clear whether this approach would lead to feasible algorithms in matrix estimation problems. This question will be the object of a future work.
Remark 2
We do not claim that the choice is optimal in practice. However, from the proof it is clear that our technique requires that decrease with the dimension of as well as with the sample size to produce a meaningful bound. Note that in [30], there is no theoretical approach for the choice of , but their simulation study tends to show that must be very small for to be approximately low-rank.
Remark 3
In all the above-mentioned papers on PAC-Bayesian bounds, it is assumed that the variance of the noise is known, or upper-bounded by a known constant. More recently, [45] managed to prove PAC-Bayesian inequalities for regression with unknown variance. However, the approach is rather involved and it is not clear whether it can be used in our context. This question will also be addressed in a future work.
3.2 Proof
First, we state the following result:
Theorem 2
Under (A1) or (A2), for any , we have
where stands for the Kullback–Leibler divergence between and : if is absolutely continuous with respect to , and otherwise.
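Explicitly, writing ρ and π for two generic probability measures (the original symbols are assumed here), the Kullback–Leibler divergence used in the bound reads:

```latex
\mathcal{K}(\rho,\pi) =
  \begin{cases}
    \displaystyle\int \log\!\left(\frac{\mathrm{d}\rho}{\mathrm{d}\pi}\right)\mathrm{d}\rho
      & \text{if } \rho \ll \pi,\\[4pt]
    +\infty & \text{otherwise.}
  \end{cases}
```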
Proof of Theorem 2. Follow the proof of Theorem 1 in [41] and check that every step is valid when is a matrix instead of a vector.
We are now ready to prove our main result.
Proof of Theorem 1. Let us introduce, for any , the probability distribution . According to Theorem 2 we have
(1) 
Let us fix , and . The remaining steps of the proof are to upper-bound the two terms on the r.h.s. Both upper bounds will depend on ; we will optimize over after these steps to end the proof. We have, for any ,
(2)  
(3) 
Now, we deal with the second term:
Recall that and , and let denote the subset of such that for . We let denote the cardinality of , . Note that we have . For any , let be the event
Then
(4) 
By symmetry, we will only bound the first of these two terms. We have
(5) 
We lower-bound the three factors in the integral in (5) separately. First, note that, on ,
(6) 
On , and for :
where is the c.d.f. of . We use the classical inequality
to get:
and finally
(7) 
Then, on , and for :
and so
(8) 
We plug (6), (7) and (8) into (5) and we obtain:
Now, let us impose the following restrictions: so the last factor is . So we have:
So,
(9) 
By symmetry,
(10) 
and finally, plugging (9) and (10) into (4)
(11) 
Finally, we can plug (3) and (11) into (1):
Let us put to get:
Finally, remember that the conditions of the theorem impose that , and . However, we have used so far that , that , that , and that . Remember that , so all these conditions are compatible. We obtain:
This ends the proof.
4 Conclusion
We proved that the use of Gaussian priors in reduced-rank regression models leads to nearly optimal rates of convergence. As mentioned in the paper, alternative priors would possibly lead to better bounds, but could also result in less computationally efficient methods (computational efficiency is a major issue when dealing with high-dimensional datasets such as the NetFlix dataset). A complete exploration of this issue will be addressed in future works.
References
 [1] Bunea, F., She, Y., Wegkamp, M.H.: Optimal selection of reduced rank estimators of high-dimensional matrices. The Annals of Statistics 39 (2011) 1282–1309
 [2] Candès, E., Tao, T.: The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory 56 (2009) 2053–2080
 [3] Bennett, J., Lanning, S.: The netflix prize. In: Proceedings of KDD Cup and Workshop 07. (2007)
 [4] Reinsel, G.C., Velu, R.P.: Multivariate reduced-rank regression: theory and applications. Springer Lecture Notes in Statistics 136 (1998)
 [5] Koltchinskii, V., Lounici, K., Tsybakov, A.B.: Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics 39 (2011) 2302–2329
 [6] Alquier, P., Butucea, C., Hebiri, M., Meziani, K., Morimae, T.: Rank-penalized estimation of a quantum system. Preprint arXiv:1206.1711 (2012)
 [7] Geweke, J.: Bayesian reduced rank regression in econometrics. Journal of Econometrics 75 (1996) 121–146
 [8] Lim, Y.J., Teh, Y.W.: Variational Bayesian approach to movie rating prediction. In: Proceedings of KDD Cup and Workshop 07. (2007)
 [9] Lawrence, N.D., Urtasun, R.: Nonlinear matrix factorization with Gaussian processes. In: Proceedings of the 26th annual International Conference on Machine Learning (ICML09), ACM, New York (2009) 601–608
 [10] Salakhutdinov, R., Mnih, A.: Bayesian probabilistic matrix factorization. In Platt, J.C., Koller, D., Singer, Y., Roweis, S., eds.: Advances in Neural Information Processing Systems 20 (NIPS2007), Cambridge, MIT Press (2008)
 [11] Anderson, T.: Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics 22 (1951) 327–351
 [12] Izenman, A.: Reduced rank regression for the multivariate linear model. Journal of Multivariate Analysis 5 (1975) 248–264
 [13] Yuan, M., Ekici, A., Lu, Z., Monteiro, R.: Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society  Series B 69 (2007) 329–346
 [14] Candès, E., Plan, Y.: Matrix completion with noise. Proceedings of the IEEE 98 (2009) 625–636
 [15] Candès, E., Recht, B.: Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9 (2009) 717–772
 [16] Gross, D.: Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory 57 (2011) 1548–1566
 [17] Rohde, A., Tsybakov, A.B.: Estimation of high-dimensional low-rank matrices. The Annals of Statistics 39 (2011) 887–930
 [18] Klopp, O.: Rank-penalized estimators for high-dimensional matrices. Electronic Journal of Statistics 5 (2011) 1161–1183
 [19] Koltchinskii, V.: Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer Lecture Notes in Mathematics (2011)
 [20] Dreze, J.H.: Bayesian limited information analysis of the simultaneous equation model. Econometrica 44 (1976) 1045–1075
 [21] Dreze, J.H., Richard, J.F.: Bayesian analysis of simultaneous equation models. In Griliches, Z. and Intriligator, M.D., eds.: Handbook of Econometrics, vol. 1, North-Holland, Amsterdam (1983)
 [22] Zellner, A., Min, C., Dallaire, D.: Bayesian analysis of simultaneous equation and related models using the Gibbs sampler and convergence checks. H. G. B. Alexander Research Foundation working paper, University of Chicago (1993)
 [23] Kleibergen, F., van Dijk, H.K.: Bayesian simultaneous equation analysis using reduced rank structures. Econometric theory 14 (1998) 699–744
 [24] Bauwens, L., Lubrano, M.: Identification restrictions and posterior densities in cointegrated Gaussian VAR systems. In Fomby, T.M., Carter Hill, R., eds.: Advances in Econometrics, vol. 11(B), JAI Press, Greenwich (1993)
 [25] Kleibergen, F., van Dijk, H.K.: On the shape of the likelihood-posterior in cointegration models. Econometric Theory 10 (1994) 514–551
 [26] Kleibergen, F., Paap, R.: Priors, posteriors and Bayes factors for a Bayesian analysis of cointegration. Journal of Econometrics 111 (2002) 223–249
 [27] Corander, J., Villani, M.: Bayesian assessment of dimensionality in reduced rank regression. Statistica Neerlandica 58 (2004) 255–270
 [28] Salakhutdinov, R., Mnih, A.: Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In: Proceedings of the 25th annual International Conference on Machine Learning (ICML08), ACM, New York (2008)
 [29] Zhou, M., Wang, C., Chen, M., Paisley, J., Dunson, D., Carin, L.: Nonparametric Bayesian matrix completion. In: IEEE Sensor Array and Multichannel Signal Processing Workshop. (2010)
 [30] Babacan, S.D., Luessi, M., Molina, R., Katsaggelos, A.K.: Low-rank matrix completion by variational sparse Bayesian learning. In: IEEE International Conference on Acoustics, Speech and Signal Processing, Prague (Czech Republic) (2011) 2188–2191
 [31] Paisley, J., Carin, L.: A nonparametric Bayesian model for kernel matrix completion. In: Proceedings of ICASSP 2010, Dallas, USA. (2010)
 [32] Yu, K., Tresp, V., Schwaighofer, A.: Learning Gaussian processes for multiple tasks. In: Proceedings of the 22nd annual International Conference on Machine Learning (ICML05). (2005)
 [33] Yu, K., Lafferty, J., Zhu, S., Gong, Y.: Largescale collaborative prediction using a nonparametric random effects model. In: Proceedings of the 26th annual International Conference on Machine Learning (ICML09), ACM, New York (2009)
 [34] Aoyagi, M., Watanabe, S.: The generalization error of reduced rank regression in Bayesian estimation. In: International Symposium on Information Theory and its Applications (ISITA2004), Parma (Italy) (2004)
 [35] Aoyagi, M., Watanabe, S.: Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Networks 18 (2005) 924–933
 [36] van der Vaart, A.W.: Asymptotic Statistics. Cambridge University Press (1998)
 [37] ShaweTaylor, J., Williamson, R.: A PAC analysis of a Bayes estimator. In: Proceedings of the Tenth Annual Conference on Computational Learning Theory, New York, ACM (1997) 2–9
 [38] McAllester, D.A.: Some PAC-Bayesian theorems. In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory (Madison, WI, 1998), ACM (1998) 230–234
 [39] Catoni, O.: Statistical Learning Theory and Stochastic Optimization. Springer Lecture Notes in Mathematics (2004)
 [40] Catoni, O.: PACBayesian Supervised Classification (The Thermodynamics of Statistical Learning). Volume 56 of Lecture NotesMonograph Series. IMS (2007)
 [41] Dalalyan, A.S., Tsybakov, A.B.: Aggregation by exponential weighting, sharp PACBayesian bounds and sparsity. Machine Learning 72 (2008) 39–61
 [42] Dalalyan, A.S., Tsybakov, A.B.: Sparse regression learning by aggregation and Langevin MonteCarlo. J. Comput. System Sci. 78 (2012) 1423–1443
 [43] Dalalyan, A.S., Salmon, J.: Sharp oracle inequalities for aggregation of affine estimators. The Annals of Statistics 40 (2012) 2327–2355
 [44] Alquier, P., Lounici, K.: PACBayesian bounds for sparse regression estimation with exponential weights. Electronic Journal of Statistics 5 (2011) 127–145
 [45] Audibert, J.Y., Catoni, O.: Robust linear least squares regression. The Annals of Statistics 39 (2011) 2766–2794