Bayesian methods for low-rank matrix estimation: short survey and theoretical study

Pierre Alquier CREST, ENSAE, Université Paris Saclay
email: pierre.alquier@ensae.fr
July 13, 2019
Abstract

This is the corrected version of a paper that was published as

P. Alquier, Bayesian methods for low-rank matrix estimation: short survey and theoretical study, Algorithmic Learning Theory, 2013, S. Jain, R. Munos, F. Stephan and T. Zeugmann Eds., Springer - Lecture Notes in Artificial Intelligence n. 8139, pp. 309-323.

Since then, a mistake was found in the proofs. We fixed the mistake at the price of a minor change in the final result. We thank Arnak Dalalyan (ENSAE), who found the mistake. At the time of the original publication, the author was a lecturer at UCD Dublin (School of Mathematical Sciences).

The problem of low-rank matrix estimation recently received a lot of attention due to challenging applications. A lot of work has been done on rank-penalized methods [1] and convex relaxation [2], both on the theoretical and applied sides. However, only a few papers considered Bayesian estimation. In this paper, we review the different types of priors considered on matrices to favour low rank. We also prove that the obtained Bayesian estimators, under suitable assumptions, enjoy the same optimality properties as the ones based on penalization.

Keywords: Bayesian inference, collaborative filtering, reduced-rank regression, matrix completion, PAC-Bayesian bounds, oracle inequalities.


1 Introduction

The problem of low-rank matrix estimation recently received a lot of attention, due to challenging high-dimensional applications provided by recommender systems, see e.g. the NetFlix challenge [3]. Depending on the application, several different models are studied: matrix completion [2], reduced-rank regression [4], trace regression, e.g. [5], quantum tomography, e.g. [6], etc.

In all the above-mentioned papers, the authors considered estimators obtained by minimizing a criterion that is the sum of two terms: a measure of the quality of data fitting, and a penalization term that is added to avoid overfitting. This term is usually the rank of the matrix, as in [1], or, for computational reasons, the nuclear norm of the matrix, as in [2] (the nuclear norm is the sum of the singular values; it can be seen as a matrix analogue of the $\ell_1$ norm for vectors). However, it is to be noted that only a few papers considered Bayesian methods: we mention [7] for a first study of reduced-rank regression in Bayesian econometrics, and more recently [8, 9, 10] for matrix completion and reduced-rank regression (a more exhaustive bibliography is given below).

The objective of this paper is twofold: first, in Section 2 we provide a short survey of the priors that have been effectively used in various problems of low-rank estimation. We focus on two models, matrix completion and reduced-rank regression, but all the priors can be used in any model involving low-rank matrix estimation.

Then, in Section 3 we prove a theoretical result on the Bayesian estimator in the context of reduced rank regression. It should be noted that, for an appropriate choice of the hyperparameters, the rate of convergence is the same as for penalized methods, up to logarithmic terms. The theoretical study in the context of matrix completion will be the object of future work.

2 Model and priors

In this section, we briefly introduce two models: reduced rank regression (Subsection 2.1) and matrix completion (Subsection 2.2). We then review the priors used in these models (Subsection 2.3).

2.1 Reduced rank regression

In the matrix regression model, we observe two matrices and with

where is an deterministic matrix, is a deterministic matrix and is an random matrix with . The objective is to estimate the parameter matrix . This model is sometimes referred to as multivariate linear regression, matrix regression or multitask learning. In many applications, it makes sense to assume that the matrix has low rank, i.e. . In this case, the model is known as reduced rank regression, and was studied as early as [11, 12]. We refer the reader to the monograph [4] for a complete introduction.
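To fix ideas, here is a minimal simulation sketch of this model in Python; the dimensions, the rank and the noise level below are hypothetical choices and are not notation taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n observations, p predictors, l responses, true rank r.
n, p, l, r = 200, 30, 20, 3
sigma = 0.5  # hypothetical noise standard deviation

X = rng.normal(size=(n, p))                              # design matrix
B = rng.normal(size=(p, r)) @ rng.normal(size=(r, l))    # rank-r parameter matrix
E = sigma * rng.normal(size=(n, l))                      # i.i.d. centred Gaussian noise
Y = X @ B + E                                            # observed response matrix

print(np.linalg.matrix_rank(B))  # 3 with probability one: the product has rank at most r
```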

Depending on the application the authors have in mind, additional assumptions on the distribution of the noise matrix are used:

  • the entries of are i.i.d., and the probability distribution of is bounded, sub-Gaussian or Gaussian. In this case, note that the likelihood of any matrix is given by

    where we let denote the Frobenius norm, defined for any matrix $A$ by $\|A\|_F = (\sum_{i,j} A_{i,j}^2)^{1/2}$ (the Gaussian case is written out explicitly just after this list).

  • as a generalization of the latter case, it is often assumed in econometrics papers that the rows of are i.i.d. for some variance-covariance matrix .
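For the Gaussian case of the first item, writing the model (in hypothetical notation) as $Y = XB + E$ with $Y$ of size $m \times l$ and noise variance $\sigma^2$, the likelihood of a candidate matrix $B$ takes the standard form
\[
\mathcal{L}(B) = (2\pi\sigma^2)^{-ml/2} \exp\left(-\frac{\|Y - XB\|_F^2}{2\sigma^2}\right).
\]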

In order to estimate , we have to specify a prior on and, depending on the assumptions on , a prior on or on . Note however that in most theoretical papers, it is assumed that is known, or can be upper bounded, as in [1]. This assumption is clearly a limitation, but it makes sense in some applications: see e.g. [6] for quantum tomography (which can be seen as a special case of reduced rank regression).

In non-Bayesian studies, the estimator considered is usually obtained by minimizing the least-squares criterion penalized by the rank of the matrix [1] or by the nuclear norm [13]. In [1], the estimator obtained by this method is shown to satisfy, for some constant ,

(Corollary 6 p. 1290).

2.2 Matrix completion

In the problem of matrix completion, one observes entries of an matrix for in a given set of indices . Here again, the noise matrix satisfies and the objective is to recover under the assumption that . Note that under the assumption that the are i.i.d. , the likelihood is given by

In [2], this problem is studied in the noiseless case; the general case is studied, among others, in [14, 15, 16].
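A minimal simulation sketch of the noisy matrix completion setting; the sizes, the rank, the noise level and the proportion of observed entries below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: an m1 x m2 matrix of rank r, observed on a random index set Omega.
m1, m2, r = 100, 80, 4
sigma, prop_observed = 0.1, 0.3

M = rng.normal(size=(m1, r)) @ rng.normal(size=(r, m2))   # low-rank matrix to recover
mask = rng.random((m1, m2)) < prop_observed               # Omega: which entries are seen
noisy = M + sigma * rng.normal(size=(m1, m2))             # noisy version of the matrix
Omega = np.argwhere(mask)                                  # observed index pairs (i, j)
observations = noisy[mask]                                 # the data actually available
```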

Note that recently, some authors studied the trace regression model, which includes linear regression, reduced-rank regression and matrix completion as special cases: see [17, 18, 5, 19]. To our knowledge, this model has not been considered from a Bayesian perspective until now, so we will mainly focus on reduced-rank regression and matrix completion in this paper. However, all the priors defined for reduced-rank regression can also be used in the more general trace regression setting.

2.3 Priors on (approximately) low-rank matrices

It appears that some econometrics models can actually be seen as special cases of the reduced rank regression. Some of them were studied from a Bayesian perspective as early as the seventies; to our knowledge, these were the first Bayesian studies of a reduced rank regression:

  • incomplete simultaneous equation model: [20, 21, 22, 23],

  • cointegration: [24, 25, 26].

The first systematic treatment of the reduced rank model from a Bayesian perspective was carried out in [7]. The idea of this paper is to write the matrix parameter as a product of two matrices and to put a prior on these two factors rather than on the parameter matrix itself. Note that the rank of such a product is in any case at most the common inner dimension of the factors, so choosing this dimension small imposes a low-rank structure on the matrix.

The prior in [7] is given by

where is a Gaussian shrinkage on all the entries of the matrices:

for some parameter . Then, is an -dimensional inverse-Wishart distribution with degrees of freedom and matrix parameter , :

Remark that this prior is particularly convenient, as it is then possible to give explicit forms for the marginal posteriors. This allows an implementation of the Gibbs algorithm to sample from the posterior. As the formulas are a bit cumbersome, we do not provide them here; the interested reader can find them in [7].
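For concreteness, here is a minimal sketch of sampling from a prior of this general shape: Gaussian shrinkage on the entries of the two factors, and an inverse-Wishart distribution on the noise covariance of the rows. The dimensions, the shrinkage variance and the inverse-Wishart hyperparameters are hypothetical, and the sketch does not claim to reproduce the exact specification of [7].

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(3)

p, l, k = 20, 15, 3         # hypothetical dimensions; k is the imposed maximal rank
tau2 = 1.0                  # hypothetical shrinkage variance on the factor entries
nu, S = l + 2, np.eye(l)    # hypothetical inverse-Wishart degrees of freedom and scale

M = rng.normal(scale=np.sqrt(tau2), size=(p, k))   # Gaussian shrinkage on the first factor
N = rng.normal(scale=np.sqrt(tau2), size=(k, l))   # and on the second factor
B = M @ N                                          # induced parameter matrix, rank <= k
Sigma = invwishart.rvs(df=nu, scale=S)             # noise covariance of the rows
```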

The weak point in this approach is that the question of the choice of the reduced rank is not addressed. It is possible to estimate and for any possible and to use Bayes factors for model selection, as in [26]. Numerical approximation and assessment of convergence for this method are provided by [27].

A more recent approach consists in fixing a large value for this dimension, such as , and then in calibrating the prior so that it naturally favours matrices with rank smaller than (or matrices really close to such matrices). To our knowledge, the first attempt in this direction is [8]. Note that this paper was actually about matrix completion rather than reduced rank regression, but once again, all the priors in this subsection can be used in both settings. Here again, we write , and

In other words, if we write and , then the and are independent and respectively and , where is the identity matrix of size . In order to understand the idea behind this prior, assume for a moment that and are large for and very small for . Then, for , and have entries close to , and so the matrix is close to a matrix that has rank at most . In practice, the choice of the 's and 's is the main difficulty of this approach. Based on a heuristic, the authors proposed an estimation of these quantities that seems to perform well in practice. Remark that the authors assume that the are independent and that the parameter is not modelled in the prior (but is still estimated from the data). They finally propose a variational Bayes approach to approximate the posterior.
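The mechanism can be checked numerically: drawing the factors with column variances that decay quickly (the sizes and the variance sequence below are hypothetical), the resulting product is numerically close to a low-rank matrix, as its singular values show.

```python
import numpy as np

rng = np.random.default_rng(4)

m1, m2, k = 60, 50, 10
# Hypothetical column variances: large for the first 3 columns, tiny afterwards.
col_var = np.array([1.0, 1.0, 1.0] + [1e-6] * (k - 3))

U = rng.normal(size=(m1, k)) * np.sqrt(col_var)   # column j of U has variance col_var[j]
V = rng.normal(size=(m2, k)) * np.sqrt(col_var)   # same for V
B = U @ V.T

print(np.round(np.linalg.svd(B, compute_uv=False)[:6], 4))
# Only about three singular values are far from zero: B is approximately of rank 3.
```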

Very similar priors were used by [9] and in the PMF method (Probabilistic Matrix Factorisation) of [10]. However, improved versions were proposed in [28, 29, 30, 31]: the authors proposed a full Bayesian treatment of the problem by putting priors on the hyperparameters. We describe more precisely the prior in [28]: the and are independent and respectively and , and then: , , and finally . Here again, the hyperparameters , and are to be specified. The priors in [29, 30] are quite similar, and we give more details about the one in [30] in Section 3. In [10, 28, 29, 30], the authors simulate from the posterior thanks to the Gibbs sampler (the posterior conditional distributions are explicitly provided e.g. in [28]). Alternatively, [9] uses a stochastic gradient descent to approximate the MAP (maximum a posteriori).
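As an illustration of this fully Bayesian treatment, here is a minimal sketch of sampling from a hierarchical prior in the spirit of [28], with a Gaussian-Wishart hyperprior on the mean and precision of the factor rows; all hyperparameter values are hypothetical and the construction is simplified with respect to [28].

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(5)

m1, m2, k = 30, 25, 5
mu0, beta0, nu0, W0 = np.zeros(k), 2.0, k + 2, np.eye(k)   # hypothetical hyperparameters

def sample_factor(n_rows):
    """Draw (precision, mean) from a Gaussian-Wishart hyperprior, then Gaussian rows."""
    Lam = wishart.rvs(df=nu0, scale=W0)                        # row precision matrix
    mu = rng.multivariate_normal(mu0, np.linalg.inv(beta0 * Lam))
    return rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n_rows)

U, V = sample_factor(m1), sample_factor(m2)
B = U @ V.T   # one draw of the mean matrix under this hierarchical prior
```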

Some papers proposed a kernelized version of the reduced rank regression and matrix completion models. Let denote the -th row of and the -th row of . Then, leads to . We can replace this relation by

for some RKHS kernel . In [32], the authors propose a Bayesian formulation of this model: is seen as a Gaussian process on with expectation zero and covariance function related to the kernel . The same idea is refined in [33] and applied successfully to very large datasets, including the NetFlix challenge dataset, thanks to two algorithms: the Gibbs sampler, and the EM algorithm to approximate the MAP.
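A generic illustration of the Gaussian-process viewpoint (not the exact models of [32] or [33]): below, each column of the mean matrix is drawn from a zero-mean Gaussian process over hypothetical one-dimensional row covariates with an RBF kernel, and the smoothness of the kernel makes the resulting matrix approximately low-rank.

```python
import numpy as np

rng = np.random.default_rng(6)

m1, m2, length_scale = 80, 40, 0.3
x = np.linspace(0.0, 1.0, m1)                  # hypothetical covariate attached to each row
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / length_scale**2)  # RBF kernel matrix
K += 1e-8 * np.eye(m1)                         # jitter for numerical stability

# Each column of B is an independent Gaussian-process draw with covariance K.
B = rng.multivariate_normal(np.zeros(m1), K, size=m2).T

sv = np.linalg.svd(B, compute_uv=False)
print(np.sum(sv > 1e-3 * sv[0]))   # few singular values dominate: B is nearly low-rank
```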

Finally, we want to mention the nice theoretical work in [34, 35]: in these papers, the authors study the asymptotic performance of Bayesian estimators in the reduced rank regression model under a general prior that has a compactly supported and infinitely differentiable density. Clearly, the aforementioned priors do not fit the compact support assumption. The question whether algorithmically tractable priors fit this assumption is, to our knowledge, still open. In Section 3, we propose a non-asymptotic analysis of the prior of [30].

3 Theoretical analysis

In this section, we provide a theoretical analysis of the Bayesian estimators obtained by using the idea of hierarchical priors of  [28, 29, 30, 31]. More precisely, we use exactly the prior of [30] and provide a theoretical result on the performance of the estimator in the reduced-rank regression model.

Several approaches are available to study the performance of Bayesian estimators: the asymptotic approach based on Bernstein-von Mises type theorems, see Chapter 10 in [36], and a non-asymptotic approach based on PAC-Bayesian inequalities. PAC-Bayesian inequalities were introduced for classification by [37, 38], but tighter bounds and extensions to regression estimation can be found in [39, 40, 41, 42, 43]. In all approaches, the variance of the noise is assumed to be known, or at least upper-bounded by a given constant, so we use this framework here. To our knowledge, this is the first application of PAC-Bayesian bounds to a matrix estimation problem.

3.1 Theorem

Following [30] we write where is , is , and then

for some diagonal matrix

the are i.i.d. and :

where

We will make one of the following assumptions on the noise:

  • Assumption (A1): the entries of are i.i.d. , and we know an upper bound for .

  • Assumption (A2): the entries of are i.i.d. according to any distribution supported by the compact interval with a density w.r.t. the Lebesgue measure and , and we know an upper bound .

Note that (A1) and (A2) are special cases of the assumption in [41]; the interested reader can replace these assumptions by the more technical condition given in [41]. We define

where is the probability distribution given by

Note that in the case where the entries of are i.i.d. , this is the Bayesian posterior when , and so is the expectation under the posterior. However, for theoretical reasons, we have to take a slightly smaller value to prove our results.
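For orientation, exponentially weighted pseudo-posteriors of this kind typically take the following form, written here with reconstructed (hypothetical) notation: $\pi$ for the prior, $\lambda > 0$ for the temperature parameter, and a Frobenius-norm fit term,
\[
\hat\rho_\lambda(\mathrm{d}B) \;\propto\; \exp\!\bigl(-\lambda \,\|Y - XB\|_F^2\bigr)\,\pi(\mathrm{d}B),
\qquad
\hat B_\lambda = \int B\,\hat\rho_\lambda(\mathrm{d}B),
\]
so that, when the noise entries are i.i.d. $\mathcal{N}(0,\sigma^2)$, the choice $\lambda = 1/(2\sigma^2)$ recovers the usual posterior and $\hat B_\lambda$ the posterior mean.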

Theorem 1

Assume that either (A1) or (A2) is satisfied. Let us put and in the prior . For we have, for any ,

Remark 1

Note that when all the entries of satisfy for some , . Moreover, let us assume that and that we can write with and and . Assume that the noise is Gaussian. Taking for example we get

where we remind that . When , we can see that we recover the same upper bound as in [1], up to a logarithmic term. This rate (without the logarithmic term) is known to be optimal, see [1], remark (ii) p. 1293, and [17]. However, the presence of the terms and can lead to suboptimal rates in less classical asymptotics where would grow with the sample size . In the case of linear regression, a way to avoid these terms is to use heavy-tailed priors as in [41, 42], or compactly supported priors as in [44]. However, it is not clear whether this approach would lead to feasible algorithms in matrix estimation problems. This question will be the object of future work.

Remark 2

We do not claim that the choice is optimal in practice. However, from the proof it is clear that our technique requires that decreases with the dimension of as well as with the sample size to produce a meaningful bound. Note that in [30], there is no theoretical approach for the choice of , but their simulation study tends to show that must be very small for to be approximately low-rank.

Remark 3

In all the above-mentioned papers on PAC-Bayesian bounds, it is assumed that the variance of the noise is known, or upper-bounded by a known constant. More recently, [45] managed to prove PAC-Bayesian inequalities for regression with unknown variance. However, the approach is rather involved and it is not clear whether it can be used in our context. This question will also be addressed in future work.

3.2 Proof

First, we state the following result:

Theorem 2

Under (A1) or (A2), for any , we have

where stands for the Kullback-Leibler divergence between and , defined as the integral of the logarithm of the density ratio if is absolutely continuous with respect to and equal to $+\infty$ otherwise.

Proof of Theorem 2. Follow the proof of Theorem 1 in [41] and check that every step is valid when is a matrix instead of a vector.
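To fix ideas, the bounds of [41] (and hence Theorem 2) are, in spirit, oracle inequalities of the following generic shape, written with reconstructed notation ($\bar B$ for the true parameter, $\hat\rho_\lambda$, $\hat B_\lambda$ and $\pi$ as in the sketch above) and valid under conditions requiring $\lambda$ to be at most of the order of $1/\sigma^2$:
\[
\mathbb{E}\,\bigl\|X\hat B_\lambda - X\bar B\bigr\|_F^2
\;\le\;
\inf_{\rho}\left\{\int \|XB - X\bar B\|_F^2\,\rho(\mathrm{d}B) \;+\; \frac{\mathcal{K}(\rho,\pi)}{\lambda}\right\},
\]
where the infimum runs over all probability distributions $\rho$ absolutely continuous with respect to $\pi$; this is a generic template rather than the exact statement of Theorem 2.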

We are now ready to prove our main result.

Proof of Theorem 1. Let us introduce, for any , the probability distribution . According to Theorem 2 we have

(1)

Let us fix , and . The remaining steps of the proof are to upper-bound the two terms in the right-hand side. Both upper bounds will depend on ; we will optimize on it after these steps to end the proof. We have, for any ,

(2)
(3)

Now, we deal with the second term:

We remind that and and let us denote the subset of such that for . We let denote the cardinality of , . Note that we have . For any let be the event

Then

(4)

By symmetry, we will only bound the first of these two terms. We have

(5)

We lower-bound the three factors in the integral in (5) separately. First, note that, on ,

(6)

On , and for :

where is the c.d.f. of . We use the classical inequality

to get:

and finally

(7)

Then, on , and for :

and so

(8)

We plug (6), (7) and (8) into (5) and we obtain:

Now, let us impose the following restrictions: so that the last factor is . We then have:

So,

(9)

By symmetry,

(10)

and finally, plugging (9) and (10) into (4)

(11)

Finally, we can plug (3) and (11) into (1):

Let us put to get:

Finally, remember that the conditions of the theorem impose that , and . However, we used until now that , that , that , and that . Remember that , so all these conditions are compatible. We obtain:

This ends the proof.

4 Conclusion

We proved that the use of Gaussian priors in reduced-rank regression models leads to nearly optimal rates of convergence. As mentioned in the paper, alternative priors would possibly lead to better bounds, but could also result in less computationally efficient methods (computational efficiency is a major issue when dealing with high-dimensional datasets such as the NetFlix dataset). A complete exploration of this issue will be addressed in future work.

References

  • [1] Bunea, F., She, Y., Wegkamp, M.H.: Optimal selection of reduced rank estimators of high-dimensional matrices. The Annals of Statistics 39 (2011) 1282–1309
  • [2] Candès, E., Tao, T.: The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory 56 (2009) 2053–2080
  • [3] Bennett, J., Lanning, S.: The Netflix prize. In: Proceedings of KDD Cup and Workshop 07. (2007)
  • [4] Reinsel, G.C., Velu, R.P.: Multivariate reduced-rank regression: theory and applications. Springer Lecture Notes in Statistics 136 (1998)
  • [5] Koltchinskii, V., Lounici, K., Tsybakov, A.B.: Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics 39 (2011) 2302–2329
  • [6] Alquier, P., Butucea, C., Hebiri, M., Meziani, K., Morimae, T.: Rank-penalized estimation of a quantum system. Preprint arXiv:1206.1711 (2012)
  • [7] Geweke, J.: Bayesian reduced rank regression in econometrics. Journal of Econometrics 75 (1996) 121–146
  • [8] Lim, Y.J., Teh, Y.W.: Variational Bayesian approach to movie rating prediction. In: Proceedings of KDD Cup and Workshop 07. (2007)
  • [9] Lawrence, N.D., Urtasun, R.: Non-linear matrix factorization with Gaussian processes. In: Proceedings of the 26th annual International Conference on Machine Learning (ICML09), ACM, New York (2009) 601–608
  • [10] Salakhutdinov, R., Mnih, A.: Bayesian probabilistic matrix factorization. In Platt, J.C., Koller, D., Singer, Y., Roweis, S., eds.: Advances in Neural Information Processing Systems 20 (NIPS2007), Cambridge, MIT Press (2008)
  • [11] Anderson, T.: Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics 22 (1951) 327–351
  • [12] Izenman, A.: Reduced rank regression for the multivariate linear model. Journal of Multivariate Analysis 5 (1975) 248–264
  • [13] Yuan, M., Ekici, A., Lu, Z., Monteiro, R.: Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society - Series B 69 (2007) 329–346
  • [14] Candès, E., Plan, Y.: Matrix completion with noise. Proceedings of the IEEE 98 (2009) 625–636
  • [15] Candès, E., Recht, B.: Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9 (2009) 717–772
  • [16] Gross, D.: Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory 57 (2011) 1548–1566
  • [17] Rohde, A., Tsybakov, A.B.: Estimation of high-dimensional low-rank matrices. The Annals of Statistics 39 (2011) 887–930
  • [18] Klopp, O.: Rank-penalized estimators for high-dimensional matrices. Electronic Journal of Statistics 5 (2011) 1161–1183
  • [19] Koltchinskii, V.: Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer Lecture Notes in Mathematics (2011)
  • [20] Dreze, J.H.: Bayesian limited information analysis of the simultaneous equation model. Econometrica 44 (1976) 1045–1075
  • [21] Dreze, J.H., Richard, J.F.: Bayesian analysis of simultaneous equation models. In Griliches, Z. and Intriligator, J.F., eds.: Handbook of econometrics, vol. 1, North-Holland, Amsterdam (1983)
  • [22] Zellner, A., Min, C., Dallaire, D.: Bayesian analysis of simultaneous equation and related models using the Gibbs sampler and convergence checks. H. G. B. Alexander Research Foundation working paper, University of Chicago (1993)
  • [23] Kleibergen, F., van Dijk, H.K.: Bayesian simultaneous equation analysis using reduced rank structures. Econometric theory 14 (1998) 699–744
  • [24] Bauwens, L., Lubrano, M.: Identification restriction and posterior densities in cointegrated Gaussian VAR systems. In Fomby, T.M., Carter Hill, R., eds.: Advances in econometrics, vol. 11(B), JAI Press, Greenwich (1993)
  • [25] Kleibergen, F., van Dijk, H.K.: On the shape of the likelihood-posterior in cointegration models. Econometric theory 10 (1994) 514–551
  • [26] Kleibergen, F., Paap, R.: Priors, posteriors and Bayes factors for a Bayesian analysis of cointegration. Journal of Econometrics 111 (2002) 223–249
  • [27] Corander, J., Villani, M.: Bayesian assessment of dimensionality in reduced rank regression. Statistica Neerlandica 58 (2004) 255–270
  • [28] Salakhutdinov, R., Mnih, A.: Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In: Proceedings of the 25th annual International Conference on Machine Learning (ICML08), ACM, New York (2008)
  • [29] Zhou, M., Wang, C., Chen, M., Paisley, J., Dunson, D., Carin, L.: Nonparametric Bayesian matrix completion. In: IEEE Sensor Array and Multichannel Signal Processing Workshop. (2010)
  • [30] Babacan, S.D., Luessi, M., Molina, R., Katsaggelos, A.K.: Low-rank matrix completion by variational sparse Bayesian learning. In: IEEE International Conference on Acoustics, Speech and Signal Processing, Prague (Czech Republic) (2011) 2188–2191
  • [31] Paisley, J., Carin, L.: A nonparametric Bayesian model for kernel matrix completion. In: Proceedings of ICASSP 2010, Dallas, USA. (2010)
  • [32] Yu, K., Tresp, V., Schwaighofer, A.: Learning Gaussian processes for multiple tasks. In: Proceedings of the 22nd annual International Conference on Machine Learning (ICML05). (2005)
  • [33] Yu, K., Lafferty, J., Zhu, S., Gong, Y.: Large-scale collaborative prediction using a non-parametric random effects model. In: Proceedings of the 26th annual International Conference on Machine Learning (ICML09), ACM, New York (2009)
  • [34] Aoyagi, M., Watanabe, S.: The generalization error of reduced rank regression in Bayesian estimation. In: International Symposium on Information Theory and its Applications (ISITA2004), Parma (Italy) (2004)
  • [35] Aoyagi, M., Watanabe, S.: Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Networks 18 (2005) 924–933
  • [36] van der Vaart, A.W.: Asymptotic Statistics. Cambridge University Press (1998)
  • [37] Shawe-Taylor, J., Williamson, R.: A PAC analysis of a Bayes estimator. In: Proceedings of the Tenth Annual Conference on Computational Learning Theory, New York, ACM (1997) 2–9
  • [38] McAllester, D.A.: Some PAC-Bayesian theorems. In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory (Madison, WI, 1998), ACM (1998) 230–234
  • [39] Catoni, O.: Statistical Learning Theory and Stochastic Optimization. Springer Lecture Notes in Mathematics (2004)
  • [40] Catoni, O.: PAC-Bayesian Supervised Classification (The Thermodynamics of Statistical Learning). Volume 56 of Lecture Notes-Monograph Series. IMS (2007)
  • [41] Dalalyan, A.S., Tsybakov, A.B.: Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning 72 (2008) 39–61
  • [42] Dalalyan, A.S., Tsybakov, A.B.: Sparse regression learning by aggregation and Langevin Monte-Carlo. J. Comput. System Sci. 78 (2012) 1423–1443
  • [43] Dalalyan, A.S., Salmon, J.: Sharp oracle inequalities for aggregation of affine estimators. The Annals of Statistics 40 (2012) 2327–2355
  • [44] Alquier, P., Lounici, K.: PAC-Bayesian bounds for sparse regression estimation with exponential weights. Electronic Journal of Statistics 5 (2011) 127–145
  • [45] Audibert, J.Y., Catoni, O.: Robust linear least squares regression. The Annals of Statistics 39 (2011) 2766–2794