Recommendation on a Budget: Column Space Recovery from Partially Observed Entries with Random or Active Sampling

Abstract

We analyze alternating minimization for column space recovery of a partially observed, approximately low-rank matrix with a growing number of columns and a fixed budget of observations per column. We prove that if the budget is greater than the rank of the matrix, column space recovery succeeds: as the number of columns grows, the estimate from alternating minimization converges to the true column space with probability tending to one. From our proof techniques, we naturally formulate an active sampling strategy for choosing entries of a column that is theoretically and empirically (on synthetic and real data) better than the commonly studied uniformly random sampling strategy.

1 Introduction

In many applications, we have data in the form of an incomplete matrix, where one dimension grows and the other dimension is fixed. For instance, in recommendation systems, there is a fixed set of potential products (rows of a matrix) to offer customers that arrive over time (columns of a matrix). Three other applications are choosing machine learning models (rows) for each new customer’s dataset (columns) [11], choosing which survey questions (rows) to ask to respondents (columns) that arrive sequentially [36], or choosing which lab tests (rows) to order for each new patient (columns) [18]. In these cases, there is an inherent asymmetry with respect to the dimensions in the budget: we have a budget over each column, not over each row. We could choose any machine learning model and recommend it for each dataset, or choose any survey question and give it to every user, but it is very hard to run every machine learning pipeline on an arbitrary dataset, or to give every survey question to an arbitrary respondent (indeed, in [36], users omitting too many answers was the precise motivation for their problem). Similarly, running all lab tests on one patient significantly exceeds the time and cost budget per patient.

In these applications, we are often interested in approximately recovering the column space of a matrix, or equivalently, the subspace spanned by the top principal components of a data matrix. This subspace would give insights as to which machine learning models tend to perform better, which questions are most informative to ask in a survey, or which lab tests would be most valuable to order.

In particular, for a matrix that has approximately low rank r, we are interested in the case where we have a fixed number k of entries that are sampled for each new column. We can then pose the following questions: is it possible to recover the column space, with growing accuracy and higher confidence as the number of columns increases? And if we learn the column space more accurately, does this lead to better imputation of the matrix?

In this work, we show that for an approximately rank-r matrix with N rows and T columns, when we have a budget of k > r observations per column, we can recover the column space with probability tending to one (as T grows) using alternating minimization when samples are randomly selected. Moreover, we establish theoretically and experimentally that an active learning strategy can help learn this subspace faster. We also show experimentally that more accurate column space recovery can lead to more accurate matrix completion.

1.1 Related Works

There are two natural ways to approach column space recovery with random sampling, which leads to two areas of related works: using the empirical covariance matrix, or using matrix completion results.

One approach, typically taken in the streaming PCA literature, is to assume that columns are i.i.d. and use the empirical covariance matrix of the columns to estimate the true covariance [31, 15, 32]. We can then use the column space of this estimated covariance matrix. This approach works, but it loses efficiency due to rescaling: for instance, if every entry is observed with probability q, then because each (off-diagonal) entry of the empirical covariance matrix is the product of two observed entries of the original matrix, each off-diagonal entry of the empirical covariance matrix is observed with probability q². Therefore, this approach pays a q² penalty instead of a q penalty in terms of missingness. Moreover, while matrix completion approaches can have a log(1/ϵ) dependence on the desired accuracy ϵ (in the low noise regime) for sample complexity, passing through the empirical covariance matrix naturally results in a 1/ϵ² penalty [31, 15, 32]. Other work [9] in the streaming PCA literature avoids covariance estimation using a least squares approach (similar to ours), but does not prove convergence to the true subspace.

Another approach would be to rely on powerful results in matrix completion (see, for instance, [4, 3, 5, 26, 33, 2, 6, 19, 16, 23, 24, 14]). However, there is no straightforward way to do this. For instance, one might think one could first perform matrix completion on the partially observed matrix, and then use its singular value decomposition to recover the column space. However, for an N × T matrix whose rank is r, matrix completion results typically require a total number of observations that is superlinear in T. Exceptions to the superlinear (in T) number of total observations [27, 28] violate our per-column budget or require a higher per-column budget for higher accuracy [12]. This means that in order to get the desired guarantees from the matrix completion literature, we would need to observe an increasing number of entries per column. This is not a natural model for the budgeted learning case (there is no reason to assume that our budget increases with time) and is unnecessary, as we show in our theoretical results. Another way to try to apply these matrix completion results is to split the N × T matrix into smaller N × M submatrices, with M large enough that each submatrix has enough samples, perform matrix completion on these smaller matrices, and then combine the resulting column space estimates. This might work if matrix completion were unbiased, but since the estimates tend to be the solution of a regularized problem, they tend to be biased (and bias correction is not simple [20]).

As for active learning, there have been experimental results showing it can help matrix factorization and completion [10, 17, 22], but they rarely come with theoretical guarantees. [21], like us, consider a setting where customers are arriving with time, but their algorithm deviates from uniform sampling only for minimization of a bandit-like regret quantity, not for better estimation. As mentioned above, [27, 28] prove theoretical results on matrix completion with active sampling, but they violate the budget assumption by sampling some columns in their entirety. [15] prove active sampling can help, but they share the drawbacks of using the first (covariance matrix estimation) approach and their estimate is drawn stochastically from a distribution (even after fixing the observations), resulting in error bounds that hold only in expectation, not with high probability.

While matrix completion results do not apply to our setting, in this work, we will leverage some of the technical components from that literature. In particular, we show theoretically that alternating minimization will consistently recover the column subspace, both for uniformly random sampling and for active sampling.

1.2 Organization

The paper is organized in the following way: We first state the notation and assumptions (Section 2), followed by our algorithms (Section 3). We then state our theoretical results (Section 4) and present our experimental results (Section 5). We conclude by mentioning ideas of the proof (Section 6) followed by a brief summary (Section 7).

2 Background

Notation For N ∈ ℕ, we use [N] to denote the set {1, …, N}, and for a matrix M and i ∈ [rank(M)], we use σi(M) to denote the i-th largest singular value of M. For a matrix Y, given Ω, a subset of indices (typically the indices of the observed entries), we define PΩ(Y) by setting the entries with indices not in Ω to 0:

 (PΩ(Y))ij = Yij if (i, j) ∈ Ω, and (PΩ(Y))ij = 0 if (i, j) ∉ Ω.

For a ≤ b, we denote by [a:b] the set {a, a+1, …, b}. We take complements of these sets with respect to the relevant ambient index set. The singular value decomposition (SVD) of a matrix M expresses M as UΣVᵀ, where r is the rank of M, the columns of U are orthonormal (known as the left singular vectors of M), the columns of V are orthonormal (the right singular vectors of M), and Σ is diagonal and contains the singular values. ∥M∥F is the Frobenius norm, given by ∥M∥F² = Σij Mij². We use ∥M∥ to denote the operator norm, given by σ1(M), the largest of the singular values of M. Throughout our paper, T will denote the total number of columns of Y that are available, whereas M is the second dimension of an (N × M) submatrix we are considering at a particular point.
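For concreteness, the operator PΩ can be implemented as a simple mask. This is our own illustrative sketch (the function name is ours, not from the paper):

```python
import numpy as np

def P_Omega(Y, Omega):
    """P_Omega: keep the entries of Y indexed by Omega, zero out the rest.

    Omega is an iterable of (i, j) pairs of observed indices.
    """
    out = np.zeros_like(Y, dtype=float)
    for i, j in Omega:
        out[i, j] = Y[i, j]
    return out
```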

2.1 Assumptions

Our goal is to estimate the column space of an approximately low rank matrix as the number of columns of the matrix grows. This is not possible for an arbitrary growing matrix Y. As an extreme example, if all the columns after some point are identically zero, then we will no longer be able to learn anything about the column space, which means we need to assume that the r-th singular value of the data is “not too small”. On the other hand, if the spectrum keeps growing too fast, we will only fit on the latest columns, which makes learning impossible, so the top singular value needs to be “not too large”.

First, we will assume that Y arises from a low rank plus noise model. We will assume that the noise is actually Gaussian because we will use its rotational symmetry in the proofs. It is likely possible to relax this to more general classes of noise matrices, but we leave this for future work.

Assumption 2.1 (Low rank plus Gaussian noise).

Y = ˚X Wᵀ + Z, where ˚X ∈ ℝ^{N×r} and W ∈ ℝ^{T×r}, and where Z ∈ ℝ^{N×T} has i.i.d. N(0, σz²) entries.

Next, we need to make assumptions about W. Before stating the assumptions, we first define the ψ2-norm.

Definition 2.2 (ψ2-norm).

For a real-valued random variable A, its ψ2-norm is defined by

 ∥A∥ψ2=inf{u>0 : Eexp(A2/u2)≤2}.
Definition 2.3 (sub-Gaussian).

We say that a real random variable A is 1-sub-Gaussian if ∥A∥ψ2 ≤ 1. We say that a random variable w with values in ℝ^r is 1-sub-Gaussian if ⟨w, u⟩ is 1-sub-Gaussian for all u ∈ ℝ^r with ∥u∥ = 1.

As alluded to previously, we have assumptions that control the growth of W (and therefore of Y) to be not too large and not too small. Because we have two phases in the algorithm, initialization and iteration, we require two forms of these bounds. For initialization, our assumption is essentially the same as Assumption 1 of [31].

Assumption 2.4 (sub-Gaussian Wt).

For each t, each column wm of Wtᵀ satisfies:

1. wm is drawn independently (for each m) from a 1-sub-Gaussian distribution;

2. there exists a numerical constant c1 with 0 < c1 ≤ 1 such that

 E(⟨wm, u⟩²) ≥ c1 ∥⟨wm, u⟩∥²ψ2 ∀u ∈ ℝ^r.

For iteration, we also need non-asymptotic bounds on the singular values, which would hold if the entries of W were i.i.d. Gaussian, by results from random matrix theory (see Corollary 5.35 of [35]).

Assumption 2.5 (Growth of Singular Values).

We assume that rank(˚X) = r, and that there exists a t0 large enough that for every t ≥ t0, Wt satisfies

 σr(˚X Wtᵀ) ≥ (3/4) σr(˚X) √t,  ∥˚X Wtᵀ∥ ≤ (3/2) σ1(˚X) √t (1)

with high probability.

For matrix completion, we need an incoherence assumption as in [5], [4], and [33]. There are many ways of interpreting this parameter, but intuitively, it says that observing an entry actually gives information about other entries. It turns out that generating i.i.d. Gaussians for each entry of W will produce right singular vectors that are incoherent: writing the SVD as Y = UΣVᵀ, the quantity maxi ∥PV ei∥² is, with high probability, bounded by μ0 r/T for some constant μ0 (see Lemma 2.2 of [4]). Here PV denotes projection onto the column space of V. This metric is equivalent to the coherence definition given below, which leads to Assumption 2.7.

Definition 2.6.

The coherence of an n × r matrix V with orthonormal columns is μ(V) = (n/r) max1≤i≤n ∥PV ei∥².

Assumption 2.7 (Incoherence).

There exists some constant μ0 such that for large enough M, for any subset S of [T] of size M, with high probability, the right singular vectors VS of the corresponding N × M submatrix satisfy μ(VS) ≤ μ0.

Note we do not assume incoherence of the column space of ˚X. In practice, having an incoherent column space is probably helpful. But for our theoretical results, because N is fixed while the number of columns grows, incoherence of the column space, which provides high probability bounds with respect to N (not T), is not as useful.

An example that satisfies all these assumptions is when each column of Y is drawn i.i.d. according to N(0, Σ + σz² I) for some rank-r covariance matrix Σ.
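As an illustrative sketch (ours, with made-up dimensions), such an example can be simulated directly from the low-rank-plus-Gaussian-noise form of Assumption 2.1:

```python
import numpy as np

rng = np.random.default_rng(0)
N, r, T, sigma_z = 20, 3, 500, 0.1  # made-up dimensions for illustration

X0 = rng.standard_normal((N, r))           # plays the role of the N x r factor
W = rng.standard_normal((T, r))            # i.i.d. Gaussian coefficients
Z = sigma_z * rng.standard_normal((N, T))  # Gaussian observation noise
Y = X0 @ W.T + Z  # each column is distributed N(0, X0 X0^T + sigma_z^2 I)
```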

3 Algorithms

One way to view the column space of a matrix X is as the span of the top r eigenvectors of XXᵀ. We have XXᵀ = Σt xt xtᵀ, where the xt are the columns of X. If we sampled each entry uniformly at random with probability q, we can get an estimate of each xt xtᵀ in the following way: let yt be the columns of PΩ(Y), and consider yt ytᵀ. For independent Bernoulli(q) sampling, each off-diagonal entry of yt ytᵀ is a product of two independently observed entries, so dividing off-diagonal entries by q² and diagonal entries by q yields a matrix whose expectation matches that of xt xtᵀ (plus a noise term). So if we compute the eigenvectors of this rescaled empirical covariance matrix, we might expect them to be close to the eigenvectors of XXᵀ under mild assumptions. This is the approach taken by [15] and [32]. Indeed, under our assumptions, this will properly estimate the column subspace in expectation (Lemma 2 in [15]). If we exactly compute the eigendecomposition (which is computationally less efficient but has the best theoretical guarantees), we obtain ScaledPCA (Algorithm 6, essentially the same as POPCA of [15]), whose pseudocode is included in the Appendix.

This is a nice and intuitive algorithm, but for matrix completion, it is known that methods based purely on spectral decompositions are outperformed by methods based on optimizing the Frobenius norm of the recovery error (such as least squares, gradient descent, or message passing) [25]. What is worse for ScaledPCA is that because it estimates the covariance matrix first, it essentially pays a q² penalty in terms of missingness instead of a q penalty.

In this work, we give a proof that alternating minimization (Algorithm 1) can indeed be used to recover the column subspace. Algorithm 1 performs spectral initialization followed by alternating minimization, using some of the samples (k(1) per column) to estimate W and the remaining samples (k(2) per column) to estimate ˚X. Algorithm 1 uses two subroutines, Sample and MedianLS. MedianLS uses SmoothQR [16], a version of QR factorization that adds noise before performing QR, which, for completeness, we include in Section D.1 of the Appendix. SmoothQR helps maintain incoherence of the estimate of ˚X in MedianLS, and taking the median of estimates of ˚X leads to a higher probability bound; both are useful for our theory, but not necessary in practice [16].

We denote by Ω ∼ U(N, k) a subset that is sampled uniformly at random among subsets of [N] of size k. In our algorithms, we assume we have enough columns to observe (e.g., for Algorithm 1, T ≥ Minit + sM). C is an absolute constant that is not required as input. μ0 is the constant from our incoherence assumption (Assumption 2.7). Comments in the pseudocode are marked with ▷.

Practical Considerations

We state our algorithms in a way that is natural for proving theoretical results, which is the main goal of this paper. However, for more practical purposes, the large block size M might at first seem prohibitive in MedianLS. We mitigate this in the following way: first, as mentioned above, the SmoothQR step in Line 6 of MedianLS and the median step in Line 13 are not necessary in practice. Therefore, given an estimate of ˚X, we need only perform two linear least squares regressions (lines 4 and 11). The first regression (line 4), which fits the coefficient vectors w, can be done separately for each column. The second regression, which fits ˚X (line 11), can be performed in an online manner. Two possible options are to perform least squares recursively (which gives exactly the same result as doing a batch linear least squares), or to do gradient descent (which is more practical).

Both of these options process one column at a time (instead of processing columns as a block as in Lines 17 and 18 of Algorithm 1), and lead to time and space complexity that is linear in the block size M.
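The recursive option can be sketched per row of the estimate of ˚X. The class below is our own illustration (not the paper's pseudocode): maintaining the normal equations means each update costs O(r²) time and storage, independent of how many columns have been processed.

```python
import numpy as np

class OnlineRowLS:
    """Recursive least squares for one row of the estimate of the factor.

    For each new column, call update() with the column's fitted coefficient
    vector w (length r) and the observed entry y of this row (if observed).
    Keeping A = lam*I + sum(w w^T) and b = sum(y w) reproduces the batch
    ridge solution at every step.
    """
    def __init__(self, r, lam=1e-6):
        self.A = lam * np.eye(r)
        self.b = np.zeros(r)

    def update(self, w, y):
        self.A += np.outer(w, w)
        self.b += y * w

    def estimate(self):
        return np.linalg.solve(self.A, self.b)
```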

Active Sampling

Our proof naturally leads to an active sampling strategy that can help subspace recovery, as confirmed in our experiments. Each iteration of fitting a coefficient vector w (Line 3 of MedianLS) is a linear least squares regression, whose estimation error decreases as the minimum singular value of the design matrix increases. Therefore, a good candidate strategy for Sample is to choose the rows of the current estimate of ˚X so as to maximize the minimum singular value of the induced submatrix. More precisely, for S ⊂ [N], we define QS as the operator that projects an N × r matrix to the |S| × r submatrix of rows specified by S. (The objective in Equation (2) is invariant to the ordering chosen on S.) Given an estimate X of ˚X, our active sampling chooses

 Ω∗(X;k(1))=argmaxS⊂[N],|S|=k(1) σr(QS(X)), (2)

as Ω(1). We will need other samples of rows of Y to estimate ˚X from this estimated W, and we choose these samples randomly so that we get equal information about every row of ˚X, i.e., Ω(2) is chosen uniformly at random.
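For small N, the subset in Equation (2) can be found by exhaustive search. The sketch below (our own, purely illustrative) makes the objective concrete:

```python
import numpy as np
from itertools import combinations

def best_rows_exhaustive(X, k):
    """Solve Equation (2) by brute force: choose k rows of the N x r
    matrix X maximizing the smallest singular value of the induced
    submatrix. Only viable for small N."""
    best, best_val = None, -np.inf
    for S in combinations(range(X.shape[0]), k):
        val = np.linalg.svd(X[list(S), :], compute_uv=False)[-1]
        if val > best_val:
            best, best_val = list(S), val
    return best, best_val
```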

4 Theoretical Results

Budget per column

In the following theorems, we will assume that k(1) ≥ r and k(2) ≥ 1. We need k(1) to be at least r because we observe k(1) entries per column for Line 4 of MedianLS. However, k(2) need not be as large because, as the number of columns tends to infinity, we will observe a growing number of entries in each row. Therefore, the total number of required samples is only k(1) + k(2) = r + 1 per column. But we do not recommend setting k(2) as low as 1 in practice, especially without sample splitting.

Subspace Recovery Metric

For our theorem statements, we let U be the N × r matrix whose orthonormal columns are the left singular vectors of ˚X. In general, when we compute the SVD of an estimate, the resulting singular vectors might not be the same as those of U, even when they span the same subspace. We use a distance measure on subspaces that does not depend on such representations, namely the sine of the largest principal angle between subspaces. This can be defined for two matrices U1, U2 with orthonormal columns by sinθ(U1, U2) = √(1 − σr(U1ᵀU2)²) [37]. Note that sinθ(U1, U2Q) = sinθ(U1, U2) for any orthogonal matrix Q.
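This metric can be computed from the singular values of U1ᵀU2, which are the cosines of the principal angles. A sketch (ours):

```python
import numpy as np

def sin_largest_principal_angle(U1, U2):
    """Sine of the largest principal angle between the column spaces of
    U1 and U2, both assumed to have orthonormal columns."""
    # The singular values of U1^T U2 are the cosines of the principal
    # angles; the largest angle corresponds to the smallest cosine.
    cos_min = np.linalg.svd(U1.T @ U2, compute_uv=False)[-1]
    cos_min = min(cos_min, 1.0)  # guard against floating-point round-off
    return np.sqrt(1.0 - cos_min**2)
```

The value is unchanged if U2 is replaced by U2 Q for an orthogonal Q, as the metric requires.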

Initialization

The initialization conditions are quite stringent in theory, but in practice, as has been empirically shown in other optimization approaches, a much milder initialization can suffice. This is consistent with our own experiments in Section 5.

Proofs of all theorems may be found in the Appendix (Section B). For ease of notation, we define q(1) = k(1)/N (and similarly q(2) = k(2)/N). Note that q(1) ≤ 1, and that q(1) is a decreasing quantity with respect to N. In order to simplify our bounds a little, we will additionally assume that k(1) + k(2) ≤ N/2, which implies that q(1) ≤ 1/2.

4.1 Active Sampling

Noise in the observations presents an obstacle to recovering the column space: if the noise variance is too large compared to the r-th singular value of ˚X, the noise can drown out the ‘signal’ when performing alternating minimization. Therefore, we impose Assumption 4.1 or 4.5 to ensure that we have enough signal.

Assumption 4.1 (Size of Noise for Active Sampling).

.

There are two factors that influence the rate of convergence. One factor is that we only have partial observations. The other factor is that we have noise in our observations. When σz is small compared to the desired accuracy ϵ,

 σz√k(1)≤ϵσ1(˚X)√r, (3)

the effect of having only partial observations dominates. For instance, this is always true when observations do not contain noise. When σz is large compared to the desired accuracy ϵ,

 σz√k(1)≥ϵσ1(˚X)√r, (4)

the effect of noise dominates. Therefore, we prove different convergence rates for each regime.

Theorem 4.2 (Noisy observations, active sampling, for small σZ/ϵ).

Suppose Assumptions 2.1, 4.1, 2.4, 2.5, 2.7 hold, k(1) ≥ r, k(2) ≥ 1, and Equation (3) holds. Then there exist constants Cinit, Citer such that for all ϵ ∈ (0, 1), if we initialize with Minit columns, where

 Minit ≥ Cinit σ1(˚X)^6 N² (log Minit)³ r² / (σr(˚X)^6 (k(1)+k(2))² q(1)), (5)

and we use s blocks, where

 s ≥ log( σr(˚X) √q(1) / (48 √r ϵ) ), (6)

and each block has size M, with

 M ≥ Citer σ1(˚X)^6 r³ N (log M)² / (σr(˚X)^6 k(2) q(1)) + log(1/ϵ), (7)

then ColumnSpaceEstimate(PΩ(Y), r, k(1), k(2), Minit, M, True) returns an X̂ such that sinθ(X̂, U) ≤ ϵ with high probability.

Whenever Theorem 4.2 holds, the sample complexity grows only logarithmically with 1/ϵ, which is a feature of a matrix completion approach (versus a spectral approach, which always has a 1/ϵ² dependence) in the small σz regime.

When the noise level σz is large compared to the desired accuracy ϵ, we can still get a σz-dependent bound, now with a 1/ϵ² dependence on the desired accuracy. The initialization step for this regime consists of running Algorithm 1 instead of a spectral initialization. The full pseudocode for DoubleColumnSpaceEstimate (Algorithm 4) can be found in the Appendix.

Theorem 4.3 (Noisy observations, active sampling, for large σZ/ϵ).

Suppose Assumptions 2.1, 4.1, 2.5, 2.7 hold, k(1) ≥ r, k(2) ≥ 1. Let ϵ satisfy Equation (4). Then there exist constants Cinit, Citer such that for every such ϵ, if we initialize with Minit columns, where

 Minit ≥ Cinit σ1(˚X)^6 N² (log Minit)³ r² / (σr(˚X)^6 (k(1)+k(2))² q(1))

and perform alternating minimization with blocks of size

 M1 ≥ Citer σ1(˚X)^6 r³ N (log M)² / (σr(˚X)^6 k(2) q(1)) + log(1/ϵ),

followed by alternating minimization with blocks of size M2,

then DoubleColumnSpaceEstimate(PΩ(Y), r, k(1), k(2), Minit, M1, M2, s, True) returns an X̂ such that sinθ(X̂, U) ≤ ϵ with high probability.

Comparison with ScaledPCA

We compare with the theoretical results from using the ScaledPCA approach with Proposition 3 from [31], as [15] show theorems in a different setting, use a different metric, and prove bounds only in expectation. For simplicity, we will omit dependence on the condition number (assume σ1(˚X)/σr(˚X) = O(1)) and assume that k(1) and k(2) are of the same order. When σz is small (Equation (3)), Theorem 4.2’s logarithmic dependence on 1/ϵ is better than the 1/ϵ² dependence of [31], but the dependence on N and r is worse by a polynomial factor. When σz is large (Equation (4), Theorem 4.3), our sample complexity exceeds that of [31] by a factor that can be fairly small.

4.2 Uniformly random sampling

When we use random sampling, there is a chance per column that we might choose a “bad” subset; this chance is small, but it does not shrink as the number of columns grows. Since we need to avoid “bad” subsets for all columns, in the regime of growing T, this would give us an unacceptable probability of failure in theory, though in practice, this probably does not occur. Therefore, we assume that the true ˚X has no “bad” subsets and use a longer initialization period to ensure that our estimate also has no “bad” subsets. When ˚X has rank r (which is true by Assumption 2.5), the assumption about the absence of “bad” subsets is equivalent to the k(1)-isomeric condition of [30].

Definition 4.4 (k-isomeric [30]).

A matrix X is called k-isomeric if and only if any k rows of X can linearly represent all rows of X.

We define σ∗(X; k(1)), the smallest singular value over any k(1) rows of a matrix X; minimizing over subsets is the opposite of the maximization criterion in Equation (2).

 σ∗(X; k(1)) := min{S⊂[N], |S|=k(1)} σr(QS(X)) (8)

Assuming that X has rank r, if X is k(1)-isomeric, then σ∗(X; k(1)) > 0.

We note that every N × r matrix X with orthonormal columns has σ∗(X; k(1)) > 0 for k(1) ≥ r by Lemma C.5, and in fact, σ∗ could be arbitrarily small. For random sampling, σ∗(˚X; k(1)) will play (up to a constant factor) the same role that the maximized singular value σr(QΩ∗(˚X)) plays in active sampling, for instance, in the bound on the noise variance.
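The quantity σ∗(X; k) in Equation (8) can likewise be computed by exhaustive search for small N (an illustrative sketch, not practical for large N):

```python
import numpy as np
from itertools import combinations

def sigma_star(X, k):
    """sigma_*(X; k) from Equation (8): the worst-case smallest singular
    value over all k-row submatrices of X (exhaustive; small N only)."""
    return min(np.linalg.svd(X[list(S), :], compute_uv=False)[-1]
               for S in combinations(range(X.shape[0]), k))
```

For a rank-r matrix X, sigma_star(X, k) > 0 exactly when no k-row submatrix is rank deficient, matching the k-isomeric condition.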

Assumption 4.5 (Size of Noise for Random Sampling).

.

The difference in sample complexity between active and random sampling is the difference between σr(QΩ∗(˚X)) and σ∗(˚X; k(1)). Theorems 4.2 and 4.3 still hold with exactly the same proof if we replace the former with the latter. With this replacement, the corresponding bound for the active learning case will always be at least as good as the bound for the random sampling case. For instance, because there is, in general, no lower bound for σ∗(˚X; k(1)), we cannot give an upper bound on the initialization length for random sampling that holds independently of ˚X, which is something we can do in the case of active sampling. The full statements and proofs of the theorems for the uniformly random sampling case (Theorems A.1 and A.2) can be found in the Appendix.

5 Experiments

Synthetic data

For the simulated data experiments, we use the model from Assumption 2.1 with i.i.d. Gaussian columns. That is, for each simulation, we generate a fixed ˚X, and we generate the t-th column by yt = ˚X wt + zt, where wt ∼ N(0, Ir) and zt ∼ N(0, σz² IN), independently across t. Since we do not require ˚X to be incoherent (which would result from light-tailed distributions), we use a heavy-tailed distribution (specifically the standard Cauchy distribution) to generate each entry of ˚X independently. The values of N, r, and σz are held fixed across runs.

MIMIC data

For our real data experiments, we use the MIMIC II dataset, which contains data for ICU visits at Beth Israel Deaconess Medical Center in Boston, Massachusetts between 2001 and 2008 [29]. We focused on patients aged 18-89 (inclusive) who were having their first ICU visit and who stayed in the ICU for at least 3 days. For these patients (columns), we took 1269 features, which mostly include lab test results. Because the data has many missing entries, we restricted the data to those columns and rows that had less than 50% missing entries, which led to 115 covariates (rows) and 14584 patients (columns). Then, for each run, we randomly chose a submatrix of covariates and patients, and we use the same r as in the simulated data. To evaluate column space recovery, we estimated a “ground truth” using SVD on our data, with missing values replaced by zeros. However, when evaluating recovery, we only measure error on the non-missing values (i.e., those that were present in the data, which is a strict superset of those that were observed by the algorithms).

Approximately active greedy sampling

We choose a fixed number k of entries to sample per column. For active sampling, we split the budget as k = k(1) + k(2), and for random sampling, we draw all k samples uniformly at random, so that both strategies observe the same number of samples per column. Ideally, our active sampling method would choose the subset of size k(1) that satisfies Equation (2). However, since exhaustive search is computationally infeasible, we use an efficient method that approximates this optimization, namely, Algorithm 1 from [1]. This algorithm produces an S such that σr(QS(X)) is within a guaranteed factor of the optimum; the factor is greater than 1, but it behaves well for small k(1). Analogues of Theorems 4.3 and 4.2, with σr(QΩ∗(X)) replaced by the guarantee of the approximation, hold when we use this approximation algorithm for active sampling.
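We do not reproduce Algorithm 1 of [1] here; the sketch below is a simpler greedy heuristic in the same spirit (our own stand-in, with no approximation guarantee), growing the subset one row at a time:

```python
import numpy as np

def greedy_rows(X, k):
    """Grow a row subset S one element at a time, always adding the row
    that maximizes the smallest singular value of X[S]. A heuristic
    stand-in for the subset-selection step; not Algorithm 1 of [1]."""
    S, remaining = [], list(range(X.shape[0]))
    for _ in range(k):
        best = max(remaining,
                   key=lambda i: np.linalg.svd(X[S + [i], :],
                                               compute_uv=False)[-1])
        S.append(best)
        remaining.remove(best)
    return S
```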

Deviation from theoretical assumptions

Our recovery methods operate in a more practical setting than our theory requires. For alternating minimization, the initialization uses far fewer columns than our theorems require, we do not do sample splitting, we do not fix the time horizon beforehand, and we update the estimate of ˚X as we partially observe each column. This continual updating means that even if we chose a subset at time t such that the corresponding minimum singular value was large, when we use it at some later timestep, that singular value may no longer be large. We also skip the SmoothQR and Median steps and add L2 regularization for stabilization.

Matrix recovery

In many cases, the reason that we care about recovering subspaces accurately is so that we can recover the original matrix accurately. Therefore, we also measure matrix recovery. Given an estimate X̂ of the column subspace, the corresponding matrix estimate Ŷ is computed by imputing the missing entries using the best regularized least-squares fit over the observed entries: for each column t, ŷt = X̂ŵt, where ŵt = argminw ∥PΩt(X̂w − yt)∥² + λ∥w∥². The algorithms do not have to fit the entries that they have observed, i.e., we keep (Ŷ)it = Yit for observed entries (i, t).
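A per-column sketch of this imputation step (our own; `lam` plays the role of the regularization parameter λ):

```python
import numpy as np

def impute_column(X_hat, y_obs, obs_idx, lam=1e-6):
    """Impute one column: ridge-regress its observed entries on the
    corresponding rows of X_hat, then fill in the whole column."""
    A = X_hat[obs_idx, :]
    w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y_obs)
    return X_hat @ w
```

On entries that were actually observed, the imputed value would then be overwritten by the observation itself.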

5.1 Results

Figure 1 shows the results of our simulations, averaged over 50 runs. Our active sampling method samples k(1) entries per column as described above (approximately active greedy sampling) and samples the remaining k(2) entries uniformly at random. We compare three methods: ScaledPCA (green), alternating minimization with uniformly random sampling (orange), and alternating minimization with active sampling (blue). We denote by X̂t and Ŷt the estimates of the column space and of Y after observing t (total) columns. We perform the initialization step with 100 columns, and plot the error as additional columns are observed, for 1000 additional columns for the simulated data and 5000 additional columns for the MIMIC II data. We indicate standard error through shading. In Figures 1(a) and 1(c), the error is the sine of the largest principal angle between the estimated and true subspaces, as discussed in Section 4, and in Figures 1(b) and 1(d), we use the normalized matrix recovery error, given by ∥Ŷ − Y∥F/∥Y∥F for the simulated data. Since we do not know all the entries of the MIMIC II dataset, we use the same normalized error restricted to the entries for which we have ground truth in the dataset (many of which were not observed by the algorithms).

Column space recovery

Figures 1(a) and 1(c) show that alternating minimization (with either random or active sampling) recovers the column space more accurately than ScaledPCA. Furthermore, when using alternating minimization, active sampling results in a lower column space recovery error than uniformly random sampling.

Matrix recovery

In Figures 1(b) and 1(d), we can see that when algorithms have more accurate column space estimates, the corresponding matrix estimates also tend to be more accurate. In Figure 1(d), for the first few hundred columns, alternating minimization with random sampling has a less accurate matrix estimate than ScaledPCA. However, this happens only while alternating minimization with random sampling has a poor column space estimate (though still slightly better than that of ScaledPCA). Moreover, the relative performance of alternating minimization with random sampling improves (both for matrix and column space recovery) as the number of observed columns grows, which is the setting of our theoretical results. Also, note that alternating minimization with active sampling always performs better than ScaledPCA.

6 Ideas of the Proof

Each iteration of alternating minimization involves optimizing W given a fixed estimate of ˚X, and then optimizing the estimate of ˚X given this W.

[19] and [16] argue that each minimization step is similar to performing a step of the power method (e.g., finding the top eigenvector of a symmetric matrix A by repeatedly setting v ← Av/∥Av∥). In their setting, each least squares step contracts the error of one factor in terms of the other, leading to successively better estimation of both factors with each iteration. (Here, the two factors represent the row space and column space, respectively, of the de-noised version of Y.)
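The power-method analogy can be made concrete; the sketch below (ours) finds the top eigenvector of a symmetric matrix by repeated multiplication and normalization:

```python
import numpy as np

def power_method(A, iters=200, seed=0):
    """Top eigenvector of a symmetric matrix A via v <- A v / ||A v||."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v
```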

In our setting, because of the asymmetry between the two dimensions, this symmetric contraction no longer holds: fitting each coefficient vector w from only k(1) observed entries can increase the error. However, it remains true that the step that refits the estimate of ˚X from a block of columns decreases the error. Furthermore, it turns out that by adjusting the block size M appropriately, we can make this decrease large enough to compensate for the increase from fitting the w’s. In a way, this is in the spirit of averaging multiple estimates of the column subspace, by first passing through W, and collecting information from enough columns of Y to gain a more accurate estimate.

In the small σz regime, this decrease is in fact multiplicative, leading to exponential convergence in the number of iterations.

7 Conclusion

In this work, we proved that an alternating minimization approach to estimating the column subspace of a partially observed matrix succeeds: as the number of columns grows, we can estimate the column space to any given accuracy with probability tending to 1. We showed theoretically and experimentally that this approach works better than the naive one that performs PCA on the elementwise rescaled empirical covariance matrix. We also showed that using some actively chosen samples in addition to random samples outperforms purely random sampling.