Low-Rank Matrix and Tensor Completion via Adaptive Sampling

Akshay Krishnamurthy akshaykr@cs.cmu.edu Computer Science Department
Carnegie Mellon University
Aarti Singh aarti@cs.cmu.edu Machine Learning Department
Carnegie Mellon University
Abstract

We study low-rank matrix and tensor completion and propose novel algorithms that employ adaptive sampling schemes to obtain strong performance guarantees. Our algorithms exploit adaptivity to identify entries that are highly informative for learning the column space of the matrix (tensor), and consequently our results hold even when the row space is highly coherent, in contrast with previous analyses. In the absence of noise, we show that one can exactly recover an $n \times n$ matrix of rank $r$ from merely $O(n r^{3/2} \log r)$ adaptively chosen entries. We also show that one can recover an order-$T$ tensor from a number of entries that scales linearly in the dimensions and polynomially in the rank. For noisy recovery, our algorithm consistently estimates a low-rank matrix corrupted with noise from a comparable number of entries. We complement our study with simulations that verify our theory and demonstrate the scalability of our algorithms.

1 Introduction

Recently, the machine learning and signal processing communities have focused considerable attention on understanding the benefits of adaptive sensing. This theme is particularly relevant to modern data analysis, where adaptive sensing has emerged as an efficient alternative for obtaining and processing the large data sets associated with scientific investigation. These empirical observations have led to a number of theoretical studies characterizing the performance gains offered by adaptive sensing over conventional, passive approaches. In this work, we continue in that direction and study the role of adaptive data acquisition in low-rank matrix and tensor completion problems.

Our study is motivated not only by prior theoretical results in favor of adaptive sensing but also by several applications where adaptive sensing is feasible. In recommender systems, obtaining a measurement amounts to asking a user about an item, an interaction that has been deployed in production systems. Another application pertains to network tomography, where a network operator is interested in inferring latencies between hosts in a communication network while injecting few packets into the network. The operator, being in control of the network, can adaptively sample the matrix of pair-wise latencies, potentially reducing the total number of measurements. In particular, the operator can obtain full columns of the matrix by measuring from one host to all others, a sampling strategy we will exploit in this paper.

Yet another example centers around gene expression analysis, where the object of interest is a matrix of expression levels for various genes across a number of conditions. There are typically two types of measurements: low-throughput assays provide highly reliable measurements of single entries in this matrix while high-throughput microarrays provide expression levels of all genes of interest across operating conditions, thus revealing entire columns. The completion problem can be seen as a strategy for learning the expression matrix from both low- and high-throughput data while minimizing the total measurement cost.

1.1 Contributions

We develop algorithms with theoretical guarantees for three low-rank completion problems. The algorithms find a small subset of columns of the matrix (tensor) that can be used to reconstruct or approximate the matrix (tensor). We exploit adaptivity to focus on highly informative columns, and this enables us to do away with the usual incoherence assumptions on the row-space while achieving competitive (or in some cases better) sample complexity bounds. Specifically our results are:

  1. In the absence of noise, we develop a streaming algorithm that enjoys both low sample requirements and low computational overhead. In the matrix case, we show that $O(n r^{3/2} \log r)$ adaptively chosen samples are sufficient for exact recovery, improving on the best known bound of $O(n r \log^2 n)$ (with additional coherence factors) in the passive setting [21]. This also gives the first guarantee for matrix completion with coherent row space.

  2. In the tensor case, we establish that a number of adaptively chosen samples scaling linearly in the tensor dimensions (and polynomially in the rank) is sufficient for recovering an order-$T$ tensor of rank $r$. We complement this with a necessary condition for tensor completion under random sampling, showing that our adaptive strategy is competitive with any passive algorithm. These are the first sample complexity upper and lower bounds for exact tensor completion.

  3. In the noisy matrix completion setting, we modify the adaptive column subset selection algorithm of Deshpande et al. [10] to give an algorithm that finds a rank-$r$ approximation to a matrix using a comparable number of samples. As before, the algorithm does not require an incoherent row space, but we are no longer able to process the matrix sequentially.

  4. Along the way, we improve on existing results for subspace detection from missing data, the problem of testing if a partially observed vector lies in a known subspace.

2 Related Work

The matrix completion problem has received considerable attention in recent years. A series of papers [6, 7, 13, 21], culminating in Recht’s elegant analysis of the nuclear norm minimization program, address the exact matrix completion problem through the framework of convex optimization, establishing that roughly $O(\max(\mu_1^2, \mu_0)\, r (n_1 + n_2) \log^2 n_2)$ randomly drawn samples are sufficient to exactly identify an $n_1 \times n_2$ matrix of rank $r$. Here $\mu_0$ and $\mu_1$ are parameters characterizing the incoherence of the row and column spaces of the matrix, which we will define shortly. Candès and Tao [7] proved that under random sampling on the order of $\mu_0 n r \log n$ samples are necessary, showing that nuclear norm minimization is near-optimal.

The noisy matrix completion problem has also received considerable attention [5, 17, 20]. The majority of these results also involve some parameter that quantifies how much information a single observation reveals, in the same vein as incoherence.

Tensor completion, a natural generalization of matrix completion, is less studied. One challenge stems from the NP-hardness of computing most tensor decompositions, pushing researchers to study alternative structure-inducing norms in lieu of the nuclear norm [11, 22]. Both papers derive algorithms for tensor completion, but neither provides sample complexity bounds for the noiseless case.

Our approach involves adaptive data acquisition, and consequently our work is closely related to a number of papers focusing on using adaptive measurements to estimate a sparse vector [9, 15]. In these problems, where the sparsity basis is known a priori, we have a reasonable understanding of how adaptive sampling can lead to performance improvements. Since a low-rank matrix is sparse in its unknown eigenbasis, the completion problem is coupled with learning this basis, which poses a new challenge for adaptive sampling procedures.

Another relevant line of work stems from the matrix approximation literature. Broadly speaking, this research is concerned with efficiently computing a structured matrix, i.e. a sparse or low-rank matrix, that serves as a good approximation to a fully observed input matrix. Two methods that apply to the missing-data setting are the Nyström method [12, 18] and entrywise subsampling [1]. While the sample complexity bounds match ours, the analysis of the Nyström method has focused on positive-semidefinite kernel matrices and requires incoherence of both the row and column spaces. On the other hand, entrywise subsampling is applicable, but the guarantees are weaker than ours.

It is also worth briefly mentioning the vast body of literature on column subset selection, the task of approximating a matrix by projecting it onto a few of its columns. While the best algorithms, namely volume sampling [14] and sampling according to statistical leverages [3], do not seem to be readily applicable to the missing data setting, some algorithms are. Indeed our procedure for noisy matrix completion is an adaptation of an existing column subset selection procedure [10].

Our techniques are also closely related to ideas employed for subspace detection – testing whether a vector lies in a known subspace – and subspace tracking – learning a time-evolving low-dimensional subspace from vectors lying close to that subspace. Balzano et al. [2] prove guarantees for subspace detection with known subspace and a partially observed vector, and we will improve on their result en route to establishing our guarantees. Subspace tracking from partial information has also been studied [16], but little is known theoretically about this problem.

3 Definitions and Preliminaries

Before presenting our algorithms, we clarify some notation and definitions. Let $M \in \mathbb{R}^{n_1 \times n_2}$ be a rank-$r$ matrix with singular value decomposition $U \Sigma V^{\top}$. Let $M_1, \ldots, M_{n_2}$ denote the columns of $M$.

Let $\mathbf{T} \in \mathbb{R}^{n_1 \times \cdots \times n_T}$ denote an order-$T$ tensor with canonical decomposition:

$$\mathbf{T} = \sum_{k=1}^{r} a_k^{(1)} \otimes a_k^{(2)} \otimes \cdots \otimes a_k^{(T)}, \qquad (1)$$

where $\otimes$ is the outer product. Define $\mathrm{rank}(\mathbf{T})$ to be the smallest value of $r$ that establishes this equality. Note that the vectors $a_k^{(t)}$ need not be orthogonal, nor even linearly independent.

The mode-$T$ subtensors of $\mathbf{T}$, denoted $\mathbf{T}_i$ for $i = 1, \ldots, n_T$, are order-$(T-1)$ tensors obtained by fixing the $i$th coordinate of the $T$-th mode. For example, if $\mathbf{T}$ is an order-$3$ tensor, then its mode-$3$ subtensors are the frontal slices.

We represent a $d$-dimensional subspace $U \subseteq \mathbb{R}^n$ as a set of orthonormal basis vectors $U = \{u_i\}_{i=1}^{d}$ and, in some cases, as an $n \times d$ matrix whose columns are the basis vectors. The interpretation will be clear from context. Define the orthogonal projection onto $U$ as $\mathcal{P}_U = U(U^{\top}U)^{-1}U^{\top}$.

For a set $\Omega \subseteq [n]$ (we write $[n]$ for $\{1, \ldots, n\}$), $v_\Omega$ is the vector whose elements $v_i$, $i \in \Omega$, are indexed lexicographically. Similarly, the matrix $U_\Omega$ has rows indexed by $\Omega$ lexicographically. Note that if $U$ is an orthobasis for a subspace, $U_\Omega$ is a $|\Omega| \times d$ matrix with columns $u_{i\Omega}$, $i = 1, \ldots, d$, rather than a set of orthonormal basis vectors. In particular, the matrix $U_\Omega$ need not have orthonormal columns.

These definitions extend to the tensor setting with slight modifications. We use the vec operation to unfold a tensor into a single vector and define the inner product $\langle \mathbf{A}, \mathbf{B} \rangle = \langle \mathrm{vec}(\mathbf{A}), \mathrm{vec}(\mathbf{B}) \rangle$. For a subspace of vectorized tensors, we write it as a matrix whose columns are the vectorized basis elements. We can then define projections and subsampling as we did in the vector case.

As in recent work on matrix completion [7, 21], we will require a certain amount of incoherence between the column space associated with $M$ (respectively $\mathbf{T}$) and the standard basis.

Definition 1.

The coherence of a $d$-dimensional subspace $U \subseteq \mathbb{R}^n$ is:

$$\mu(U) := \frac{n}{d} \max_{1 \le i \le n} \|\mathcal{P}_U e_i\|_2^2 \qquad (2)$$

where $e_i$ denotes the $i$th standard basis element.

In previous analyses of matrix completion, the incoherence assumption is that both the row and column spaces of the matrix have coherences upper bounded by a small parameter $\mu_0$. When both spaces are incoherent, each entry of the matrix reveals roughly the same amount of information, so there is little to be gained from adaptive sampling, which typically involves looking for highly informative measurements. Thus the power of adaptivity for these problems should center around relaxing the incoherence assumption, which is the direction we take in this paper. Unfortunately, even under adaptive sampling, it is impossible to identify a rank-one matrix that is zero in all but one entry without observing the entire matrix, implying that we cannot completely eliminate the assumption. Instead, we will retain incoherence on the column space, but remove the restrictions on the row space.
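
For intuition, here is a minimal numpy sketch (our own illustration, not code accompanying the paper) that evaluates Definition 1 for an incoherent and for a maximally coherent subspace:

```python
import numpy as np

def coherence(U):
    """Coherence of the subspace spanned by the orthonormal columns of U (n x d)."""
    n, d = U.shape
    # ||P_U e_i||^2 = ||U.T @ e_i||^2 is the squared norm of the i-th row of U.
    return (n / d) * np.max(np.sum(U ** 2, axis=1))

rng = np.random.default_rng(0)
n, d = 500, 10
U_random, _ = np.linalg.qr(rng.standard_normal((n, d)))  # random subspace: incoherent w.h.p.
U_spiky = np.eye(n)[:, :d]                               # aligned with the standard basis
print(coherence(U_random))  # small, not much larger than the minimum value 1
print(coherence(U_spiky))   # maximal: n / d = 50
```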

4 Exact Completion Problems

In the matrix case, our sequential algorithm builds up the column space of the matrix by selecting a few columns to observe in their entirety. In particular, we maintain a candidate column space and test whether each column lives in this subspace or not, choosing to completely observe the column and add it to the subspace if it does not. Balzano et al. [2] observed that we can perform this test with a subsampled version of the column, meaning that we can recover the column space using few samples. Once we know the column space, recovering the matrix, even from few observations, amounts to solving determined linear systems.
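
A minimal Python sketch of this sequential procedure, assuming a hypothetical `sample_column(j, rows)` oracle that returns the requested entries of column $j$ (the oracle, the tolerance, and the parameter choices are ours, for illustration; this is not the authors' implementation):

```python
import numpy as np

def complete_matrix(sample_column, n1, n2, m, tol=1e-9, seed=0):
    """Noiseless sequential completion: build the column space adaptively.

    sample_column(j, rows) returns the entries M[rows, j]; rows=None requests the
    full column. m is the number of entries used for each membership test.
    """
    rng = np.random.default_rng(seed)
    U = np.zeros((n1, 0))                 # orthonormal basis for the candidate column space
    M_hat = np.zeros((n1, n2))
    for j in range(n2):
        rows = rng.choice(n1, size=m, replace=True)      # uniform with replacement
        y = sample_column(j, rows)
        U_om = U[rows, :]
        coef = np.linalg.lstsq(U_om, y, rcond=None)[0] if U.shape[1] else np.zeros(0)
        if np.linalg.norm(y - U_om @ coef) > tol:        # informative: observe fully, extend basis
            col = sample_column(j, None)
            new_dir = col - U @ (U.T @ col)
            U = np.hstack([U, (new_dir / np.linalg.norm(new_dir))[:, None]])
            M_hat[:, j] = col
        else:                                            # column lies in span(U): solve for it
            M_hat[:, j] = U @ coef
    return M_hat

# Synthetic example: rank-5 matrix with an incoherent column space but a coherent row space.
rng = np.random.default_rng(1)
n1, n2, r = 200, 200, 5
U0, _ = np.linalg.qr(rng.standard_normal((n1, r)))
W = 0.1 * rng.standard_normal((n2, r))
W[:r, :] += 10 * np.eye(r)                               # energy concentrated on a few rows
M = U0 @ W.T
M_hat = complete_matrix(lambda j, rows: M[:, j] if rows is None else M[rows, j], n1, n2, m=40)
print(np.linalg.norm(M - M_hat) / np.linalg.norm(M))     # ~0 up to numerical error
```

The synthetic example deliberately has a coherent row space; the procedure is unaffected because it only ever tests and completes columns.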

For tensors, the algorithm becomes recursive in nature. At the outer level of the recursion, the algorithm maintains a candidate subspace spanned by the (vectorized) mode-$T$ subtensors completed so far. For each mode-$T$ subtensor, we test whether it lives in this subspace and recursively complete that subtensor if it does not. Once we complete the subtensor, we add it to the subspace and proceed at the outer level. When a subtensor is itself just a column, we observe it in its entirety.

  1. Let $\hat{U} = \emptyset$.

  2. Randomly draw an index set $\Omega$ over the entries of a mode-$T$ subtensor, uniformly with replacement, with probability $p$.

  3. For each mode-$T$ subtensor $\mathbf{T}_i$ of $\mathbf{T}$, $i = 1, \ldots, n_T$:

    1. If $\|\mathrm{vec}(\mathbf{T}_i)_{\Omega} - \mathcal{P}_{\hat{U}_\Omega}\, \mathrm{vec}(\mathbf{T}_i)_{\Omega}\| > 0$:

      1. Recursively complete $\mathbf{T}_i$ (recurse on $\mathbf{T}_i$ with the sample sizes for order $T-1$; if $\mathbf{T}_i$ is a vector, observe it in full), obtaining $\hat{\mathbf{T}}_i$.

      2. Add $\mathrm{vec}(\hat{\mathbf{T}}_i)$ to $\hat{U}$ and orthogonalize $\hat{U}$.

    2. Otherwise set $\hat{\mathbf{T}}_i = \mathrm{unvec}\big(\hat{U}(\hat{U}_\Omega^{\top}\hat{U}_\Omega)^{-1}\hat{U}_\Omega^{\top}\, \mathrm{vec}(\mathbf{T}_i)_{\Omega}\big)$.

  4. Return $\hat{\mathbf{T}}$ with mode-$T$ subtensors $\hat{\mathbf{T}}_1, \ldots, \hat{\mathbf{T}}_{n_T}$.

Algorithm 1 Sequential Tensor Completion
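
For concreteness, a compact Python sketch of the recursion in Algorithm 1, in the same spirit as the matrix sketch above (the `entry` oracle, the single per-level test size `m`, and the tolerance are illustrative assumptions, not the paper's code):

```python
import numpy as np

def complete_tensor(entry, shape, m, tol=1e-9, rng=None):
    """Recursive sketch of Algorithm 1.

    entry(idx) returns the tensor entry at the index tuple idx; m is the number of
    test entries drawn per membership test at every level of the recursion.
    """
    rng = np.random.default_rng(rng)
    if len(shape) == 1:                    # base case: a column (vector) is observed in full
        return np.array([entry((i,)) for i in range(shape[0])])
    lead, n_last = tuple(shape[:-1]), shape[-1]
    D = int(np.prod(lead))
    U = np.zeros((D, 0))                   # basis for the vectorized subtensors completed so far
    slices = []
    for i in range(n_last):
        idx = rng.choice(D, size=m, replace=True)
        y = np.array([entry(np.unravel_index(j, lead) + (i,)) for j in idx])
        U_om = U[idx, :]
        coef = np.linalg.lstsq(U_om, y, rcond=None)[0] if U.shape[1] else np.zeros(0)
        if np.linalg.norm(y - U_om @ coef) > tol:        # informative: recurse, extend the basis
            sub = complete_tensor(lambda t, i=i: entry(t + (i,)), lead, m, tol, rng)
            v = sub.reshape(-1) - U @ (U.T @ sub.reshape(-1))
            U = np.hstack([U, (v / np.linalg.norm(v))[:, None]])
            slices.append(sub)
        else:                                            # redundant: reconstruct from test entries
            slices.append((U @ coef).reshape(lead))
    return np.stack(slices, axis=-1)

# Example: exact recovery of a 30 x 30 x 30 rank-2 tensor.
rng = np.random.default_rng(0)
a, b, c = (rng.standard_normal((2, 30)) for _ in range(3))
T = np.einsum('ki,kj,kl->ijl', a, b, c)
T_hat = complete_tensor(lambda idx: T[idx], (30, 30, 30), m=25)
print(np.linalg.norm(T - T_hat) / np.linalg.norm(T))     # ~0 up to numerical error
```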

The pseudocode of the algorithm is given in Algorithm 1. Our first main result characterizes the performance of the tensor completion algorithm. We defer the proof to the appendix.

Theorem 2.

Let $\mathbf{T}$ be a rank-$r$, order-$T$ tensor whose mode subspaces each have coherence bounded above by $\mu_0$. With the per-level sample sizes set appropriately for each mode, Algorithm 1 exactly recovers $\mathbf{T}$ with probability at least $1 - \delta$ and has expected sample complexity

(3)

In the special case of an order-$3$ tensor with all dimensions equal to $n$, the algorithm succeeds with high probability using a number of samples that is linear in $n$, exhibiting a linear dependence on the tensor dimensions. In comparison, the only guarantee we are aware of shows that a number of samples with a much worse dependence on the tensor dimension is sufficient for consistent estimation of a noisy tensor [23]. In the noiseless scenario, one can unfold the tensor into a matrix and apply any matrix completion algorithm. Unfortunately, without exploiting the additional tensor structure, this approach will scale with the product of all but one of the tensor dimensions, which is similarly much worse than our guarantee. Note that the naïve procedure that does not perform the recursive step has sample complexity scaling with the product of the dimensions and is therefore much worse than our algorithm.

The most obvious specialization of Theorem 2 is to the matrix completion problem:

Corollary 3.

Let $M \in \mathbb{R}^{n_1 \times n_2}$ have rank $r$ and a column space with coherence at most $\mu_0$, and fix $\delta > 0$. With the number of samples per column set appropriately, the sequential algorithm exactly recovers $M$ with probability at least $1 - \delta$ while using in expectation

(4)

observations. The algorithm runs in time linear in the matrix dimensions, up to polynomial factors in $r$.

A few comments are in order. Recht [21] guaranteed exact recovery for the nuclear norm minimization procedure as long as the number of observations exceeds $C \max(\mu_1^2, \mu_0)\, r (n_1 + n_2)\, \beta \log^2(2 n_2)$, where $\beta$ controls the probability of failure and $\mu_1$ is another coherence parameter. Without additional assumptions, $\mu_1$ can be as large as $\mu_0 \sqrt{r}$. In this case, our bound improves on his in its dependence on $r$ and on the logarithmic terms.

The Nyström method can also be applied to the matrix completion problem, albeit under non-uniform sampling. Given a PSD matrix, one uses a randomly sampled set of columns and the corresponding rows to approximate the remaining entries. Gittens showed that sampling on the order of $r \log r$ columns suffices to exactly reconstruct a rank-$r$ matrix [12]. This result requires incoherence of both row and column spaces, so it is more restrictive than ours. Almost all previous results for exact matrix completion require incoherence of both row and column spaces.

The one exception is a recent paper by Chen et al. that we became aware of while preparing the final version of this work [8]. They show that sampling the matrix according to the statistical leverages of the rows and columns can eliminate the need for incoherence assumptions. Specifically, when the matrix has an incoherent column space, they show that by first estimating the leverages of the columns, sampling the matrix according to this distribution, and then solving the nuclear norm minimization program, one can recover the matrix with roughly $n r \log^2 n$ samples (up to coherence factors). Our result improves on theirs when the rank is small compared to the dimension, which is the common regime.

Our algorithm is also very computationally efficient. Existing algorithms involve successive singular value decompositions (costing $O(n_1 n_2 r)$ per iteration), resulting in much worse running times.

The key ingredient in our proofs is a result pertaining to subspace detection, the task of testing if a subsampled vector lies in a subspace. This result, which improves over the results of Balzano et al. [2], is crucial in obtaining our sample complexity bounds, and may be of independent interest.

Theorem 4.

Let $U$ be a $d$-dimensional subspace of $\mathbb{R}^n$ and let $y = x + v$ where $x \in U$ and $v \in U^{\perp}$. Fix $\delta > 0$, and let $\Omega$ be an index set with entries sampled uniformly with replacement with probability $m/n$. Then, with probability at least $1 - c\,\delta$ for an absolute constant $c$:

(5)

where $\alpha$, $\beta$, and $\gamma$ are explicit quantities, depending on $m$, $d$, $\delta$, and the coherences of $U$ and $v$, that vanish as $m$ grows.

This theorem shows that a number of samples $m$ scaling (up to logarithmic factors) with $d$ times the coherence parameters suffices for the orthogonal projection from missing data to be within a constant factor of the fully observed one. In contrast, Balzano et al. [2] give a similar result that requires a larger number of samples to obtain a constant-factor approximation. In the matrix case, this improved dependence on the incoherence parameters brings our sample complexity down from scaling with $r^2$ to scaling with $r^{3/2}$. We conjecture that this theorem can be further improved to eliminate another $\sqrt{r}$ factor from our final bound.
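
Concretely, in the notation of Section 3 (with $y = x + v$, $x \in U$, $v \in U^{\perp}$), the projection from missing data referred to here is

$$\mathcal{P}_{U_\Omega}\, y_\Omega \;=\; U_\Omega \big(U_\Omega^{\top} U_\Omega\big)^{-1} U_\Omega^{\top}\, y_\Omega,$$

and the quantity bounded in Equation 5 is the residual $\|y_\Omega - \mathcal{P}_{U_\Omega} y_\Omega\|_2^2$: it is zero whenever $y \in U$ (and $U_\Omega$ has full column rank), and otherwise it is, after rescaling by the sampling rate $m/n$, within a constant factor of the fully observed residual $\|v\|_2^2$.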

4.1 Lower Bounds for Uniform Sampling

We adapt the proof strategy of Candes and Tao [7] to the tensor completion problem and establish the following lower bound for uniform sampling:

Theorem 5 (Passive Lower Bound).

Fix $\delta > 0$ and $\mu_0 \ge 1$. Fix the dimensions and the rank $r$, and suppose that we do not have the condition:

(6)

Then there exist infinitely many pairs of distinct order-$T$ tensors of rank $r$ with coherence parameter bounded by $\mu_0$ such that, with probability at least $\delta$, the pair agrees on all of the observed entries. Each entry is observed independently with probability $p$.

Theorem 5 implies that as long as the right-hand side of Equation 6 is bounded by a constant, and:

(7)

then with the stated probability there are infinitely many tensors that agree on the observed entries. This gives a necessary condition on the number of samples required for tensor completion. Note that when $T = 2$ we recover the known lower bound for matrix completion.

Theorem 5 gives a necessary condition under uniform sampling. Comparing with Theorem 2 shows that our procedure outperforms any passive procedure in its dependence on the tensor dimensions. However, our guarantee is suboptimal in its dependence on the rank $r$. The extra $\sqrt{r}$ factor would be eliminated by a further improvement to our subspace detection result (Theorem 4), which we conjecture is indeed possible.

For adaptive sampling, one can obtain a lower bound via a parameter counting argument. Observing the $(i,j)$th entry leads to a polynomial equation of the form $M_{ij} = \sum_{k=1}^{r} U_{ik} V_{jk}$ in the entries of the factors. If the number of observations is smaller than the number of degrees of freedom of a rank-$r$ matrix, this system is underdetermined, showing that $\Omega(r(n_1 + n_2))$ observations are necessary for exact recovery, even under adaptive sampling. Thus, our algorithm enjoys sample complexity with optimal dependence on the matrix dimensions.
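
To make the counting explicit (a standard degrees-of-freedom calculation, stated with generic factor notation): writing a rank-$r$ matrix as $M = UV^{\top}$ with $U \in \mathbb{R}^{n_1 \times r}$ and $V \in \mathbb{R}^{n_2 \times r}$ gives $r(n_1 + n_2)$ parameters, but the factorization is invariant under $(U, V) \mapsto (UA, VA^{-\top})$ for any invertible $A \in \mathbb{R}^{r \times r}$, which removes $r^2$ of them:

$$\#\{\text{degrees of freedom}\} = r(n_1 + n_2) - r^2 = r(n_1 + n_2 - r).$$

Any sampling scheme, adaptive or not, that collects fewer than $r(n_1 + n_2 - r)$ entries therefore leaves the polynomial system above underdetermined.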

5 Noisy Matrix Completion

Our algorithm for noisy matrix completion is an adaptation of the column subset selection (CSS) algorithm analyzed by Deshpande et al. [10]. The algorithm builds a candidate column space in rounds; at each round it samples additional columns with probability proportional to the energy of their projections onto the orthogonal complement of the candidate column space.

To concretely describe the algorithm, suppose that at the beginning of a round we have a candidate subspace. Then, in that round, we draw additional columns according to the distribution in which the probability of drawing a given column is proportional to the squared norm of its residual after projecting onto the current candidate subspace. Observing these columns in full and then adding them to the subspace gives the candidate subspace for the next round. We initialize the algorithm with the empty subspace. After the final round, we approximate each column by the subsampled projection of its observed entries onto the learned subspace and concatenate these estimates to form the final estimate.

The challenge is that the algorithm cannot compute the sampling probabilities without observing entries of the matrix. However, our results show that with reliable estimates, which can be computed from few observations, the algorithm still performs well.
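
A minimal Python sketch of this loop, with the residual norms estimated from a few uniformly sampled entries per column (the `sample_column` oracle, the parameter names, and the simple rescaled estimator are our own illustrative choices, not the paper's reference implementation):

```python
import numpy as np

def adaptive_css(sample_column, n1, n2, n_rounds, s, m, seed=0):
    """Noisy completion sketch: grow a column space over n_rounds rounds, drawing s full
    columns per round with probability proportional to their estimated residual energy.
    Residual norms are estimated from m uniformly sampled entries per column; m should
    exceed the final subspace dimension n_rounds * s."""
    rng = np.random.default_rng(seed)
    U = np.zeros((n1, 0))                 # orthonormal basis of the candidate column space

    def subsampled_fit(j):
        """Fit column j to U from m random entries; return (coefficients, residual-energy estimate)."""
        rows = rng.choice(n1, size=m, replace=True)
        y = sample_column(j, rows)
        U_om = U[rows, :]
        coef = np.linalg.lstsq(U_om, y, rcond=None)[0] if U.shape[1] else np.zeros(0)
        return coef, (n1 / m) * np.linalg.norm(y - U_om @ coef) ** 2

    for _ in range(n_rounds):
        scores = np.array([subsampled_fit(j)[1] for j in range(n2)])
        probs = scores / scores.sum()
        chosen = rng.choice(n2, size=s, replace=True, p=probs)   # energy-weighted column draws
        cols = np.column_stack([sample_column(j, None) for j in chosen])
        resid = cols - U @ (U.T @ cols)                          # keep only the new directions
        Q, _ = np.linalg.qr(resid)
        U = np.hstack([U, Q])
    # Final estimate: project each column's observed entries onto the learned subspace.
    M_hat = np.empty((n1, n2))
    for j in range(n2):
        coef, _ = subsampled_fit(j)
        M_hat[:, j] = U @ coef
    return M_hat

# Usage on a noisy low-rank matrix with illustrative parameters.
rng = np.random.default_rng(1)
n1, n2, r = 200, 200, 5
A = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
M = A + 0.01 * rng.standard_normal((n1, n2))
M_hat = adaptive_css(lambda j, rows: M[:, j] if rows is None else M[rows, j],
                     n1, n2, n_rounds=3, s=5, m=60)
print(np.linalg.norm(A - M_hat) / np.linalg.norm(A))             # small relative error
```

In the noiseless limit this reduces to norm sampling in the first round and residual sampling thereafter; with noise, the estimated scores are only approximately proportional to the true residual energies, which is exactly the perturbation the analysis must tolerate.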

We assume that the observed matrix can be decomposed as $M = A + R$, where $A$ is a rank-$r$ matrix and $R$ is a random Gaussian matrix whose entries are independently drawn from $\mathcal{N}(0, \sigma^2)$, and that the column space of $A$ is incoherent. As before, the incoherence assumption is crucial in guaranteeing that one can estimate the column norms, and consequently the sampling probabilities, from missing data.

Theorem 6.

Let $\Omega$ be the set of all observations made over the course of the algorithm, let $\tilde{U}$ be the subspace obtained after the final round, and let $\hat{M}$ be the matrix whose columns are the subsampled projections of the observed columns onto $\tilde{U}$. Then there are constants such that, with high probability, the reconstruction error of $\hat{M}$ is controlled, and $\hat{M}$ can be computed from the observations in $\Omega$ alone. In particular, if the per-column sample size is large enough and the noise variance is small enough, then there is a constant for which the estimator is consistent.

The main improvement in this result is in relaxing the assumptions on the underlying matrix $A$. Existing results for noisy matrix completion require that the energy of the matrix is well spread out across both the rows and the columns (i.e. incoherence), and the sample complexity guarantees deteriorate significantly without such an assumption [5, 17]. As a concrete example, Negahban and Wainwright [20] use a notion of spikiness, measured as $\sqrt{n_1 n_2}\,\|M\|_{\infty} / \|M\|_F$, which can be as large as $\sqrt{n_2}$ in our setup, e.g. when the matrix is zero except on one column and constant across that column.

Rescaling the noise variance appropriately enables us to compare our results with related work [20]. Thinking of the rank and the incoherence parameter as constants, our results imply consistent estimation as long as the noise variance does not grow too quickly with the dimension. On the other hand, thinking of the spikiness parameter as a constant, [20] show that the error is bounded by a quantity depending on the total number of observations; using the same number of samples as our procedure, their result implies consistency under a stricter condition on the noise variance. For small rank, our noise tolerance is much better, but their results apply even with fewer observations, while ours do not.

6 Simulations

Figure 1: Probability of success curves for our noiseless matrix completion algorithm (top and bottom rows) and for SVT (middle row). Top row: success probability as a function of (left) the fraction of samples per column, (center) the total number of samples per column, and (right) the samples per column rescaled by the logarithmic factor of the passive bound. Bottom row: success probability of our noiseless algorithm for different ranks as a function of the fraction of samples per column (left) and of that fraction rescaled according to our theorem (middle) and to our conjecture (right).

We verify Corollary 3’s linear dependence on the dimension $n$ in Figure 1, where we empirically compute the success probability of the algorithm for varying values of $n$ and of the fraction of entries observed per column. Here we study square matrices of fixed rank. The top-left panel shows that our algorithm succeeds while sampling a smaller and smaller fraction of entries as $n$ increases, as we expect from Corollary 3. In the top-center panel, we instead plot success probability against the total number of observations per column. The fact that the curves coincide suggests that the number of samples per column is constant with respect to $n$, which is precisely what Corollary 3 implies. Finally, in the top-right panel, we rescale instead by the logarithmic factor corresponding to the passive sample complexity bound [21]. Empirically, the fact that these curves do not line up demonstrates that our algorithm requires fewer samples per column than the passive bound.

The second row of Figure 1 plots the same probability of success curves for the Singular Value Thresholding (SVT) algorithm [4]. As is apparent from the plots, SVT does not enjoy a linear dependence on $n$; indeed, the rescaled curves confirm the additional logarithmic dependence that we expect for passive matrix completion and establish that our algorithm has empirically better performance.

In the third row, we study the algorithm’s dependence on the rank $r$ on square matrices. In the left panel we plot the probability of success of the algorithm as a function of the sampling probability for matrices of various ranks, and observe that the sample complexity increases with $r$. In the middle panel we rescale the $x$-axis by $r^{3/2}$, so that if our theorem is tight the curves should coincide. In the right panel we instead rescale the $x$-axis by $r$, corresponding to our conjecture about the performance of the algorithm. Indeed, the curves line up in the right panel, demonstrating that, empirically, the number of samples needed per column is linear in $r$ rather than the $r^{3/2}$ dependence in our theorem.


Figure 3: Reconstruction error as a function of row space incoherence for our noisy algorithm (CSS) and the semidefinite program of [20].
Table 3: Computational results on large low-rank matrices. Here $d = r(n_1 + n_2 - r)$ is the number of degrees of freedom, so $|\Omega|/d$ is the oversampling ratio.

To confirm the computational improvement over existing methods, we ran our matrix completion algorithm on large-scale matrices, recording the running time and error in Table 3. To contrast with SVT, we refer the reader to Table 5.1 in [4]. As an example, recovering a large square matrix of moderate rank takes close to 2 hours with SVT, while it takes less than 5 minutes with our algorithm.

For the noisy algorithm, we study the dependence on row-space incoherence. In Figure 3, we plot the reconstruction error as a function of the row-space coherence for our procedure and for the semidefinite program of Negahban and Wainwright [20], where we ensure that both algorithms use the same number of observations. It is readily apparent that the SDP decays in performance as the row space becomes more coherent, while the performance of our procedure is unaffected.
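
One simple way to generate such test instances, with an incoherent column space and a tunable row-space coherence, is sketched below (our own construction for illustration; the paper's exact experimental setup may differ):

```python
import numpy as np

def synthetic_low_rank(n1, n2, r, spike_weight, seed=0):
    """Rank-r matrix with an incoherent column space and a row-space coherence
    controlled by spike_weight in [0, 1] (0 = incoherent rows, 1 = maximally coherent)."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n1, r)))       # incoherent column space w.h.p.
    V_flat, _ = np.linalg.qr(rng.standard_normal((n2, r)))  # incoherent row basis
    V_spiky = np.eye(n2)[:, :r]                             # maximally coherent row basis
    V, _ = np.linalg.qr((1 - spike_weight) * V_flat + spike_weight * V_spiky)
    row_coherence = (n2 / r) * np.max(np.sum(V ** 2, axis=1))
    return U @ V.T, row_coherence

M, mu_row = synthetic_low_rank(n1=300, n2=300, r=5, spike_weight=0.9)
print(mu_row)   # large: the row space is highly coherent
```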

7 Conclusions and Open Problems

In this work, we demonstrate how sequential active algorithms can offer significant improvements in time and measurement overhead over passive algorithms for matrix and tensor completion. We hope our work motivates further study of sequential active algorithms for machine learning.

Several interesting theoretical questions arise from our work:

  1. Can we tighten the dependence on rank for these problems? In particular, can we bring the dependence on $r$ down from $r^{3/2}$ to linear? Simulations suggest this is possible.

  2. Can one generalize the nuclear norm minimization program for matrix completion to the tensor completion setting while providing theoretical guarantees on sample complexity?

We hope to pursue these directions in future work.

References

  • [1] Dimitris Achlioptas and Frank Mcsherry. Fast computation of low-rank matrix approximations. Journal of the ACM (JACM), 54(2):9, 2007.
  • [2] Laura Balzano, Benjamin Recht, and Robert Nowak. High-dimensional matched subspace detection when data are missing. In Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, pages 1638–1642. IEEE, 2010.
  • [3] Christos Boutsidis, Michael W Mahoney, and Petros Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 968–977. Society for Industrial and Applied Mathematics, 2009.
  • [4] Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
  • [5] Emmanuel J Candes and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
  • [6] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
  • [7] Emmanuel J Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. Information Theory, IEEE Transactions on, 56(5):2053–2080, 2010.
  • [8] Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, and Rachel Ward. Coherent matrix completion. arXiv preprint arXiv:1306.2979, 2013.
  • [9] Mark A Davenport and Ery Arias-Castro. Compressive binary search. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pages 1827–1831. IEEE, 2012.
  • [10] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. Theory of Computing, 2:225–247, 2006.
  • [11] Silvia Gandy, Benjamin Recht, and Isao Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2):025010, 2011.
  • [12] Alex Gittens. The spectral norm error of the naive nystrom extension. arXiv preprint arXiv:1110.5305, 2011.
  • [13] David Gross. Recovering low-rank matrices from few coefficients in any basis. Information Theory, IEEE Transactions on, 57(3):1548–1566, 2011.
  • [14] Venkatesan Guruswami and Ali Kemal Sinop. Optimal column-based low-rank matrix reconstruction. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1207–1214. SIAM, 2012.
  • [15] Jarvis D Haupt, Richard G Baraniuk, Rui M Castro, and Robert D Nowak. Compressive distilled sensing: Sparse recovery using adaptivity in compressive measurements. In Signals, Systems and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on, pages 1551–1555. IEEE, 2009.
  • [16] Jun He, Laura Balzano, and John Lui. Online robust subspace tracking from partial information. arXiv preprint arXiv:1109.3827, 2011.
  • [17] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. The Journal of Machine Learning Research, 11:2057–2078, 2010.
  • [18] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the Nyström method. The Journal of Machine Learning Research, 13:981–1006, 2012.
  • [19] Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.
  • [20] Sahand Negahban and Martin J Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13:1665–1697, 2012.
  • [21] Benjamin Recht. A simpler approach to matrix completion. The Journal of Machine Learning Research, 12:3413–3430, 2011.
  • [22] Ryota Tomioka, Kohei Hayashi, and Hisashi Kashima. Estimation of low-rank tensors via convex optimization. arXiv preprint arXiv:1010.0789, 2010.
  • [23] Ryota Tomioka, Taiji Suzuki, Kohei Hayashi, and Hisashi Kashima. Statistical performance of convex tensor decomposition. In Advances in Neural Information Processing Systems, pages 972–980, 2011.
  • [24] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

Appendix A Proof of Corollary 3

Corollary 3 is considerably simpler to prove than Theorem 2, so we prove the former in its entirety before proceeding to the latter. To simplify the presentation, a number of technical lemmas regarding incoherence and concentration of measure are deferred to sections E and F, respectively.

The proof begins by ensuring that, for every column, if the column does not lie in the current candidate subspace then the projection test detects this with high probability. This property is established in the following lemma:

Lemma 7.

Suppose that the candidate subspace is contained in the column space of $M$ and consider a column of $M$ that does not lie in this subspace. If the per-column sample size is sufficiently large, then with high probability the subsampled projection residual of that column is strictly positive, so the test identifies the column as informative.

Proof of Lemma 7.

Decompose the column as $y = x + v$ where $x$ lies in the candidate subspace and $v$ lies in its orthogonal complement. We can immediately apply Theorem 4 and are left to verify that the left-hand side of Equation 5 is strictly positive. Since the column does not lie in the candidate subspace, we know that $\|v\| > 0$. Then:

When . Here we used that since . For :

Whenever . Finally, with the bounds on and , the expression in Equation 5 is strictly positive when since . Plugging in the definition of we require:

Which certainly holds when , concluding the proof. ∎

It is easy to see that if a column lies in the candidate subspace then the projection residual is deterministically zero, and our algorithm does not further sample that column. We must verify that these columns can be recovered exactly, and this amounts to checking that the subsampled basis matrix has full column rank (equivalently, that the corresponding Gram matrix is invertible). Fortunately, this was established as a lemma in [2], and in fact the failure probability is subsumed by the probability in Theorem 4. Now we argue for correctness: there can be at most $r$ columns with nonzero residual, since the matrix has rank $r$. For each of these columns, from Lemma 7, we know that the test succeeds with high probability. By a union bound, all of these tests succeed with high probability, so the subspace at the end of the algorithm is exactly the column space of $M$. All of these columns are recovered exactly, since we completely sample them.

The probability that the Gram matrices are invertible is subsumed by the success probability of Theorem 4, except for the last matrix. In other words, the success of the projection test depends on the invertibility of these matrices, so the fact that we recovered the column space implies that these matrices were invertible. The last matrix is invertible except with the small probability quantified in Lemma 18.

If these matrices are invertible, then, since each remaining column lies in the recovered column space, we can write it as a linear combination of the recovered basis and solve for the coefficients from its observed entries:

So these columns are all recovered exactly. This step adds only a lower-order contribution to the failure probability, leading to the final term in the failure probability of the theorem.

For the running time, the dominating per-column computational costs are the projection and the reconstruction procedure. The projection involves several matrix multiplications and the inversion of a small Gram matrix, which need not be recomputed on every iteration. Ignoring the matrix inversion, this procedure takes time linear in the per-column sample size (up to polynomial factors in $r$), for a total running time linear in $n_2$. At most $r$ times, we must recompute the inverted Gram matrix, which contributes an additional lower-order term to the total running time. Finally, we run the Gram-Schmidt process once over the course of the algorithm, which takes time proportional to $n_1 r^2$; this last cost is dominated by the others.

Appendix B Proof of Theorem 2

We first focus on the recovery of the tensor as a whole, expressing the result in terms of the failure probabilities of the recursive calls. Then we inductively bound the failure probability of the entire algorithm. Finally, we compute the total number of observations. For now, we track the failure probability of recovering an order-$t$ tensor.

By Lemma 13, the subspace spanned by the mode-$T$ subtensors has bounded incoherence and rank at most $r$, and each slice likewise has bounded incoherence. By the same argument as Lemma 7, provided enough test entries are drawn, the projection test succeeds in identifying informative subtensors (those not in our current basis) with high probability. With a union bound over these subtensors, the failure probability at this level is controlled, not counting the probability that we fail in recovering the informative subtensors themselves, which is handled recursively.

For each order-$(T-1)$ tensor that we have to recover, the subspace of interest again has bounded incoherence, and with high probability we correctly identify each informative subtensor as long as enough test entries are drawn at that level. Again the failure probability is bounded by a union bound.

To compute the total failure probability we proceed inductively. The base case incurs no failure, since we completely observe any one-mode tensor (vector). The recurrence relation is:

(8)

which solves to:

(9)

We also compute the sample complexity inductively. Consider the expected number of samples needed to complete an order-$t$ tensor. For a vector this is just its length, and:

(10)

so that the expected sample complexity of completing the full tensor is upper bounded as:

The running time is computed in a similar way to the matrix case. Assume inductively that the running time to complete a lower-order tensor is:

Note that this is exactly the running time of our algorithm in the matrix case.

Per order-$(T-1)$ subtensor, the projection and reconstruction take time bounded as in the matrix case, which in total contributes a term proportional to $n_T$. At most $r$ times, we must complete an order-$(T-1)$ subtensor and invert the Gram matrix of the subsampled basis. These two together take in total:

Finally, the cost of the Gram-Schmidt process is dominated by the other costs. In total the running time is:

since . Now plugging in that , the terms in the second sum are each meaning that the sum is . This gives the computational result.

Appendix C Proof of Theorem 6

We will first prove a more general result and obtain Theorem 6 as a simple consequence.

Theorem 8.

Let where and . Let denote the best rank approximation to . Assume that is rank and . For every sample a set of size at each of the rounds of the algorithm and compute as prescribed. Then with probability :

and the algorithm has expected sample complexity:

The proof of this result involves some modifications to the analysis in [10]. We will follow their proof, allowing for some error in the sampling probabilities, and arrive at a recovery guarantee. Then we will show how these sampling probabilities can be well-approximated from limited observations.

The first lemma analyzes a single round of the algorithm, while the second gives an induction argument to chain the first across all of the rounds. These are extensions of Theorems 2.1 and 1.2, respectively, from [10].

Lemma 9.

Let and let be a subspace of . Let and let be a random sample of columns of , sampled according to the distribution with:

Then with probability we have:

Where denotes a projection on to the best -dimensional subspace of and is the best rank approximation to .

Proof.

The proof closely mirrors that of Theorem 2.1 in [10]. The main difference is that we are using an estimate of the correct distribution, and this will result in some additional error.

For completeness we provide the proof here. We number the left (respectively right) singular vectors of as () and use subscripts to denote columns. We will construct vectors and use them to upper bound the projection. In particular we have:

so we can exclusively work with .

For each and for each define:

That is the th column of the residual , scaled by the th entry of the th right singular vector, and the sampling probability. Defining , we see that:

Defining and using the definition of , it is easy to verify that . It is also easy to see that .

We will now proceed to bound the second central moment of .

The first term can be expanded as:

So that the second central moment is:

Now we use the probabilities to evaluate each term in the summation:

This gives us an upper bound on the second central moment:

To complete the proof, let and define the matrix . Since , the column space of is contained in so .