Noisy Tensor Completion for Tensors with a Sparse Canonical Polyadic Factor

Abstract

In this paper we study the problem of noisy tensor completion for tensors that admit a canonical polyadic or CANDECOMP/PARAFAC (CP) decomposition with one of the factors being sparse. We present general theoretical error bounds for an estimate obtained by using a complexity-regularized maximum likelihood principle and then instantiate these bounds for the case of additive white Gaussian noise. We also provide an ADMM-type algorithm for solving the complexity-regularized maximum likelihood problem and validate the theoretical findings via experiments on a synthetic data set.

Keywords: Tensor decomposition, noisy tensor completion, complexity-regularized maximum likelihood estimation, sparse CP decomposition, sparse factor models.

1 Introduction

The last decade has seen enormous progress in both the theory and practical solutions to the problem of matrix completion, in which the goal is to estimate missing elements of a matrix given measurements at some subset of its locations. Originally viewed from a combinatorial perspective [1], it is now usually approached from a statistical perspective in which additional structural assumptions (e.g., low-rank, sparse factors, etc.) not only make the problem tractable but also allow for provable error bounds from noisy measurements [2]. Tensors, which we will view as multi-way arrays, naturally arise in a slew of practical applications in the areas of signal processing, computer vision, neuroscience, etc. [9]. Often in practice tensor data is collected in a noisy environment and suffers from missing observations. Given the success of matrix completion methods, it is no surprise that there has recently been a lot of interest in extending these successes to the tensor completion problem [11].

In this work we consider the general problem of tensor completion. Let be the tensor we wish to estimate and suppose we collect noisy measurements at a subset of its locations . The goal of the tensor completion problem is to estimate the tensor from the noisy observations . This problem is naturally ill-posed without further assumptions on the tensor we wish to estimate. We focus on structured tensors that admit a “sparse CP decomposition,” by which we mean that one of the canonical polyadic or CANDECOMP/PARAFAC (CP) factors (defined in the notation section below) is sparse. Tensors admitting such structure arise in many applications involving electroencephalography (EEG) data, neuroimaging using functional magnetic resonance imaging (fMRI), and many others [14].

1.1 Our Contributions

Our main contribution is encapsulated by Theorem ?, which provides general estimation error bounds for noisy tensor completion via complexity-regularized maximum likelihood estimation [18] for tensors fitting our data model. This theorem can be instantiated for specific noise distributions of interest, which we do for the case when the observations are corrupted with additive white Gaussian noise. We also provide a general ADMM-type algorithm which solves an approximation to the problem of interest, and we present numerical evidence validating the statistical convergence rates predicted by Theorem ?.

1.2 Relation with existing works

A common theme of recent tensor completion works is adapting the tools that have been effective in tackling the matrix completion problem so that they apply to tensors. For example, one could apply matrix completion results to tensors directly by matricizing the tensor along its various modes and minimizing the sum or a weighted sum of the nuclear norms of the matricizations as a convex proxy for tensor rank [20]. Since the nuclear norm is computationally intractable for large-scale data, matrix completion via alternating minimization was extended to tensors in [23].

In contrast to these works, in this paper we consider the noisy completion of tensors that admit a CP decomposition with one of the factors being sparse. Recently, the completion of tensors under this model was exploited in the context of time series prediction from incomplete EEG data [13]. Our work is focused on providing recovery guarantees and a general algorithmic framework; it draws inspiration from recent work on noisy matrix completion under a sparse factor model [8] and extends it to tensors with a sparse CP factor.

1.3 Outline

After an overview of the notation used in this paper, in Section 2 we present the problem setup. In Section 3 we present our main theorem and instantiate it for the case of Gaussian noise. In Section 4 we provide the algorithmic framework for solving the complexity-regularized maximum likelihood estimation problem. Numerical experiments are provided in Section 6, followed by a brief discussion and future research directions in Section 7.

1.4 Notation

Given two continuous random variables and defined on the same probability space and with absolutely continuous with respect to , we define the Kullback-Leibler divergence (KL-divergence) of from to be

If is not absolutely continuous with respect to , then define . The Hellinger affinity of two distributions is similarly defined by
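For concreteness, the two definitions above take the following standard forms (a hedged reconstruction; the density symbols p and q are assumed here, since the original notation is not recoverable from the text):

\[
D(p\|q) \;=\; \int p(y)\,\log\frac{p(y)}{q(y)}\,dy,
\qquad
A(p,q) \;=\; \int \sqrt{p(y)\,q(y)}\,dy .
\]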

We will denote vectors with lower-case letters, matrices with upper-case letters, and tensors with underlined upper-case letters (e.g., and , respectively). Furthermore, for any vector (or matrix) we define to be the number of non-zero elements of and to denote the maximum absolute value of . Note that is not the induced norm of the matrix . Entry of tensor will be denoted by . For a tensor we define its Frobenius norm in analogy with the matrix case as the two norm of its vectorization, and its maximum absolute entry as . Finally, we define the canonical polyadic or CANDECOMP/PARAFAC (CP) decomposition of a tensor to be a representation

where and are the columns of and , respectively, denotes the tensor outer product such that , and is shorthand notation for in terms of its CP factors. The parameter is an upper bound on the rank of (we refer the reader to [10] for a comprehensive overview of tensor decompositions and their uses). For a given tensor and CP decomposition, we define as the maximum dimension of its CP factors and as the number of latent factors.
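As a concrete illustration of the CP model, the following NumPy sketch builds a three-way tensor as a sum of rank-one outer products of the factor columns. The names A, B, C and the example dimensions are hypothetical choices for illustration only.

import numpy as np

def cp_to_tensor(A, B, C):
    # Sum of F rank-one terms: X[i, j, k] = sum_f A[i, f] * B[j, f] * C[k, f].
    return np.einsum('if,jf,kf->ijk', A, B, C)

# Example: a rank-3 CP tensor of size 10 x 8 x 6.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))
B = rng.standard_normal((8, 3))
C = rng.standard_normal((6, 3))
X = cp_to_tensor(A, B, C)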

2 Problem Setup

2.1 Data model

Let be the unknown tensor whose entries we wish to estimate. We assume that admits a CP decomposition such that the CP factors , , are entry-wise bounded: , , . Furthermore, we will assume that is sparse: . Then can be decomposed as follows

is also entry-wise bounded, say by 1. Such tensors have a rank upper bounded by .

2.2 Observation setup

We assume that we measure a noisy version of at some random subset of the entries . We generate via an independent Bernoulli model with parameter as follows: first generate i.i.d. Bernoulli random variables with and then the set is obtained as . Conditioned on , in the case of an additive noise model we obtain noisy observations at the locations of as follows

where the ’s are i.i.d. noise entries.
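A minimal sketch of this observation model, assuming a ground-truth tensor X_true, a Bernoulli sampling rate gamma, and additive Gaussian noise with standard deviation sigma (all names and the Gaussian choice are assumptions used for illustration; the theory allows more general i.i.d. noise):

import numpy as np

def sample_observations(X_true, gamma, sigma, rng):
    # Independent Bernoulli(gamma) mask over the entries of the tensor.
    mask = rng.random(X_true.shape) < gamma
    # Additive i.i.d. Gaussian noise at the observed locations only.
    noise = sigma * rng.standard_normal(X_true.shape)
    Y = np.where(mask, X_true + noise, 0.0)
    return Y, mask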

2.3 Estimation procedure

Our goal is to obtain an estimate of the full true tensor using the noisy sub-sampled measurements . We pursue a complexity-regularized maximum likelihood approach to achieve this goal. To this end, we first note that the observations have a distribution parameterized by the entries of the true tensor, and the overall likelihood is given by

where is the pdf of observation which depends on the pdf of the noise and is parametrized by . We use the shorthand notation to denote the entries of the tensor sampled at the indices in .

Using prior information that is sparse, we regularize with respect to the sparsity of and obtain the complexity-regularized maximum likelihood estimate of as given below

where is the regularization parameter and is a class of candidate estimates. Specifically, we take to be a finite class of estimates constructed as follows: first choose some and set . Construct to be the set of all matrices whose elements are discretized to one of uniformly spaced between ; similarly, construct to be the set of all matrices whose elements are discretized to one of uniformly spaced between ; finally, let be the set of all matrices whose elements are either zero or discretized to one of uniformly spaced between . Then, we let

and we let be any subset of .
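The discretized candidate class can be mimicked numerically by rounding factor entries onto a uniform grid. The sketch below assumes symmetric bounds [-bound, bound] and 2**beta quantization levels; both symbol choices are assumptions, since the exact constants are elided in the text above.

import numpy as np

def quantize(M, bound, beta):
    # Map each entry of M to the nearest of 2**beta levels uniformly spaced on [-bound, bound].
    levels = np.linspace(-bound, bound, 2 ** beta)
    idx = np.argmin(np.abs(M[..., None] - levels), axis=-1)
    return levels[idx]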

3 Main result

In this section we present our main result, which provides an upper bound on the error of the estimate obtained by solving .

The proof appears in Appendix Section 9.1.

The above theorem extends the main result of [8] to the tensor case. It states a general result that bounds the log affinity between the distributions parameterized by the estimated tensor and the ground-truth tensor. The Hellinger affinity is a measure of the closeness of two probability distributions, and it can be used to obtain bounds on the quality of the estimate. As in [8], the main utility of this theorem is that it can be instantiated for noise distributions of interest such as Gaussian, Laplace, and Poisson. Note that since the estimation procedure depends only on the likelihood term, the above theorem can also be extended to non-linear observation models such as 1-bit quantized measurements [8]. We next demonstrate the utility of the above theorem by presenting error guarantees when the additive noise follows a Gaussian distribution.

3.1 Gaussian Noise Case

We examine the implications of Theorem ? in a setting where observations are corrupted by independent additive zero-mean Gaussian noise with known variance. In this case, the observations are distributed according to a multivariate Gaussian density of dimension whose mean corresponds to the tensor entries at the sample locations and with covariance matrix , where is the identity matrix of dimension . That is,
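A hedged reconstruction of the density referred to above (with S denoting the sample set, sigma^2 the known noise variance, and Y, X placeholder symbols for the observations and the candidate tensor; all of this notation is assumed here) is

\[
p_{\underline{X}}(Y_S) \;=\; \prod_{(i,j,k)\in S} \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\frac{\bigl(Y_{i,j,k}-X_{i,j,k}\bigr)^2}{2\sigma^2}\right),
\]

so that the negative log-likelihood reduces, up to an additive constant, to the squared error \(\tfrac{1}{2\sigma^2}\sum_{(i,j,k)\in S}(Y_{i,j,k}-X_{i,j,k})^2\) over the sampled entries.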

In order to apply Theorem ? we choose as:

Then, we fix , and obtain an estimate according to with the value chosen as

In this setting we have the following result.

The proof appears in Appendix Section 9.3.

4 The Algorithmic framework

In this section we propose an ADMM-type algorithm to solve the complexity-regularized maximum likelihood estimation problem in . We note that the feasible set of the problem in is discrete, which makes the algorithm design difficult. Similar to [8], we drop the discreteness assumption in order to use continuous optimization techniques. This may be justified by choosing a very large value of and by noting that continuous optimization algorithms, when executed on a computer, use finite-precision arithmetic and thus a discrete set of points. Hence, we consider the design of an optimization algorithm for the following problem:

We form the augmented Lagrangian for the above problem

where is the Lagrange multiplier of size associated with the tensor equality constraint and are indicator functions of the sets , , , respectively. Starting from the augmented Lagrangian, we extend the ADMM-type algorithm proposed in [8] to the tensor case as shown in Algorithm ?.

5 The algorithm

The update in Algorithm ? is separable across components, so it reduces to scalar problems. Furthermore, the scalar problem has a closed-form solution for and is a proximal-type step for . This is a particularly attractive feature because many common noise densities (e.g., Gaussian, Laplace) have closed-form proximal updates [8]. The and updates can be converted to constrained least squares problems and solved via projected gradient descent. We solve the update via iterative hard thresholding. Although the convergence of this algorithm to a stationary point remains an open question and a subject of future work, we have not encountered problems with it in our simulations.
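The basic building blocks of these updates can be sketched as follows. This is a simplified illustration rather than the exact algorithm from the paper; the function names, the box-constraint form, and the Gaussian-noise proximal step are all assumptions.

import numpy as np

def project_box(M, bound):
    # Entry-wise projection onto the box [-bound, bound] (used in the factor updates).
    return np.clip(M, -bound, bound)

def hard_threshold(M, k):
    # Keep the k largest-magnitude entries of M and zero out the rest (sparse-factor update).
    out = np.zeros_like(M)
    if k <= 0:
        return out
    idx = np.argsort(np.abs(M), axis=None)[-k:]
    out.flat[idx] = M.flat[idx]
    return out

def prox_gaussian_ls(Z, Y, mask, sigma2, rho):
    # Closed-form proximal step for the Gaussian negative log-likelihood on observed entries:
    # argmin_X (1/(2*sigma2)) * ||mask*(Y - X)||^2 + (rho/2) * ||X - Z||^2.
    X = Z.copy()
    X[mask] = (Y[mask] / sigma2 + rho * Z[mask]) / (1.0 / sigma2 + rho)
    return X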

6 Numerical Experiments

In this section we include simulations which corroborate our theorem. For each experiment we construct the true data tensor by individually constructing the CP factors (as described below), where the entries of the true factors , , and are bounded in magnitude by and , respectively. For the purposes of these experiments we fix and .

For a given , the true CP factors were generated as random matrices of dimensions , , with standard Gaussian entries. We then projected the entries of the and matrices so that and . For the matrix we first project entry-wise onto the interval , then pick entries uniformly at random and zero out all other entries so that we obtain the desired sparsity . From these factors the tensor was calculated as in .

We then take measurements at a subset of entries following a Bernoulli sampling model with sampling rate and corrupt the measurements with additive white Gaussian noise of variance to obtain the final noisy measurements. The noisy measurements were then used to compute the estimate by solving (an approximation to) the complexity-regularized problem in using Algorithm ?. Note that for Gaussian noise the negative log-likelihood in problem reduces to a squared error loss over the sampled entries. Since in practice the parameters , , are not known a priori, we assume we have upper bounds for them and in our experiments set them as . Further, we also assume that is known a priori.
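The data generation described above can be reproduced with a short script. The specific dimensions, rank, sparsity, bounds, and noise level below are placeholders rather than the values used in the paper.

import numpy as np

rng = np.random.default_rng(1)
n1, n2, n3, F = 30, 30, 30, 5                       # placeholder dimensions and rank
sparsity, a_max, gamma, sigma = 50, 1.0, 0.3, 0.1   # placeholder parameters

# CP factors with standard Gaussian entries, projected to the allowed entry-wise ranges.
A = np.clip(rng.standard_normal((n1, F)), -a_max, a_max)
B = np.clip(rng.standard_normal((n2, F)), -a_max, a_max)
C = np.clip(rng.standard_normal((n3, F)), -a_max, a_max)

# Keep only `sparsity` entries of C chosen uniformly at random; zero out the rest.
keep = rng.choice(C.size, size=sparsity, replace=False)
C_sparse = np.zeros_like(C)
C_sparse.flat[keep] = C.flat[keep]

# Form the ground-truth tensor and the noisy, sub-sampled observations.
X_true = np.einsum('if,jf,kf->ijk', A, B, C_sparse)
mask = rng.random(X_true.shape) < gamma
Y = np.where(mask, X_true + sigma * rng.standard_normal(X_true.shape), 0.0)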

In Figure 1 we show how the log per-entry squared error decays as a function of the log sampling rate for the two ranks considered in the paper and a fixed sparsity level . The plot is obtained after averaging over trials so as to average out the randomness of the Bernoulli sampling at a given sampling rate and of the noise. Each curve corresponds to a single chosen value of , selected as the value that gives a representative error curve (e.g., the lowest overall curve over the range of parameters we considered). Our theoretical results predict that the error decay should be inversely proportional to the sampling rate; viewed on a log-log scale, this corresponds to a slope of -1. The curves for the two ranks are shown as a blue solid line and a red dotted line. In both cases the slope at the higher sampling rates is approximately -1. These experimental results therefore validate both the theoretical error bound in Corollary ? and the performance of our proposed algorithm.
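The reported slope can be checked by a least-squares line fit to the log-log data; a minimal sketch follows, where the arrays gammas and per_entry_errors are hypothetical placeholders for the sampling rates and the averaged per-entry squared errors.

import numpy as np

def loglog_slope(gammas, per_entry_errors):
    # Fit log(error) = slope * log(gamma) + intercept; the theory predicts a slope close to -1.
    slope, _ = np.polyfit(np.log(gammas), np.log(per_entry_errors), 1)
    return slope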

Figure 1: Plot of the log per-entry approximation error \log\left(\frac{\|\hat{\underline{X}} - \underline{X}^*\|_F^2}{n_1 n_2 n_3}\right) vs. the log sampling rate \log\left(\gamma\right) for the two ranks F = 5, 15. The slope at the higher sampling rates is approximately -1 (the rate predicted by our theory) in both cases.

7 Conclusion and Future Directions

In this work we extend the statistical theory of complexity-penalized maximum likelihood estimation developed in [8] to noisy tensor completion for tensors admitting a CP decomposition with a sparse factor. In particular, we provide theoretical guarantees on the performance of sparsity-regularized maximum likelihood estimation under a Bernoulli sampling assumption and general i.i.d. noise, and we instantiate the general result for the specific case of additive white Gaussian noise. We also provide an ADMM-based algorithmic framework to solve the complexity-penalized maximum likelihood estimation problem and present numerical experiments on synthetic data to validate the theoretical bounds.

Obtaining error bounds for other noise distributions and for non-linear observation settings such as 1-bit quantized observations is an interesting research direction. Extending the main result to approximately sparse CP factors or to tensors with multiple sparse CP factors is also an important direction for future research.

8 Acknowledgements

We thank Professor Nicholas Sidiropoulos for his insightful guidance and discussions on tensors, which helped in the completion of this work. Swayambhoo Jain and Jarvis Haupt were supported by the DARPA Young Faculty Award, Grant N66001-14-1-4047. Alexander Gutierrez was supported by the NSF Graduate Research Fellowship Program under Grant No. 00039202.

9 Appendix

9.1 Proof of Main Theorem

The proof of our main result is an application of the following general lemma.

The proof appears in Appendix Section 9.2.

To use the result in Lemma ?, we need to define penalties on candidate reconstructions of so that for every subset of the set specified in the conditions of Theorem ? the summability condition holds. To this end, we will use the fact that for any we always have ; thus, it suffices to show that for the specific set described in , the penalty satisfies the Kraft-McMillan inequality:
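A standard statement of this inequality, with pen(·) denoting the penalty and the calligraphic set denoting the candidate class (this notation is assumed here, since the original display is not recoverable), is

\[
\sum_{\underline{X} \in \mathcal{X}} 2^{-\mathrm{pen}(\underline{X})} \;\le\; 1 .
\]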

The Kraft-McMillan inequality is automatically satisfied if we set the penalties to be the code lengths of some uniquely decodable binary code for the elements [25].

We utilize a common encoding strategy for encoding the elements of and . We encode each entry of these matrices using bits; in this manner, the total number of bits needed to code any element of and is and , respectively. Since the elements of the set are sparse, we follow a two-step procedure: first we encode the locations of the non-zero elements using bits, where , and then we encode each entry using bits. Now, we let be the set of all such with CP factors , , , and let the code for each be the concatenation of the (fixed-length) code for , followed by the (fixed-length) code for , followed by the (variable-length) code for . It follows that we may assign penalties to all whose lengths satisfy
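Putting the pieces together, the total code length for a candidate with sparse factor C would be on the order of

\[
\beta\, F\,(n_1 + n_2) \;+\; \|C\|_0\left(\big\lceil \log_2(n_3 F) \big\rceil + \beta\right)
\]

bits, where beta denotes the number of bits per quantized entry; the symbols n_1, n_2, n_3, F, beta, and C are assumed notation here, since the corresponding quantities are elided in the text above, and the penalties may then be taken to be at least these code lengths.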

Since such a code is by construction uniquely decodable, the Kraft-McMillan inequality gives . Further, this means that the inequality in Lemma ? is satisfied for as defined in the statement of Theorem ?. Now, for any set and using the coding strategy described above, the condition in Lemma ? is satisfied. So for randomly subsampled and noisy observations our estimates take the form

Further, when satisfies , we have

Finally, we let and use the relation that

which follows by our selection of and and the fact that and . Using the condition and in Lemma ? it follows that for

the estimate

satisfies the bound in Theorem ?.

9.2 Proof of Lemma

The main requirement for the proof of this lemma is to show that our random Bernoulli measurement model is “good” in the sense that it will allow us to apply some known concentration results. Let be an upper bound on the KL-divergence of from over all elements :

Similarly, let be an upper bound on negative two times the log of the Hellinger affinities between the same:

Let be the expected total number of measurements and let be the ratio of measured entries to total entries. Given any , define the “good” set as the subset of all possible sampling sets that satisfy a desired property:

We show in the following lemma that an Erdős-Rényi model with parameter will be “good” with high probability.

Note that is defined in terms of the intersection of two events; define them to be

and

We apply the union bound to find that

and we will prove the theorem by showing that each of the two probabilities on the right-hand side is less than , starting with .

Since the observations are conditionally independent given , we know that for fixed ,

where Bernoulli(). We will show that random sums of this form concentrate around their means using the Craig-Bernstein inequality .

The version of the Craig-Bernstein inequality that we will use states the following: let be random variables such that the uniform bound holds for all . Let and be such that . Then

To apply the Craig-Bernstein inequality to our problem we first fix and define . Note that . We also bound the variance via

Then let and in to get that

Now use the fact that by definition to cancel out the square term to get:

Finally, we define , and simplify to arrive at

for any .

To obtain a uniform bound over all , define , use the bound in with , and apply the union bound over the class to find that

A similar argument (applying the Craig-Bernstein inequality and a union bound) can be applied to to obtain

This completes the proof of Lemma ?.

Given Lemma ?, the rest of the proof of Lemma ? is a straightforward extension of the already-published proof of Lemma A.1 in [8].

9.3 Proof of Corollary

We first establish a general error bound, which we then specialize to the case stated in the corollary. Note that for as specified and any , using the model we have

for any fixed . It follows that . Further, as the amplitudes of the entries of and are all upper bounded by , it is easy to see that we may choose . Also, for any and any fixed , it is easy to show that in this case

so that . It follows that

Now, to use Theorem ?, we first substitute the value of to obtain the following condition on

The above condition implies that the specific choice of given is valid if we want to invoke Theorem ?. So, fixing as given and using Theorem ?, the sparsity-penalized ML estimate satisfies the per-element mean-squared error bound

Notice that the above inequality is an oracle-type inequality, because it implies that for any we have

We use this inequality for a specific candidate reconstruction of the form where the entries of s are the closest discretized surrogates of the entries of , are the closest discretized