
# A Provably Correct and Robust Algorithm for Convolutive Nonnegative Matrix Factorization

©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Anthony Degleris and Nicolas Gillis

Anthony Degleris: Department of Electrical Engineering, Stanford University, Stanford CA, USA. E-mail: degleris@stanford.edu.
Nicolas Gillis: Department of Mathematics and Operational Research, Faculté Polytechnique, Université de Mons, Rue de Houdain 9, 7000 Mons, Belgium. E-mail: nicolas.gillis@umons.ac.be. NG acknowledges the support of the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no O005318F-RG47, and of the European Research Council (ERC starting grant no 679515).
###### Abstract

In this paper, we propose a provably correct algorithm for convolutive nonnegative matrix factorization (CNMF) under separability assumptions. CNMF is a convolutive variant of nonnegative matrix factorization (NMF), which functions as an NMF with additional sequential structure. This model is useful in a number of applications, such as audio source separation and neural sequence identification. While a number of heuristic algorithms have been proposed to solve CNMF, to the best of our knowledge no provably correct algorithms have been developed. We present an algorithm that takes advantage of the NMF model underlying CNMF and exploits existing algorithms for separable NMF to provably find the unique solution (up to permutation and scaling) under separability-like conditions. Our approach guarantees the solution in low noise settings, and runs in polynomial time. We illustrate its effectiveness on synthetic datasets, and on a singing bird audio sequence.

## 1 Introduction

Nonnegative matrix factorization (NMF) is a standard unsupervised learning technique for analyzing large datasets. Given an $M \times N$ nonnegative matrix $X$, NMF seeks an $M \times K$ matrix $W \geq 0$ (where the inequality is to be interpreted elementwise) and a $K \times N$ matrix $H \geq 0$ such that $X \approx WH$. NMF has been successfully applied to a number of practical problems; these include hyperspectral unmixing, text mining, audio source separation, and image processing; see [5, 13, 11] and the references therein. Let us discuss three important limitations of NMF, each of which we will address in this paper.

A first limitation of NMF is that it fails to capture local correlations in the data. For example, in imaging applications a column of $X$ may represent a pixel, and adjacent columns will often correspond to pixels adjacent to one another in the image. Neighboring pixels tend to be quite similar, especially in low contrast images. In audio or neuroscience datasets, each column is a certain instant in time, and therefore neighboring columns are often highly correlated. To capture these local correlations, Smaragdis [32] proposed a convolutive variant of NMF, known as convolutive NMF (CNMF). CNMF attempts to find matrices $W_1, \dots, W_L$ of size $M \times K$ and a matrix $H$ of size $K \times N$ such that $X \approx \sum_{\ell=1}^{L} W_\ell H S_{\ell-1}$, where $S_\ell$ is a square matrix that shifts the columns of $H$ by $\ell$ places to the right, zero-padding the leftmost columns. Explicitly, $S_\ell$ has ones on its $\ell$th upper diagonal and zeros elsewhere. It is often convenient to instead define an $L \times M \times K$ tensor $W$ and denote $W_\ell$ the resulting $M \times K$ matrix when the first index is fixed to $\ell$. Then another, perhaps more intuitive, definition of CNMF is to equivalently write $X \approx \sum_{k=1}^{K} W[:,:,k] * h_k$, where $*$ is the 2D-convolution operator that convolves the slice $W[:,:,k]$ with the corresponding time series along the time axis, $W[:,:,k]$ is the resulting matrix when the third index is fixed to $k$, and $h_k$ is the $k$th row of $H$. Thus, CNMF is a factorization of $X$ into a sum of 2D-convolutions. This characterization is visually apparent in Figure 1.
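To make the model concrete, the following NumPy sketch builds the shift matrices and checks that the shifted-matrix form of CNMF agrees with the sum-of-2D-convolutions view. (The paper's own code is in Julia; this standalone Python fragment is purely illustrative, with all dimensions chosen arbitrarily, and lags indexed from 0 for convenience.)

```python
import numpy as np

def shift_matrix(N, ell):
    """N x N matrix with ones on the ell-th upper diagonal; right-multiplying
    by it shifts the columns of a matrix ell places to the right,
    zero-padding the leftmost columns."""
    return np.eye(N, k=ell)

# Tiny example: M = 3 features, N = 6 time bins, K = 2 motifs, L = 2 lags.
rng = np.random.default_rng(0)
L, M, K, N = 2, 3, 2, 6
W = rng.random((L, M, K))   # W[ell] is the M x K factor at lag ell
H = rng.random((K, N))

# Shifted-matrix form: X = sum_ell W[ell] @ H @ S_ell
X = sum(W[ell] @ H @ shift_matrix(N, ell) for ell in range(L))

# Sum-of-2D-convolutions view: each motif W[:, :, k] is convolved with
# the k-th row of H along the time axis.
X_conv = np.zeros((M, N))
for k in range(K):
    for ell in range(L):
        X_conv[:, ell:] += np.outer(W[ell, :, k], H[k, :N - ell])

assert np.allclose(X, X_conv)
```

Both constructions produce the same data matrix, which is exactly the equivalence claimed above.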

Since its conception, CNMF has been found particularly useful for audio source separation [32, 31, 35] and neural sequence identification [27]. In general, CNMF will provide a more concise and interpretable factorization than NMF whenever the columns of $X$ exhibit local correlations, which is often the case when the columns represent points in time or space.

A second limitation of NMF is that solving NMF problems is challenging in practice. In fact, it is in general NP-hard [33], leading researchers to rely on a number of heuristic algorithms; see, e.g., [13] and the references therein. As CNMF is a generalization of NMF (the model is exactly NMF when $L = 1$), it inherits problems related to computational intractability. The first algorithm proposed was an averaged multiplicative update rule [32], which treats each pair $(W_\ell, H)$ as an NMF and updates it using an NMF multiplicative update [23], averaging the updates for $H$. Since then, numerous other algorithms have been proposed to fit CNMF, including multiplicative updates [31, 34, 18, 9] (which are derived using majorization minimization), projected alternating least squares [30], alternating nonnegative least squares (ANLS; [7]), and hierarchical alternating least squares (HALS; [7]). Many of these algorithms are generalizations of algorithms for NMF and perform well on both synthetic and experimental data. Several of them also have convergence guarantees to stationary points. However, all the aforementioned algorithms are heuristics, in the sense that there is no guarantee that they reach a global minimum.

A third limitation of NMF is that, in general, the NMF of a matrix is non-unique; that is, there are other factorizations $(W', H')$ such that $X = W'H'$, but $(W', H')$ is not merely a permutation and scaling of $(W, H)$. We therefore cannot be certain that the recovered factors correspond to the true sources that generated the data. The same problem arises with CNMF; see Section 5.1 for some examples. In practice, most researchers use additional regularization terms to promote some structure in the sought solution, such as sparsity [20]. Unfortunately, identifiability results for NMF are rather scarce; see the recent survey [11]. As far as we know, the only NMF model that is both tractable and identifiable is separable NMF [3]. When trying to identify an NMF $X = WH$, separable NMF makes the additional assumption that the columns of $W$ are contained somewhere in $X$, that is, $W = X[:, A]$ for some index set $A$. This assumption is called the separability assumption. This terminology was introduced in the paper of Donoho and Stodden [8], where it was shown that NMF has a unique solution under the separability assumption and additional sparsity conditions on $H$. However, this NMF model dates back to the 1990s in the hyperspectral imaging literature, where the separability assumption is referred to as the pure-pixel assumption; see [26] and the references therein. Since the paper of Arora et al. [3], which proved that the separable NMF model is unique, robust to noise, and computable in polynomial time, numerous algorithms for separable NMF have been proposed [14]. Moreover, this model has been used successfully in applications such as topic modeling [2], community detection [29], and the analysis of time-resolved Raman spectra [25], to cite a few. Of course, the separability assumption is rather strong and does not hold in all applications. Nevertheless, it has also been shown to be a powerful way to initialize more sophisticated NMF models, such as minimum-volume NMF models, which are identifiable but suffer from intractability [11].
One of the most popular and powerful separable NMF algorithms is the successive projection algorithm (SPA) [1], which is guaranteed to recover the columns of $W$, even in the presence of noise [16]. SPA is a greedy algorithm that sequentially identifies the columns of $W$ using successive orthogonal projections (see Algorithm 3 and Appendix A for more details). SPA is closely related to the modified Gram-Schmidt algorithm with column pivoting, and has been rediscovered many times; see the discussion in [13].

### 1.1 Contribution and Outline of the Paper

In this paper, we consider conditions under which the CNMF problem can be provably solved in polynomial time, even in the presence of noise, and has a unique solution (up to permutation and scaling of the factors). These conditions generalize the separability assumption discussed in the previous paragraph. We propose an algorithm that reduces CNMF to NMF with linear constraints and takes advantage of existing separability-based methods for NMF. This algorithm provably finds the solution to CNMF in polynomial time, both for exact problems and problems with bounded noise. In particular, we utilize SPA to locate the columns of one factor, then apply estimation, clustering, and sorting techniques to uncover the convolutive structure. We later generalize our approach to use any separable NMF algorithm and show how the choice of the separable NMF algorithm affects our assumptions, run-time, and noise tolerance. To the best of our knowledge, this is the first algorithm to provably solve the CNMF problem in either the absence or presence of noise. Hence, under separability-like assumptions, our approach resolves the three important limitations of NMF mentioned in the introduction.

The paper is organized as follows. In Section 2, we state the CNMF problem and formally define the convolutive separability assumptions. In Section 3, we propose the Locate-Estimate-Cluster-Sort Algorithm (LECS) for recovery and show that LECS provably finds the solution to the CNMF problem under the convolutive separability assumptions. In particular, Theorem 2 guarantees that LECS will find the unique optimal solution (up to permutation and scaling) in the absence of noise. Theorem 3 generalizes this result to problems with bounded noise. The proof of Theorem 3 is deferred to the Appendix.

##### Notation

The $i$th entry of a vector $x$ is denoted by $x_i$. For a matrix $A$, we use $A[i,:]$ and $A[:,j]$ to denote the $i$th row and the $j$th column of $A$, respectively. We will also use lower case letters for columns and rows, but define them explicitly first. The entry in the $i$th row and the $j$th column of $A$ will be denoted $A[i,j]$. A nonnegative vector or a nonnegative matrix is denoted using $x \geq 0$ and $A \geq 0$, where the inequalities are elementwise. Define $S_\ell$ as a square $N$-by-$N$ matrix with ones on its $\ell$th upper diagonal and zeros elsewhere. For vectors $x$, the $p$-norm for $p \geq 1$ is defined as $\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$. The maximum and minimum $p$-norms of a column of $A$ are denoted by $\|A\|_{+p,\mathrm{col}}$ and $\|A\|_{-p,\mathrm{col}}$, respectively. The Frobenius norm is denoted as $\|A\|_F$. We denote $\sigma_{\max}(A)$ to be the largest singular value of $A$, and $\sigma_{\min}(A)$ to be the smallest singular value of $A$. The condition number of a matrix induced by the 2-norm is denoted as $\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$. Given a matrix $A$, the diagonal matrix $\mathrm{diag}(A)$ is defined by $\mathrm{diag}(A)_{ii} = A_{ii}$.

## 2 Problem Setup and the Separability Conditions

Suppose there is a matrix $X$ generated as $X = \sum_{\ell=1}^{L} W_\ell H S_{\ell-1}$, where $W_\ell \geq 0$ for $\ell = 1, \dots, L$, and $H \geq 0$. Consider $\tilde{X} = X + N$, where $N$ is some matrix of noise. Given $\tilde{X}$, the convolutive NMF problem (or CNMF problem) is to approximately recover $W_1, \dots, W_L$ and $H$ (up to scaling and permutation). When $N = 0$ and hence $\tilde{X} = X$, this problem amounts to finding an exact convolutive NMF.

### 2.1 Reformulation as Constrained NMF

Our approach will take advantage of existing literature on NMF. In particular, we look to separable NMF algorithms and attempt to extend them to the convolutive case. In this vein, it is useful to reformulate the convolutive NMF model as an NMF model with linear constraints on one of the factors.

Consider the sum $X = \sum_{\ell=1}^{L} W_\ell H S_{\ell-1}$. Define $H_\ell = H S_{\ell-1}$ for $\ell = 1, \dots, L$. Now define

$$V = \begin{bmatrix} W_1 & \cdots & W_L \end{bmatrix} \quad \text{and} \quad G = \begin{bmatrix} H_1 \\ \vdots \\ H_L \end{bmatrix}. \tag{1}$$

It follows that $X = VG$. Now we may think of the exact CNMF problem as a restriction of the exact NMF problem: given $X$, our goal is to find $V \geq 0$ and $G \geq 0$ such that $X = VG$ and $G$ is defined as in (1), that is, each block $H_\ell$ is a shifted version of the first block $H_1 = H$. The above formulation further elucidates the relationship between NMF and CNMF. When $L = 1$, CNMF reduces to NMF. On the other hand, any CNMF (given by $W_1, \dots, W_L$ and $H$) is also an NMF (given by $V$ and $G$). Therefore, when the $V$ and $G$ found from NMF have the form in (1), a CNMF has also been identified. Given the reformulation of (1), our approach is to utilize separable NMF algorithms to locate the columns of $V$ within $X$ and then estimate $G$. Once these matrices have been identified up to permutation and scaling, we use clustering and sorting methods to identify $W_1, \dots, W_L$ and $H$. We name this approach the Locate-Estimate-Cluster-Sort Algorithm (LECS); see Algorithm 1, which is described in detail in Section 3.
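The reformulation (1) is easy to check numerically. The following NumPy sketch (illustrative only, with arbitrary dimensions) stacks the lag factors into $V$ and the shifted copies of $H$ into $G$, and verifies that their product equals the convolutive sum:

```python
import numpy as np

def shift_matrix(N, ell):
    # ones on the ell-th upper diagonal
    return np.eye(N, k=ell)

rng = np.random.default_rng(1)
L, M, K, N = 3, 4, 2, 8     # arbitrary illustrative sizes
W = rng.random((L, M, K))   # W[ell] is the lag-ell factor (0-based lags)
H = rng.random((K, N))

# Stack the lag factors side by side, and the shifted copies of H on top
# of one another, exactly as in (1).
V = np.hstack([W[ell] for ell in range(L)])                    # M  x LK
G = np.vstack([H @ shift_matrix(N, ell) for ell in range(L)])  # LK x N

X = sum(W[ell] @ H @ shift_matrix(N, ell) for ell in range(L))
assert np.allclose(V @ G, X)   # CNMF is an NMF with linearly constrained G
```

The block structure of $G$ is precisely the linear constraint that distinguishes CNMF from an arbitrary NMF of the same rank.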

### 2.2 Convolutive Separability

To introduce a notion of separability for the convolutive NMF model, we need to address two relevant details that are not present in the standard NMF model. First, to handle time dependencies, we need some notion of distance up to a shift. The following definition formalizes this notion mathematically.

###### Definition 1 (L-shift similarity).

Consider two vectors $g_i, g_j \in \mathbb{R}^N$. The $L$-shift cosine similarity between $g_i$ and $g_j$ is denoted $\mathrm{sim}_L(g_i, g_j)$ and given by

$$\max_{\ell = 0, \dots, L-1} \; \max\Big( \cos\big(S_\ell^T g_i,\, g_j\big),\ \cos\big(g_i,\, S_\ell^T g_j\big) \Big),$$

where $\cos(x, y) = \frac{x^T y}{\|x\|_2 \|y\|_2}$. This is exactly the cosine of the minimum angle between $g_i$ and $g_j$ over all shifts of length at most $L - 1$.

The utility of the $L$-shift similarity comes from the fact that $\mathrm{sim}_L(g_i, g_j) = 1$ is equivalent to the statement that either $g_i = c\, S_\ell^T g_j$ or $g_j = c\, S_\ell^T g_i$ for some $\ell \in \{0, \dots, L-1\}$ and for some scalar $c > 0$.
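A direct NumPy transcription of Definition 1 (an illustrative sketch; the function names are ours, not the paper's):

```python
import numpy as np

def cosine(x, y):
    """Cosine of the angle between x and y."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def shift(g, ell):
    """Right-shift g by ell entries, zero-padding the front (S_ell^T g)."""
    out = np.zeros_like(g)
    out[ell:] = g[:len(g) - ell]
    return out

def sim_L(gi, gj, L):
    """L-shift cosine similarity of Definition 1."""
    return max(
        max(cosine(shift(gi, ell), gj), cosine(gi, shift(gj, ell)))
        for ell in range(L)
    )

g = np.array([1.0, 2.0, 3.0, 0.0, 0.0])
g_shifted_scaled = 3.0 * shift(g, 2)       # shifted copy, rescaled
assert np.isclose(sim_L(g, g_shifted_scaled, L=3), 1.0)
```

The final assertion illustrates the equivalence stated above: two vectors that are equal up to a shift of at most $L-1$ and a positive scaling have similarity exactly one.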

The second detail is minor but essential. Separable NMF algorithms like SPA require that each column of $X$ lie in the convex hull of the columns of $W$ (see Appendix A for more details). Nevertheless, when this does not hold we can still scale the columns of $X$ and $W$ to sum to one, and thus apply SPA. This is not the case in the convolutive NMF model, since the magnitude of the sequence may change over time (for example, a crescendo in an audio sequence). Thus, we will have to allow arbitrary positive scaling of the columns of $V$ by some positive diagonal matrix $D$, and apply SPA to a rescaled version of the data. With these details in mind, we can now formulate a concept of separability for the CNMF problem. We first provide a definition, then explain each condition in more detail.

###### Definition 2 (Convolutive separable).

Given a CNMF problem with inputs $\tilde{X} = X + N$, $K$, and $L$, we say the problem is convolutive separable with respect to an index set $A$, a noise level $\epsilon \geq 0$, and a diagonal matrix $D$ with strictly positive diagonal entries, if the following conditions are satisfied:

(A) [Separable] Each column of $V$ appears, up to positive scaling, as a column of $X$. Explicitly, there are several equivalent ways to express this condition.

The matrix $X$ satisfies $X[:, A] = V D$ for some index set $A$.

The matrix $G$ satisfies $G[:, A] = D \Pi$ for some positive diagonal matrix $D$ and permutation matrix $\Pi$.

(B) [Sequentially unique]

For any two distinct rows of $H$, the $L$-shift similarity between them is strictly less than one.

For any row $h$ of $H$ and any shift $\ell \in \{1, \dots, L-1\}$, we have $\cos(h, S_\ell^T h) < 1$.

(C) [Full Rank] The matrix $V$ is full rank, that is, $V$ has rank $LK$.

(D) [Bounded noise] The noise matrix $N$ satisfies $\|N\|_{+1,\mathrm{col}} \leq \epsilon$; that is, every column of $N$ has 1-norm at most $\epsilon$.

Definition 2 guarantees a unique solution in the following sense.

###### Theorem 1 (Unique Solution).

Suppose $X$ satisfies (A), (B), and (C). Then any other CNMF of $X$ that satisfies (A), (B), and (C) differs at most by permutation and scaling, in the sense that its factors equal $W_1, \dots, W_L$ and $H$ up to a permutation matrix $\Pi$ and a positive diagonal matrix $D$.

The proof of this result follows directly from Theorem 2, since our proposed algorithm (namely, LECS described in Algorithm 1) is deterministic and guaranteed to recover the underlying CNMF up to permutation and scaling. The same approach entails the uniqueness of a solution to a noisy CNMF problem among solutions satisfying the conditions of Theorem 3.

The first condition [Separable] is inherited directly from the separable NMF literature and is used only to guarantee that a separable NMF algorithm can identify the columns of $V$. One notable exception is that in separable NMF problems, one usually requires the columns of $W$ to appear unscaled among the columns of $X$. In this paper, we relax this condition and only require them to appear up to scaling by some diagonal matrix $D$ with strictly positive diagonal entries. This weaker constraint allows the columns of $V$ to be arbitrarily scaled in the matrix $X$. However, this more general condition makes the problem no harder, since we can use a separable NMF algorithm to identify the (scaled) columns of $V$ and use this scaled matrix throughout the rest of our procedure. In Section 4, we show that [Separable] can actually be weakened to the more general sufficiently scattered condition.

The second condition, [Sequentially unique], tells us that each row of $H$ must differ from all other rows and their shifted variants. Moreover, it tells us that each row must be distinct from its own shifted variants. Since we are generally unconcerned with the scale of a row of $H$, the cosine of the angle between two vectors serves as a useful scale-invariant measure of similarity (if the vectors have mean zero, it is exactly the correlation between the two vectors). [Sequentially unique] will be an important condition after we have a permuted estimate of $G$, since we will need to rearrange the rows to obtain a matrix that satisfies the constraints of (1). In practice, this is a very reasonable assumption; all we require is (a) that any two factors not be highly correlated (in which case we should just have one factor), and (b) that no factor be periodic with a period smaller than $L$ (in which case we can choose a smaller $L$).

In our approach, [Full Rank] guarantees that SPA terminates successfully and locates the columns of $V$. We also use it to bound the error when estimating $G$ via nonnegative least squares. However, other algorithms like the Successive Nonnegative Projection Algorithm (SNPA) from [12] do not require [Full Rank]. Similarly, [24] provides least squares robustness results in the rank-deficient case. Although we assume [Full Rank] in our results to simplify our analysis, it can be weakened to the condition that no column of $V$ is contained in the convex cone generated by the other columns of $V$. We elaborate on this in Section 4.

[Bounded noise] bounds the 1-norm of any column of the noise matrix $N$. When $N = 0$, this is clearly satisfied with $\epsilon = 0$. Otherwise, we will later see that if $\epsilon$ is sufficiently small, we can still recover noisy estimates of the factors. Note that this bound differs slightly from the noise bounds in many separable NMF algorithms. In particular, most separable NMF algorithms require bounds on the 2-norm of any column of the noise matrix, whereas we bound the 1-norm. Since $\|x\|_2 \leq \|x\|_1$ for any vector $x$, we make a stronger assumption on the noise. This stronger condition is intimately related to the following. In separable NMF, we require the columns of $X$ to be suitably normalized. In practical applications where this might not hold, we scale the columns of $X$ so that this assumption is satisfied and show that the new problem is equivalent; see for example [16]. This scaling trick does not apply to CNMF, however, since the matrix $G$ is difficult to appropriately scale in the expression $X = VG$. In Appendix A, we explain this scaling trick and its limitations in more detail.
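The scaling trick for plain separable NMF mentioned above can be sketched in a few lines of NumPy. In the noiseless, nonnegative case, rescaling every column to unit 1-norm turns conic combinations into convex ones (the matrices below are random stand-ins, not data from the paper):

```python
import numpy as np

# Columns of X are conic (nonnegative) combinations of the columns of W,
# with arbitrary positive scale.
rng = np.random.default_rng(3)
W = rng.random((4, 2)) + 0.1
C = rng.random((2, 5)) * 5.0      # nonnegative combination weights
X = W @ C

# Rescale every column to unit 1-norm. Afterwards each column of Xn is a
# *convex* combination of the columns of Wn, so SPA-style reasoning applies.
Xn = X / X.sum(axis=0)
Wn = W / W.sum(axis=0)

Theta = np.linalg.lstsq(Wn, Xn, rcond=None)[0]
assert np.allclose(Wn @ Theta, Xn)            # exact representation
assert np.allclose(Theta.sum(axis=0), 1.0)    # weights sum to one
assert (Theta >= -1e-9).all()                 # and are nonnegative
```

As the text explains, no analogous rescaling of $G$ is available in the convolutive model, which is why the stronger 1-norm noise bound is assumed instead.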

Is convolutive separability reasonable in practice? In challenging real-world scenarios, the convolutive separability condition will most likely be violated, because several sources will overlap in all time windows. Also, the noise level will typically be higher than what our bounds allow (see Theorem 3). These two issues are analogous to those encountered by separable NMF. Nevertheless, as explained in the introduction, separable NMF has been used successfully in many challenging real-world problems, either directly or as an initialization strategy. Accordingly, we believe many real-world signals can be approximated well under the convolutive separability assumption because, although several sources are active in all time windows, some time windows will contain a single dominant source. Moreover, as will be shown in the experiments, our approach leads to efficient initialization strategies in challenging scenarios. Additionally, our approach can be generalized in a number of different ways (see Section 4). For example, the separability assumption can be relaxed to the sufficiently scattered condition by using minimum-volume NMF instead of separable NMF.

## 3 Algorithm Description and Recovery Guarantee

In this section, we first describe in detail our proposed algorithm (LECS, Algorithm 1) in Section 3.1, and then prove its correctness under the convolutive separability assumptions in the noiseless case (Section 3.2) and in the presence of noise (Section 3.3).

### 3.1 Description of the LECS Algorithm

At a high level, our approach will be as follows: we will use the successive projection algorithm (SPA, [16]) to identify the column indices of $\tilde{X}$ that correspond to scaled columns of $V$, then leverage this noisy version of $V$ to obtain some estimate $\tilde{G}$ of $G$. Finally, we will cluster and sort the rows of $\tilde{G}$ to find the rows of $H$ up to some error. The proposed Locate-Estimate-Cluster-Sort (LECS) algorithm is broken down into four main steps.

Locate. First, LECS ‘locates’ the column indices of $\tilde{X}$ that correspond to $V$; the columns of $V$ are present in $X$ up to scaling, and accordingly their noisy variants are present in $\tilde{X}$, so locating the index set $A$ such that $\tilde{X}[:, A] \approx V D$ will identify a scaled, permuted variant of $V$. The algorithm uses a modified version of SPA [16], which we call Oracle Conic SPA (OrConSPA); see Algorithm 3 and Appendix A for more details. This method first removes the columns of $\tilde{X}$ that have a low signal-to-noise ratio using a hyperparameter $t$. Then, the algorithm rescales the columns of $\tilde{X}$ in order to approximately project the columns of $X$ from the conic hull of the columns of $V$ onto the convex hull of the columns of $V$ (the projection is approximate because of the noise). Finally, the algorithm applies SPA to recover the index set $A$. The name Oracle Conic SPA comes from the ‘oracle’ used to select the hyperparameter $t$ and from the fact that the algorithm allows the columns of $X$ to lie in the conic hull of the columns of $V$, in contrast to standard SPA, which requires them to lie in the convex hull. This method runs in polynomial time [16], applying SPA being the most expensive step.
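For reference, here is a minimal sketch of standard SPA, the core of the Locate step. OrConSPA adds the thresholding and rescaling described above, which we omit here; this fragment only illustrates the greedy projection idea on a small separable toy problem (all sizes and data are arbitrary):

```python
import numpy as np

def spa(X, r):
    """Successive projection algorithm: r times, pick the column with the
    largest 2-norm, then project every column onto the orthogonal
    complement of the chosen column. Returns the selected indices."""
    R = X.astype(float).copy()
    idx = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))
        idx.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)        # R <- (I - u u^T) R
    return idx

# Separable toy data: the first 3 columns of X are the columns of W
# themselves; the rest are convex combinations of them.
rng = np.random.default_rng(2)
W = rng.random((5, 3)) + 0.1
H = np.hstack([np.eye(3), rng.dirichlet(np.ones(3), size=6).T])
X = W @ H
idx = spa(X, 3)   # should select the three "pure" columns
```

Because the 2-norm is strictly convex, the maximizing column at each step is always a vertex of the convex hull, which is why the pure columns are selected.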

Estimate. In this step, LECS estimates the rows of $G$ using nonnegative least squares (NNLS). Estimating $G$ is essential because the convolutive structure is contained in its rows, which are shifted variants of the rows of $H$. Solving a convex NNLS problem up to any given precision can be performed in polynomial time using an interior point method (IPM). However, IPMs are second-order methods and hence are computationally demanding. We therefore instead use the block pivot method from [22], which requires one least squares solve per iteration and almost always converges after a few iterations.
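The Estimate step reduces to one NNLS problem per column of the data. The following sketch uses SciPy's classic active-set solver in place of the block-pivot method of [22] (purely for illustration; on a noiseless full-rank instance the estimate is exact):

```python
import numpy as np
from scipy.optimize import nnls

def estimate_G(X, V):
    """Column-wise nonnegative least squares:
    argmin_{G >= 0} ||X - V G||_F^2, solved one column at a time."""
    G = np.zeros((V.shape[1], X.shape[1]))
    for j in range(X.shape[1]):
        G[:, j], _ = nnls(V, X[:, j])
    return G

rng = np.random.default_rng(4)
V = rng.random((6, 4))            # full column rank (almost surely)
G_true = rng.random((4, 7))
X = V @ G_true
G_hat = estimate_G(X, V)
assert np.allclose(G_hat, G_true, atol=1e-6)
```

Any NNLS solver can be dropped in here; the block-pivot method is preferred in practice because it shares factorizations across its few iterations.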

Cluster. LECS then clusters the rows of $\tilde{G}$ into $K$ groups according to which row of $H$ they are shifted variants of. One cluster should contain the shifted variants of one row of $H$. The algorithm achieves this by computing the $L$-shift similarity between every pair of rows in $\tilde{G}$, then greedily constructing clusters by adding the available row with the highest average similarity to the rows already in the cluster. This simple greedy procedure runs in polynomial time; see Algorithm 4 and Lemma A.8.
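A sketch of the greedy clustering idea (not the paper's exact Algorithm 4), operating on a precomputed matrix of pairwise $L$-shift similarities:

```python
import numpy as np

def greedy_cluster(S, K, L):
    """Greedily partition L*K rows into K clusters of size L, each time
    adding the unused row with the highest average similarity to the
    rows already in the cluster. S is a symmetric similarity matrix."""
    unused = set(range(S.shape[0]))
    clusters = []
    for _ in range(K):
        seed = min(unused)              # seed with an arbitrary unused row
        unused.remove(seed)
        cluster = [seed]
        while len(cluster) < L:
            best = max(unused,
                       key=lambda i: np.mean([S[i, j] for j in cluster]))
            unused.remove(best)
            cluster.append(best)
        clusters.append(cluster)
    return clusters

# Idealized similarities: rows {0,1} are shifts of one motif, {2,3} of another.
S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
assert greedy_cluster(S, K=2, L=2) == [[0, 1], [2, 3]]
```

Under the separability conditions, within-cluster similarities dominate cross-cluster ones, so this greedy rule recovers the true grouping.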

Sort. Finally, within each cluster $C_k$, LECS sorts the rows of $\tilde{G}$ based on their shifted similarity to the other rows in the cluster. The algorithm uses a comparison-based sorting algorithm with a comparison operator defined by shifted angle scores. In particular, for two indices $i$ and $j$, we have $i \preceq j$ if the $j$th row can be better expressed as a shifted copy of the $i$th row than the other way around, measured via the cosine between the two vectors. This comparison operator is guaranteed to produce a strict, consistent ordering when the criteria of Theorem 3 are satisfied. However, note that, when these conditions fail to hold, there is no guarantee that the comparison operator will produce a coherent ordering. Once this comparison operator has been defined, we can use any comparison-based sorting algorithm (such as merge sort or quicksort) to order the cluster; in our implementation, we use a simple selection sort. When the operator is not consistent, we select at each step the index which is less than or equal to the most other indices (breaking ties arbitrarily). This sorting procedure runs in polynomial time; see Algorithm 5 and Lemma A.9.
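The comparison operator can be sketched as follows (an illustrative reimplementation; the helper names are ours). Row $i$ precedes row $j$ when $j$ looks more like a right-shifted copy of $i$ than the other way around:

```python
import numpy as np
from functools import cmp_to_key

def cosine(x, y):
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    return float(x @ y / (nx * ny)) if nx > 0 and ny > 0 else 0.0

def shift(g, ell):
    """Right-shift by ell, zero-padding the front (S_ell^T g)."""
    out = np.zeros_like(g)
    out[ell:] = g[:len(g) - ell]
    return out

def explains(gi, gj, L):
    """How well gj is explained as a right-shifted copy of gi."""
    return max(cosine(shift(gi, ell), gj) for ell in range(1, L))

def sort_cluster(rows, L):
    """Order the rows of one cluster from least to most shifted
    (ties broken arbitrarily)."""
    def cmp(i, j):
        return -1 if explains(rows[i], rows[j], L) >= explains(rows[j], rows[i], L) else 1
    return sorted(range(len(rows)), key=cmp_to_key(cmp))

h = np.array([1.0, 2.0, 3.0, 0.0, 0.0, 0.0])
rows = [shift(h, 2), h, shift(h, 1)]      # lags 2, 0, 1, scrambled
assert sort_cluster(rows, L=3) == [1, 2, 0]
```

Here the unshifted row sorts first and the most-shifted row sorts last, which is the ordering needed to rebuild the blocks of (1).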

Finally, using the clustering and sorting, the algorithm reconstructs the factors by rearranging the columns of $V$ and the rows of $\tilde{G}$. Each row of $H$ is constructed by taking the de-shifted average of the rows of $\tilde{G}$ in a particular cluster. The total computational cost of Algorithm 1 is a polynomial number of operations plus the time to solve the nonnegative least squares problem.
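The de-shifted average can be sketched as follows (our reading of the reconstruction step, shown on an idealized cluster whose motif does not run off the edge of the window; scales are normalized away before averaging):

```python
import numpy as np

def shift(g, ell):
    """Right-shift by ell, zero-padding the front (S_ell^T g)."""
    out = np.zeros_like(g)
    out[ell:] = g[:len(g) - ell]
    return out

def shift_back(g, ell):
    """Left-shift by ell, zero-padding the end (undoes shift)."""
    out = np.zeros_like(g)
    out[:len(g) - ell] = g[ell:]
    return out

# A sorted cluster: rows at lags 0, 1, 2 of one motif row h, with
# arbitrary positive scales (the situation after Cluster + Sort).
h = np.array([0.0, 1.0, 2.0, 0.5, 0.0, 0.0])
rows = [1.0 * shift(h, 0), 2.0 * shift(h, 1), 0.5 * shift(h, 2)]

# De-shifted average: undo each row's lag, normalize away the scale,
# then average the aligned rows.
aligned = [shift_back(r, ell) / np.linalg.norm(r) for ell, r in enumerate(rows)]
h_hat = np.mean(aligned, axis=0)
assert np.allclose(h_hat, h / np.linalg.norm(h))
```

Averaging the aligned copies is what makes the final estimate more robust to noise than any single row of the estimated $G$.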

### 3.2 Guarantee for the Exact Problem

The exact problem is when $N = 0$. Then [Bounded noise] is satisfied with $\epsilon = 0$. In this case, Algorithm 1 provably recovers $H$ and each $W_\ell$, up to permutation and scaling, for any admissible choice of its parameters.

###### Theorem 2 (Exact Recovery).

Given a CNMF problem that is convolutive separable with $\epsilon = 0$, Algorithm 1 with inputs $X$, $K$, and $L$ recovers the unique factorization satisfying (A), (B), and (C), up to a permutation and scaling, in polynomial time; more precisely, a polynomial number of flops plus one NNLS solve.

###### Proof of Correctness.

Because of conditions [Full Rank] and [Separable], SPA is guaranteed to find the columns of $V$. [Full Rank] also guarantees that nonnegative least squares will return the rows of $G$. All that remains is to show that the grouping correctly recovers $H$ and each $W_\ell$.

Given any two rows of $G$ that are both shifted versions of some row of $H$, their similarity is 1. Rows that are not shifted variants of one another cannot have similarity equal to 1 by condition [Sequentially unique]. This means the grouping given by the clustering step is necessarily correct.

[Sequentially unique] also entails that the ordering from the sort step is correct. Since the ordering is correct, the construction of $H$ and each $W_\ell$ must also be correct up to a permutation and scaling of the rows of $H$.

Uniqueness follows from the success of the algorithm. Suppose we have two factorizations of $X$, both satisfying the conditions. Then given $X$, Algorithm 1 recovers the first factorization up to permutation and scaling. However, Algorithm 1 also recovers the second factorization up to permutation and scaling. Hence it must be that one factorization is a permuted, scaled version of the other, that is, the factors coincide up to some permutation matrix $\Pi$ and some diagonal matrix $D$ with positive diagonal elements. ∎

### 3.3 Guarantee for the Noisy Problem

Even when $N \neq 0$, we can still recover noisy estimates of the factors given sufficiently low noise levels.

###### Theorem 3 (Noisy Recovery).

Suppose a CNMF problem with inputs $\tilde{X}$, $K$, and $L$ is convolutive separable with respect to an index set $A$, a noise level $\epsilon$, and a diagonal matrix $D$, and consider some parameter $t > 0$. Suppose that

$$\epsilon + t < \|V_A\|_{-1,\mathrm{col}}, \tag{2}$$

where the constants involved are universal and independent of all other terms. Furthermore, without loss of generality assume that the problem is suitably normalized (see below). Then in polynomial time, more precisely a polynomial number of flops plus one NNLS solve, Algorithm 1 finds an estimate $\tilde{H}$ with bounded error, in the sense that there exists some permutation $P$ such that

$$\min_{1 \leq j \leq K} \cos\big(h_j,\, \tilde{h}_{P(j)}\big) \geq 1 - \delta^2,$$

where $h_j$ and $\tilde{h}_{P(j)}$ are the $j$th and $P(j)$th rows of $H$ and $\tilde{H}$, respectively.

The proof of this theorem is deferred to Appendix A.3, where we prove a more general statement in Lemma A.10. The normalization assumptions are only used to simplify (4) and make the bound easier to read. These assumptions are also made without any loss of generality, since we can always rescale the problem to an equivalent one. This scaling does not impact the output of the algorithm (beyond scaling), and cancelling terms shows it also does not affect inequality (2) and inequality (3). However, one should note that this scaling does impact the quantities appearing in inequality (4).

Algorithm 1 depends on selecting a good value for the parameter $t$; the best error bound is obtained for a well-chosen $t$. This parameter is used to threshold columns of $\tilde{X}$ with a small norm, since these columns will have relatively low signal-to-noise ratios and cause SPA to fail after scaling. In practice, one can run the algorithm several times using different values of $t$ and keep the best fit according to some heuristic (for example, mean squared reconstruction error). In fact, since the conditions of Theorem 3 imply that noisy versions of the columns of $V$ are present in $\tilde{X}$, we can be certain that a good value of $t$ is within $\epsilon$ of the norm of some column of $\tilde{X}$. So we could run the algorithm once per column, with $t$ set according to that column's norm, and guarantee that one of the runs yields the desired solution.

## 4 Generalizations

One strength of our approach is that any of the subroutines used for the four steps (location, estimation, clustering, and sorting) can be swapped out for different subroutines. For example, we use SPA to locate the columns of $V$, but could substitute any separable NMF algorithm in its place. In this section, we provide several examples of improvements to LECS obtained by appropriate substitutions and highlight the generality of the high-level algorithm.

##### Rank-deficient identification of V

We previously mentioned that [Full Rank] is required for two reasons: to guarantee the success of the SPA algorithm and to bound the error when estimating $G$ via nonnegative least squares. However, if we replace SPA with another separable NMF algorithm that does not require $V$ to have full column rank, then the first reason is no longer necessary. SNPA in particular essentially requires the weaker condition that no column of $V$ can be written as a conic combination of the other columns of $V$. By using this method in place of SPA, we can complete the location step under a weaker condition than [Full Rank]. This condition is also not necessary for estimating $G$, which we discuss later in the section.

##### Sufficiently Scattered Condition

Since [Separable] is only used in identifying the columns of $V$, we could also use the sufficiently scattered condition [21, 11], which is more general and allows one to determine $V$ by identifying the minimum-volume NMF of $X$. Unfortunately, this more general condition suffers from two drawbacks: first, it has yet to be proven robust to noise, and, second, there are currently no algorithms that are guaranteed to find the minimum-volume NMF in polynomial time [11] (this problem is NP-hard in general, but it remains unknown whether this is still true under the sufficiently scattered condition).

##### Rank-deficient Estimation of G

Under some assumptions, the matrix $V$ need not be full rank to properly estimate $G$. In particular, [24] gives perturbation bounds for least squares in the rank-deficient case in a more general setting; these bounds apply to our problem when the perturbed matrix has the same rank as $V$.

##### Robust Least Squares

When we estimate $G$, we solve the problem $\min_{G \geq 0} \|\tilde{X} - \tilde{V} G\|_F$ in an attempt to find some $\tilde{G}$ with rows similar to those of $G$. If we know the noise level in advance, or have some estimate of it, a better estimate is given by a robust optimization problem that minimizes the worst-case residual over the set of admissible noise matrices. In general, a number of least squares-like procedures may be used in place of standard nonnegative least squares in order to minimize the worst-case error.

##### Generalized Similarity Metric

The $L$-shift similarity measure utilizes the cosine of the angle between two vectors because it is scale invariant, allowing us to consider a vector and any positive rescaling of it as equivalent. The choice of similarity measure heavily influences the clustering and sorting steps, as seen in the proofs of Lemma A.8 and Lemma A.9. Therefore, a wise adjustment of this measure could lead to significant improvements in the final two steps of LECS. This might include a simple post-processing step, such as exponentiation, or an entirely different similarity measure. See [28, 10] for a more comprehensive review.

##### Spectral Clustering

Our greedy clustering algorithm is simple but suboptimal, in the sense that it does not take advantage of the global cluster structure. One more advanced approach is to use a spectral clustering method [28]. Let the matrix $M$ be defined as

$$M_{ij} = \begin{cases} 1 & \text{if } \exists\, k \in \{1, \dots, K\} \text{ such that } i, j \in C_k, \\ 0 & \text{otherwise}. \end{cases}$$

Let $v_1, \dots, v_K$ be the eigenvectors corresponding to the $K$ largest eigenvalues of $M$. Orient each eigenvector so that its entry with the largest magnitude is positive. Then the indices of the largest entries in $v_k$ form the representatives for cluster $C_k$. The challenge is to construct a similarity matrix close to $M$ using only the observed data. One option is to construct a matrix using the $L$-shift similarity, then set the largest entries to 1 and the remaining entries to 0. Under the conditions of Theorem 3, this construction recovers $M$, and therefore recovery is guaranteed. Results from matrix perturbation theory, such as the Davis-Kahan bound [6], suggest there is a strong theoretical justification for using a spectral clustering method, given the right construction of the similarity matrix.
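The eigenvector-based recovery can be sketched on an idealized similarity matrix (a stand-in for the thresholded $L$-shift similarities; within-cluster values are made distinct so the leading eigenvectors are unambiguous):

```python
import numpy as np

# Idealized similarity matrix for K = 2 clusters of L = 2 rows each.
S = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.9, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.8],
              [0.0, 0.0, 0.8, 1.0]])
K, L = 2, 2

vals, vecs = np.linalg.eigh(S)              # eigenvalues in ascending order
top = vecs[:, np.argsort(vals)[-K:]]        # K leading eigenvectors

clusters = []
for k in range(K):
    v = top[:, k]
    v = v * np.sign(v[np.argmax(np.abs(v))])       # largest-magnitude entry > 0
    clusters.append(frozenset(np.argsort(v)[-L:].tolist()))

assert set(clusters) == {frozenset({0, 1}), frozenset({2, 3})}
```

Each leading eigenvector concentrates its mass on one block of the similarity matrix, so its largest entries index one cluster.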

## 5 Numerical Experiments

We first test the LECS algorithm on synthetic data and verify it correctly finds the ground truth in low noise settings. Then, we demonstrate that LECS finds reasonable results on the spectrogram of a songbird and can be used as an effective initialization for other algorithms like multiplicative updates [31] and alternating nonnegative least squares (ANLS) [7].

All code is written in Julia [4] and available on GitHub at github.com/degleris1/CMF.jl. In particular, the code used at the time of publication is available in release v0.1. We use the code from github.com/ahwillia/NonNegLeastSquares.jl to solve NNLS problems via the pivot method. The figures for Sections 5.1 and 5.2 are produced by the Jupyter notebooks figures/sep_synth.ipynb and figures/sep_song.ipynb, and the LECS algorithm itself is available in the file src/algs/separable.jl. Note that this repository is available to other researchers as a tool for fitting CNMF models, as well as for rapidly developing and testing new CNMF algorithms.

### 5.1 Unstructured Synthetic Data

In this experiment, we consider synthetic data that is separable by construction, but lacks structure in the sense that matrices are generated randomly without constraints on important parameters like condition number. Since angle determines the uniqueness of a vector up to scaling and plays an important role in our theoretical analysis, we use it to measure performance. Specifically, after the algorithm estimates , the score is computed as the average cosine of the angles between the rows of and the rows of (minimized over all row permutations). This is formally given by

$$\mathrm{score}(H,\widetilde{H}) \;=\; \max_{P\in\mathcal{P}}\ \frac{1}{K}\sum_{k=1}^{K}\cos\!\big(H[k,:],\,\widetilde{H}[P(k),:]\big),$$

where is the set of all permutations. We test four different algorithms:

LECS. The LECS algorithm as presented in Algorithm 1.

LECS-Pre. The LECS algorithm with SPA replaced by heuristically preconditioned SPA from [17]. Heuristically preconditioned SPA computes the SVD of , then runs SPA on . Under the assumption that the data points are evenly distributed within the convex cone generated by , it is more robust than SPA [15]. This illustrates the flexibility of LECS with respect to the choice of the different building blocks; see Section 4.

Mult. The multiplicative updates algorithm for minimizing the Frobenius norm used in [31, 27]. This also corresponds to the -divergence update rule from [18, 9] when . The algorithm is initialized randomly and scaled using the method from [19]. The alternating update rules run for 5 seconds.

LECS-Mult. The same multiplicative updates algorithm as before, except using LECS as its initialization.
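The score above can be evaluated by brute force over row permutations. A minimal sketch in Python (the repository's code is in Julia; the matrices below are illustrative, chosen so that the estimate matches the ground truth up to permutation and scaling):

```python
import itertools
import numpy as np

def cos_rows(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def score(H, H_est):
    """Average cosine between matched rows of H and H_est, taken over
    the best row permutation (the one minimizing the angles)."""
    K = H.shape[0]
    return max(
        sum(cos_rows(H[k], H_est[p[k]]) for k in range(K)) / K
        for p in itertools.permutations(range(K))
    )

H = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])
H_est = np.array([[0.0, 6.0, 2.0],   # 2 * row 2 of H
                  [3.0, 0.0, 6.0]])  # 3 * row 1 of H
print(score(H, H_est))  # ≈ 1.0: exact recovery up to permutation/scaling
```

Enumerating all permutations costs factorial time, which is acceptable for the small values of K used in these experiments.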

For each , we let for all , where is the uniform distribution on the interval . We let for all , where is our sparsity parameter and is the Bernoulli distribution with parameter . To enforce the separability condition, we then set some entries of to zero as follows. For each row of , choose two random . For all satisfying either or , we set . Then we let . Finally, we ensure the chosen indices are sufficiently far apart from one another so that the same entries are never set to zero twice. This construction ensures that the first half of each sequence () is separable at one location, and the second half at another. We choose “half-sequences” to demonstrate that the full sequence need not be separable at the same location. To generate the noise, we vary a noise parameter and use one of three procedures: for all

Uniform. Set .

Gaussian. Set where .

Exponential. Set .
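The three noise procedures can be sketched as follows. Since the exact scalings above use symbols lost in this text, the distributions and the clamping to nonnegative values are illustrative assumptions, not the paper's exact formulas:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy(X, theta, kind):
    """Return X plus entrywise noise at level theta, clamped to stay
    nonnegative. The exact scalings are illustrative assumptions."""
    if kind == "uniform":
        N = theta * rng.uniform(size=X.shape)
    elif kind == "gaussian":
        N = theta * rng.standard_normal(size=X.shape)
    elif kind == "exponential":
        N = rng.exponential(scale=theta, size=X.shape)
    else:
        raise ValueError(f"unknown noise kind: {kind}")
    return np.maximum(X + N, 0.0)

X = rng.uniform(size=(5, 8))
for kind in ("uniform", "gaussian", "exponential"):
    Xn = noisy(X, 0.01, kind)
    assert Xn.shape == X.shape and (Xn >= 0).all()
```

Sweeping `theta` over a logarithmic grid, as done below, then traces out each algorithm's tolerance to increasing noise.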

We generate random matrices according to the above construction with , , , , . Then, for each matrix, the noise level is varied across different values spaced logarithmically between and .

The results of these experiments are displayed in Figure 2. We observe that for all three noise types, LECS finds the true for small noise levels, as expected by Theorem 3. Additionally, as expected, preconditioning improves its tolerance to noise. In comparison, Mult fails to find the true even in the lowest noise settings. In fact, although Mult finds a solution with low reconstruction error, it is not able to recover the ground truth factor : this is due to the non-uniqueness problem with CNMF (see Section 1). Indeed, Mult with LECS as an initialization retains the true solution and outperforms random initializations. At higher noise levels, the effects are varied. In particular, LECS-Mult exhibits different performance depending on the type of noise, and notably performs best (relative to random initialization) when the noise is Gaussian. This is likely because the update rule minimizes the reconstruction error in the Frobenius loss, which is the natural objective for Gaussian noise but not for uniform or exponential noise. A different loss function is likely necessary to identify the ground truth under these noise models; for example, entrywise -norm and -norm losses may be more appropriate for uniform and exponential noise, respectively.

### 5.2 Songbird Spectrogram

In practice, noise levels are often too high for the LECS algorithm to identify the ground truth, even when the separability assumptions are satisfied. However, LECS can be used as an initialization for other CNMF algorithms, both to achieve faster convergence than random initialization and to obtain better solutions.

In this experiment, we fit a spectrogram of a singing bird from [27]; the authors have kindly made this dataset publicly available at github.com/FeeLab/seqNMF. The spectrogram matrix has 141 rows (DFT bins) and 4440 columns (time bins). We fit the spectrogram using the LECS algorithm with , and a threshold value . Then, we use this initialization to run the Mult [31] and ANLS [7] algorithms for 60 and 15 iterations, respectively. Note that ANLS, as for NMF [22], solves the subproblem exactly for and using an active set method, whereas Mult is a gradient-like descent method. The threshold value was chosen after sweeping over all values from 0 to 50 (approximately the maximum 1-norm of a column of ).

Figure 3 displays the loss curves for Mult and ANLS using both random initialization and the output of LECS. The loss is measured as the relative mean square error, defined as

$$\mathrm{MSE}_{\mathrm{rel}} \;=\; \frac{\big\|\widetilde{X}-\sum_{\ell=1}^{L} W_\ell H S^{\ell-1}\big\|_F}{\|\widetilde{X}\|_F},$$

where are the estimated matrices.
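This relative error is straightforward to compute once the shift operator is made explicit. The sketch below (Python rather than the repository's Julia; dimensions are illustrative) assumes the standard CNMF convention in which multiplying by the shift matrix moves the columns of H to the right:

```python
import numpy as np

def shift(H, s):
    """Shift the columns of H to the right by s with zero padding,
    i.e. H @ S^s for the column-shift matrix S."""
    if s == 0:
        return H
    out = np.zeros_like(H)
    out[:, s:] = H[:, :-s]
    return out

def mse_rel(X, W, H):
    """Relative error ||X - sum_l W_l H S^(l-1)||_F / ||X||_F,
    where W is a list of L factor matrices and H is K x T."""
    Xhat = sum(Wl @ shift(H, l) for l, Wl in enumerate(W))
    return np.linalg.norm(X - Xhat) / np.linalg.norm(X)

rng = np.random.default_rng(1)
W = [rng.uniform(size=(6, 3)) for _ in range(4)]  # L = 4
H = rng.uniform(size=(3, 20))
X = sum(Wl @ shift(H, l) for l, Wl in enumerate(W))

assert mse_rel(X, W, H) < 1e-12  # exact factors give zero error
```

For real data the error never reaches zero; the quantity is used below only to compare how quickly different initializations drive it down.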

We observe that when either Mult or ANLS is initialized with the output of LECS, it converges to a local minimum in fewer iterations. This suggests that LECS finds a solution near some local minimum and could be used to accelerate algorithms on very large datasets. In fact, when used as an initialization for ANLS, LECS allows us to obtain a high-quality solution faster than random initializations. For Mult, using LECS as an initialization does not perform as well because, on average and after sufficiently many iterations, the solutions generated by random initializations have lower reconstruction error. We suspect the reason is that Mult does not deal well with input matrices containing many small entries (which LECS generates) because of the zero-locking phenomenon: zero entries cannot be modified by Mult, while it takes many iterations for small entries to grow; see for example the discussion in [13] and the references therein. In fact, with the LECS initialization, ANLS reduces the relative error to 56.6% after 15 iterations, while MU reduces it to only 58.4% after 60 iterations. Note also that ANLS performs better on average than MU, with an average relative error of 56.6% for ANLS vs. 57.7% for MU; this was already observed in [7].

Moreover, despite a high relative mean square error, Figure LABEL:fig:_song-qual demonstrates that LECS outputs factors and a reconstruction with qualitative similarities to the ground truth. In fact, after just a single iteration of the ANLS algorithm, the factors visibly mimic the sequential structure in the dataset. Note that, as opposed to the experiments on the synthetic data set performed in the previous section, we cannot assess the quality of the generated factors as the ground truth sources are not available.

## 6 Conclusion

In this paper we have presented a provably correct and robust algorithm for CNMF under separability conditions. By addressing the questions of identifiability and provable recovery of the solution, this paper offers a novel theoretical perspective on the CNMF problem. Our algorithm draws directly from methods for separable NMF and is easily generalizable: the high-level algorithm does not require any particular method for each of its four steps.

Future work includes addressing two significant weaknesses of our approach. First, the location step does not leverage the convolutive structure of CNMF (and could therefore possibly be made more tolerant to noise). Second, the method cannot handle the case when some has repeated columns, which is relevant when processing music datasets that often contain pure harmonic tones.

## Appendix A Appendix

### A.1 Conic SPA Procedure

As mentioned in the introduction, SPA is a sequential algorithm for separable NMF that identifies the columns of among the columns of . At each step, it first extracts the column of with the largest norm, and then projects all columns of onto the orthogonal complement of the extracted column; see Algorithm 2. Under the assumptions that is full column rank and the columns of have unit norm, SPA recovers the set , even in the presence of noise. In this section we generalize SPA from an algorithm for finding the vertices of a polytope (since the columns of are required to have unit norm) to an algorithm for finding the extreme rays of a cone. In the noiseless case, this generalization follows relatively straightforwardly by an appropriate rescaling of . However, when noise is present, extra steps must be taken to ensure that points consisting primarily of noise do not grow too large and get mistaken for a vertex. In fact, the robustness analysis of SPA assumes from the outset that the input noisy data matrix has the form , where the columns of are normalized, which is key for SPA to succeed. If is not normalized, then normalization of the input matrix is necessary, and this might increase the noise drastically, for example for the columns of with very small norm. Therefore, to make SPA robust to noise in this scenario, we first need to remove the columns of with small norm.
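The greedy selection-and-projection step of plain SPA can be sketched compactly. The following is an illustrative Python version (the paper's code is Julia, and Algorithm 2 is the authoritative statement), run on a toy separable matrix whose first columns are the vertices:

```python
import numpy as np

def spa(X, r):
    """Successive projection algorithm: greedily pick the column of
    largest norm, then project all columns onto the orthogonal
    complement of the pick. Returns the r extracted column indices."""
    R = X.astype(float).copy()
    idx = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))
        idx.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(u, u @ R)  # project out the direction u
    return idx

# Separable example: X = V @ H with H containing the identity,
# so the columns of V appear among the columns of X.
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 2.0]])
H = np.array([[1.0, 0.0, 0.3, 0.6],
              [0.0, 1.0, 0.7, 0.4]])  # columns sum to one
X = V @ H
print(sorted(spa(X, 2)))  # → [0, 1]
```

The projection step is what prevents the same direction from being selected twice: after a vertex is extracted, every column collinear with it collapses to zero.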

###### Lemma A.1 (SPA).

[16, Theorem 3] Let , where has rank and satisfies for all . Additionally, suppose that for some permutation matrix and .

Let for all with , where is a global constant. Let be the index set extracted by SPA, which takes time . Then there exists a permutation of such that

$$\max_{1\le j\le R}\ \big\|\widetilde{X}[:,J(j)]-V[:,P(j)]\big\| \;\le\; C_2\,\epsilon\,\kappa(V)^2,$$

where is a global constant.

Note that Lemma A.1 makes two strong assumptions on ; first, that for all , and second, that contains the identity matrix as a submatrix. Because of the additional structure of the convolutive NMF problem, it is unreasonable to have both these assumptions simultaneously. Instead, we drop the first assumption and generalize the SPA algorithm to any such that , where is a permutation matrix and . Note that this formulation is equivalent to , where is some full rank diagonal matrix, since we can always scale the th row of by and scale the th column of by without changing . We first demonstrate conditions under which we can recover if some oracle gives us a threshold value which is greater than the noise level but smaller than .
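The preprocessing idea behind the oracle conic variant, namely discarding columns whose 1-norm falls below the threshold, normalizing the survivors, and then running plain SPA, can be sketched as follows. Function names and the test matrices are illustrative, not taken from the paper's code:

```python
import numpy as np

def spa(X, r):
    """Plain SPA: greedy max-norm pick plus orthogonal projection."""
    R = X.astype(float).copy()
    idx = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))
        idx.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(u, u @ R)
    return idx

def conic_spa(X, r, t):
    """Sketch of the oracle conic SPA idea: discard columns whose
    1-norm falls below the threshold t (they may be mostly noise),
    normalize the survivors to sum to one, then run plain SPA."""
    norms = X.sum(axis=0)            # 1-norms of nonnegative columns
    keep = np.flatnonzero(norms >= t)
    Xn = X[:, keep] / norms[keep]    # column-stochastic survivors
    return [int(keep[j]) for j in spa(Xn, r)]

rng = np.random.default_rng(2)
V = rng.uniform(0.5, 1.0, size=(6, 3))
H = np.hstack([np.diag([2.0, 0.5, 1.5]), rng.uniform(size=(3, 7))])
X = V @ H                       # separable, with unequal column scales
X = np.hstack([X, 1e-3 * rng.uniform(size=(6, 2))])  # near-zero noise columns
print(sorted(conic_spa(X, 3, t=0.1)))  # indices of the pure columns: [0, 1, 2]
```

Normalization maps each conic combination of the extreme rays to a convex combination of their normalized versions, which is exactly the polytope setting in which plain SPA is known to succeed; the thresholding step keeps near-zero columns from being blown up by that normalization.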

###### Lemma A.2 (Oracle Conic SPA).

Let , where has rank and for some permutation matrix and . Denote the th column of and , respectively. Let for all and suppose we know some such that . Assume

 ϵt

Let be the index set extracted by OrConSPA with inputs . Then there exists a permutation of such that

$$\max_{1\le j\le r}\ \big\|(\widetilde{X}D_{\widetilde{X}})[:,J(j)]-(VD_V)[:,P(j)]\big\|_2 \;\le\; 2C_2\,\epsilon t^{-1}\,\kappa(VD_V)^2, \quad\text{and} \tag{6}$$
$$\max_{1\le j\le r}\ \big\|\widetilde{X}[:,J(j)]-V[:,P(j)]\big\|_2 \;\le\; \big(\|V\|_{1,\mathrm{col}}+\epsilon\big)\,2C_2\,\epsilon t^{-1}\,\kappa(VD_V)^2. \tag{7}$$
###### Proof.

Let be the th column of , and allow the same notation for all other matrices. Since the oracle has given us (where ), we know that any column with cannot satisfy for any . Therefore, we can define as a diagonal matrix given by

$$Y_{jj} = \begin{cases} 0 & \text{if } \|\tilde{x}_j\|_1 < t, \\ 1/\|\tilde{x}_j\|_1 & \text{otherwise.} \end{cases}$$

Now define . Consider the matrices , , and . It must be that the columns of sum to one, since the columns of and sum to one. Furthermore, columns of with only a single non-zero entry will only be scaled by some scalar , so for some and permutation matrix . Therefore, all satisfy the conditions of Lemma A.1; if we wish to apply SPA to identify from , all that remains is to bound the noise . To bound the noise, we note that for any non-zero column of , we have , and therefore we have the following bound for