Learning Mixtures of Sparse Linear Regressions Using Sparse Graph Codes

Dong Yin, Department of Electrical Engineering and Computer Sciences, UC Berkeley
Ramtin Pedarsani, Department of Electrical and Computer Engineering, UC Santa Barbara
Yudong Chen, School of Operations Research and Information Engineering, Cornell University
Kannan Ramchandran, Department of Electrical Engineering and Computer Sciences, UC Berkeley

In this paper, we consider the mixture of sparse linear regressions model. Let be unknown sparse parameter vectors with a total of non-zero coefficients. Noisy linear measurements are obtained in the form , each of which is generated randomly from one of the sparse vectors with the label unknown. The goal is to estimate the parameter vectors efficiently with low sample and computational costs. This problem presents significant challenges as one needs to simultaneously solve the demixing problem of recovering the labels as well as the estimation problem of recovering the sparse vectors .

Our solution to the problem leverages the connection between modern coding theory and statistical inference. We introduce a new algorithm, Mixed-Coloring, which samples the mixture strategically using query vectors constructed based on ideas from sparse graph codes. Our novel code design allows for both efficient demixing and parameter estimation. The algorithm achieves the order-optimal sample and time complexities of in the noiseless setting, and near-optimal complexities in the noisy setting. In one of our experiments, to recover a mixture of two regressions with dimension and sparsity , our algorithm is more than times faster than the EM algorithm, with about of its sample cost.

1 Introduction

Mixture and latent variable models, such as Gaussian mixtures and subspace clustering, are expressive, flexible, and widely used in a broad range of problems including background modeling [1], speaker identification [2] and recommender systems [3]. However, parameter estimation in mixture models is notoriously difficult due to the non-convexity of the likelihood functions and the existence of local optima. In particular, it often requires a large sample size and many re-initializations of the algorithms to achieve an acceptable accuracy.

Our goal is to develop provably fast and efficient algorithms for mixture models, with sample and time complexities sublinear in the problem’s ambient dimension when the parameter vectors of interest are sparse, by leveraging the underlying low-dimensional structures.

In this paper we focus on a powerful class of models called mixtures of linear regressions [4]. We consider the sparse setting with a query-based algorithmic framework. In particular, we assume that each query-measurement pair is generated from a sparse linear model chosen randomly from possible models (footnote 1: we use to denote the conjugate transpose of , and the set of integers .):


where is noise. The total number of nonzero elements in the parameter vectors is assumed to be . The goal is to estimate the ’s, without knowing which generates each query-measurement pair.
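The measurement model can be simulated with a minimal sketch like the one below. All names (`inner`, `sample_measurement`, `betas`, `weights`) are illustrative assumptions rather than notation from the paper. The inner product is written without conjugation for simplicity, and real-valued Gaussian noise is assumed.

```python
import random

def inner(x, beta):
    """Plain inner product sum_j x_j * beta_j (conjugation omitted for simplicity)."""
    return sum(xj * bj for xj, bj in zip(x, beta))

def sample_measurement(x, betas, weights, noise_std=0.0, rng=random):
    """Return (y, label): one measurement from a randomly chosen sparse model.

    The label is returned only so the simulation can be inspected; a learner
    would observe y alone.
    """
    label = rng.choices(range(len(betas)), weights=weights, k=1)[0]
    noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
    return inner(x, betas[label]) + noise, label

# Two sparse parameter vectors with two non-zero entries in total (toy values).
betas = [[0, 2.0, 0, 0], [0, 0, 0, -1.5]]
x = [1.0, 1.0, 1.0, 1.0]
y, hidden_label = sample_measurement(x, betas, weights=[0.5, 0.5])
```

In this toy setup the learner sees only `y` and must infer both which vector produced it and the entries of that vector.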

A mixture of regressions provides a flexible model for various heterogeneous settings where the regression coefficients differ for different subsets of observations. This model has been applied to a broad range of tasks including medicine measurement design [5], behavioral health care [6] and music perception modeling [7]. Here, we study the problem when the query vectors can be designed by the user; in Section 1.2 we discuss several practical applications that motivate the study of this query-based setting. Our results show that by appropriately exploiting this design freedom, one can achieve a significant reduction in the sample and computational costs.

To recover unknown non-zero elements, it is clear that the number of measurements and the time required scale at least as . We introduce a new algorithm, called the Mixed-Coloring algorithm, that matches these sublinear sample and time complexity lower bounds. The design of query vectors and decoding algorithm leverages ideas from sparse graph codes such as low-density parity-check (LDPC) codes [8]. Our algorithm recovers the parameter vectors with optimal sample and time complexities in the noiseless setting, both in theory and empirically, and is stable under noise with near-optimal sample and time complexities. Prior literature on this problem that does not utilize the design freedom typically has sample/time complexities that are at least polynomial in ; we provide a survey of prior work and a more detailed comparison in Section 6. Empirically, we find that our algorithm is orders of magnitude faster than standard Expectation-Maximization (EM) algorithms for mixture of regressions. For example, in one of our experiments, detailed in Section 5, we consider recovering a mixture of two regressions with dimension and sparsity ; our algorithm is more than times faster than the EM algorithm, with about of its sample cost.

1.1 Algorithm Overview

Our Mixed-Coloring algorithm solves two problems simultaneously: (i) rapid demixing, namely identifying the label of the vector that generates each measurement ; (ii) efficient identification of the location and value of the non-zero elements of the ’s. The main idea is to use a divide-and-conquer approach that iteratively reduces the original problem into simpler ones with much sparser parameter vectors. More specifically, we design sets of sparse query vectors, with each set only associated with a subset of all the non-zero elements. The design of the query vectors ensures that we can first identify the sets which are associated with a single non-zero element (called singletons), and recover the location and value of that element (we call them singleton balls, shown as shaded balls in Figure 1(b)). We further identify the pairs of singleton balls which have the same (but unknown) label, indicated by the edges in Figure 1(b). Results from random graph theory guarantee that, with high probability, the largest connected components (giant components) of the singleton graph have different labels, and thus we recover a fraction of the non-zero elements in each , as shown in Figure 1(c). We can then iteratively enlarge the recovered fraction with a guess-and-check method until finding all the non-zero elements. We revisit Figure 1 when describing the details of our algorithm in Section 3.

(a) Non-zero elements (b) Singleton balls (c) Giant components (d) Results
Figure 1: Mixed-Coloring algorithm with .

1.2 Motivation

Our problem is a natural extension of the setting of compressive sensing (footnote 2: compressive sensing is a special case of our problem with .), in which one often has full freedom of designing query vectors in order to estimate a sparse parameter vector. In many applications, the unknown sparse parameter vector can be affected by latent variables, leading to a mixture of sparse linear regressions, and these scenarios have been observed in neuroscience [9], genetics [10], psychology [5], etc. Here, we provide a concrete example motivated by neuroscience applications [9]. In neural signal processing, sensors are used to measure the brain activities, represented by an unknown sparse vector . The sensors can be modeled as digital filters, and one can design the linear filter weights (’s) when measuring the neural signal. Multiple sensors are usually placed in a particular area of the brain in order to acquire enough compressed measurements. However, there may be more than one neuron affecting a particular area of the brain, as shown in Figure 2, and each neuron may have different activities, corresponding to a different . Consequently, each sensor may be measuring one of several different sparse signals, which can be formulated as a mixture-of-sparse-linear-regressions problem. Variants of this problem, such as neural spike sorting [9], have been studied in neuroscience. While the common solution is to use clustering algorithms on the spike signals, we believe that our algorithm has the potential to improve sensor design and reduce sample and time complexities.

Figure 2: Mixture of neural signals.

In addition, our work demonstrates the power of design freedom in tackling sparse mixture problems by highlighting the huge performance gap between algorithms that can exploit the design freedom and those that cannot. We also believe that our ideas are applicable more broadly to other latent-variable problems that require experimental designs, such as survey designs in psychology with mixed types of respondents and biology experiments with mixed cell interior environments.

2 Main Results

In this section, we present the recovery guarantees for the Mixed-Coloring algorithm, and provide bounds on its sample and time complexities. We assume there are unknown -dimensional parameter vectors . Each has non-zero elements, i.e., . Let be the total number of non-zero elements. Using the query vectors , the Mixed-Coloring algorithm obtains measurements , generated independently according to the model (1), and outputs an estimate , of the unknown parameter vectors. We defer more details to Sections 3 and 4.

Our results are stated in the asymptotic regime where and approach infinity. A constant is a quantity that does not depend on and , with the associated Big-O notations and . We assume that is a known and fixed constant, and the mixture weights satisfy for each and thus are of the same order. Similarly, the sparsity levels of the parameter vectors are also of the same order with .

2.1 Guarantees for the Noiseless Setting

In the noiseless case, i.e., , we consider for generality the complex-valued setting with (our results can be easily applied to the real case). We make a mild technical assumption, which stipulates that if any pair of parameter vectors have overlapping support, then the elements in the overlap are different.

Assumption 1.

For each pair , and each index , we have .

Under the above setting, we have the following recovery guarantees for the Mixed-Coloring algorithm.

Theorem 1.

Consider the asymptotic regime where and approach infinity. Under Assumption 1, for any fixed constant , there exists a constant such that if the number of measurements is , then the Mixed-Coloring algorithm satisfies the following three properties for each (up to a label permutation):

  1. (No False Discovery) For each , equals either or ; for each , .

  2. (Element-wise Recovery) There exists a constant such that for each .

  3. (Support Recovery)

Moreover, the computational time of the Mixed-Coloring algorithm is .

The theorem ensures that the Mixed-Coloring algorithm has no false discovery, and recovers fraction of the non-zero elements with high probability. The error fraction is an input parameter to the algorithm, and can be made arbitrarily close to zero by adjusting the oversampling ratio . (By more careful analysis, one can show that the dependence of on is . Here, since we set as a constant, is a constant.) Given the number of components , mixture weights and the target , the value of the constant can be computed numerically. The table below gives some of the values for several and , under the setting . We see that the value of is quite modest.

Table 1: Sample complexity of the Mixed-Coloring algorithm

We can in fact boost the above guarantee to recover all the non-zero elements, by running the Mixed-Coloring algorithm times independently and aggregating the results by majority voting. By property 2 in Theorem 1 and a union bound argument, this procedure exactly recovers all the parameter vectors with probability with sample and time complexities.
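The aggregation step can be illustrated with a minimal sketch. It assumes that each run returns, for one fixed label, a dictionary mapping coordinates to recovered values, that labels have already been aligned across runs, and that noiseless values from different runs compare equal exactly. The function name `majority_vote` and the data layout are hypothetical.

```python
from collections import Counter

def majority_vote(estimates):
    """Combine per-run estimates {coordinate: value} by a coordinate-wise vote.

    A coordinate missing from a run counts as a vote for zero.
    """
    coords = set().union(*estimates)
    result = {}
    for c in coords:
        votes = Counter(est.get(c, 0) for est in estimates)
        value, _ = votes.most_common(1)[0]
        if value != 0:
            result[c] = value
    return result

# Three independent runs; the second run missed the element at coordinate 4.
runs = [{1: 2.0, 4: -1.0}, {1: 2.0}, {1: 2.0, 4: -1.0}]
combined = majority_vote(runs)
```

Because each run has no false discovery, a missed element shows up only as a zero vote, which the majority overrides once enough runs recover it.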

2.2 Guarantees for the Noisy Setting

An extension of the previous algorithm, Robust Mixed-Coloring, handles noise in the measurement model (1). Here we focus on the case with two parameter vectors which appear equally likely, i.e., and , . Many interesting applications have binary latent factors: gene mutation present/absent, gender, healthy/sick individual, child/adult, etc. The noise is assumed to be i.i.d. Gaussian with mean zero and constant variance . For the purpose of theoretical analysis, we assume that the non-zero elements in the parameter vectors take values in a finite quantized set.

Assumption 2.

The non-zero elements of the parameter vectors satisfy , where

The positive constants and are known to the algorithms.

As shown in our empirical results in Section 5, the Robust Mixed-Coloring algorithm works even when the assumption is violated. In this case, the algorithm produces the best quantized approximation to the unknown parameter vectors, provided that they are not too far off the quantized set. Theoretical analysis of the continuous alphabet setting remains an open problem, and the tools in recent work such as [11] may be applicable to it.

When the quantization assumption holds, exact recovery is possible, as guaranteed in the theorem below. The Robust Mixed-Coloring algorithm maintains sublinear sample and time complexities, and recovers the parameter vectors in the presence of noise with bounded variance.

Theorem 2.

Consider the asymptotic regime where and approach infinity with for some constant . When and Assumptions 1 and 2 hold, there exists a constant , such that if and the number of measurements is , then the Robust Mixed-Coloring algorithm satisfies the three properties in Theorem 1. Moreover, the computational time of the Robust Mixed-Coloring algorithm is .

Similar to the noiseless case, by running the Robust Mixed-Coloring algorithm times, one can exactly recover the two parameter vectors with probability . In this case, the sample and computational complexities are , and further, since we assume that for some constant , we can still conclude that the sample and computational complexities for full recovery are .

3 Mixed-Coloring Algorithm for Noiseless Recovery

In this section, we provide details of the Mixed-Coloring algorithm in the noiseless setting. We first provide some primitives that serve as important ingredients in the algorithm, and then describe the design of query vectors and decoding algorithm in detail.

3.1 Primitives

The algorithm makes use of four basic primitives: summation check, indexing, peeling, and guess-and-check, which are described below.

Summation Check: Suppose that we generate two query vectors and independently from some continuous distribution on , and a third query vector of the form . Let , , and be the corresponding measurements. We then check the sum of the measurements: in the noiseless case, if , then with probability one these three measurements are generated from the same parameter vector . In this case we call a consistent pair of measurements, as they are from the same (the third measurement is now redundant).
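A minimal sketch of the summation check follows, assuming the third query is the sum of the first two (as stated in Section 3.2) and using real Gaussian queries for brevity. The names `inner`, `summation_check`, and the tolerance `tol` are illustrative assumptions; the tolerance stands in for exact equality under floating-point arithmetic.

```python
import random

def inner(x, beta):
    """Plain inner product (conjugation omitted for simplicity)."""
    return sum(xj * bj for xj, bj in zip(x, beta))

def summation_check(y1, y2, y3, tol=1e-9):
    """True when y3 equals y1 + y2 up to a numerical tolerance."""
    return abs(y3 - (y1 + y2)) <= tol

rng = random.Random(0)
n = 6
beta_a = [0, 1.0, 0, 0, -2.0, 0]      # toy sparse parameter vectors
beta_b = [3.0, 0, 0, 0.5, 0, 0]
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [a + b for a, b in zip(x1, x2)]  # third query is the sum of the first two

# All three measurements come from beta_a: the check passes.
same = summation_check(inner(x1, beta_a), inner(x2, beta_a), inner(x3, beta_a))
# The third measurement comes from beta_b: the check fails almost surely.
mixed = summation_check(inner(x1, beta_a), inner(x2, beta_a), inner(x3, beta_b))
```

The check relies on linearity: when one parameter vector generates all three measurements, the third is the sum of the first two, while a coincidental match across different vectors has probability zero under a continuous query distribution.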

Indexing: The indexing procedure is to find the locations and values of the non-zero elements by carefully designed query vectors. In the noiseless case, this can be done by a suitably designed ratio test. We sketch the idea of the ratio test here. Consider a consistent pair of measurements and corresponding query vectors . We design the query vectors such that the information of the locations of the non-zero elements is encoded in the relative phase between and . In particular, we generate i.i.d. random variables uniformly distributed on the unit circle. Letting where is the imaginary unit, we set the -th entries of and to be either , or and . (The locations of the zeros are determined using sparse-graph codes and discussed later.) Below is an example of such a consistent pair of measurements and the corresponding linear system:


Suppose that is -sparse and of the form . There is only one non-zero element, , that contributes to the measurements and . In this case the consistent measurement pair is called a singleton. A singleton can be detected by testing the integrality of the relative phase of the ratio . In the above example, since and , we observe that and the relative phase is an integral multiple of . We therefore know that with probability one, this consistent pair is a singleton, and moreover the corresponding non-zero element is located at the -rd coordinate with value . We would like to remark that the indexing step can also be done using real-valued query vectors.
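The ratio test can be sketched as follows. The sketch assumes that, at each coordinate `j` covered by the bin, the first query has a unit-modulus entry `r[j]` and the second has `r[j]` multiplied by the `j`-th power of a primitive `n`-th root of unity, so that a singleton at coordinate `j` produces a ratio whose phase is `j` times `2*pi/n`. This encoding, the zero-based coordinates, and the names `ratio_test` and `measure` are assumptions made for illustration.

```python
import cmath
import math
import random

def ratio_test(y1, y2, r, n, tol=1e-6):
    """Return (location, value) if (y1, y2) looks like a singleton, else None.

    r maps each coordinate covered by the bin to its known entry r_j.
    """
    if abs(y1) <= tol:
        return None                              # nothing measurable in this pair
    ratio = y2 / y1
    if abs(abs(ratio) - 1.0) > tol:
        return None                              # singleton ratios lie on the unit circle
    k = cmath.phase(ratio) * n / (2 * math.pi)   # an integer for a singleton
    loc = round(k)
    if abs(k - loc) > tol:
        return None
    loc %= n
    if loc not in r:
        return None                              # location not covered by this bin
    return loc, y1 / r[loc]

rng = random.Random(1)
n = 8
r = {j: cmath.exp(1j * rng.uniform(0, 2 * math.pi)) for j in (1, 3, 6)}

def measure(beta):
    """Noiseless pair (y1, y2) for a sparse vector given as {coordinate: value}."""
    y1 = sum(r[j] * v for j, v in beta.items() if j in r)
    y2 = sum(r[j] * cmath.exp(1j * 2 * math.pi * j / n) * v
             for j, v in beta.items() if j in r)
    return y1, y2

singleton = ratio_test(*measure({3: 2 - 1j}), r, n)
multiton = ratio_test(*measure({1: 1.0, 6: -0.5 + 2j}), r, n)
```

With one contributing element the test returns its location and value; with two contributing elements the ratio is generic and the test rejects the pair.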

Peeling: The third ingredient of the decoder is peeling, i.e., iteratively reducing the problem by subtracting off recovered elements, in a Gaussian elimination-like manner. In the example above, suppose instead that is -sparse, i.e., , in which case the consistent pair


is associated with two non-zero elements of . If in a previous iteration of the algorithm we have recovered the location and value of , then we can subtract/peel off this recovered element by , for .

The updated measurement pairs satisfy , and we have reduced the problem to a simpler form. In fact, in this case the pair becomes a singleton, to which the above ratio test can be applied to recover .
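A minimal sketch of the peeling step, under the same assumed query encoding as in the ratio test (unit-modulus entry `r[loc]` in the first query, multiplied by a root-of-unity power in the second). The helper names `entries` and `peel` and the toy values are hypothetical.

```python
import cmath
import math
import random

def entries(loc, r, n):
    """Assumed entries of the two index queries at coordinate loc."""
    return r[loc], r[loc] * cmath.exp(1j * 2 * math.pi * loc / n)

def peel(y1, y2, loc, value, r, n):
    """Subtract the contribution of a recovered element from a consistent pair."""
    a1, a2 = entries(loc, r, n)
    return y1 - a1 * value, y2 - a2 * value

rng = random.Random(2)
n = 8
r = {j: cmath.exp(1j * rng.uniform(0, 2 * math.pi)) for j in (2, 5)}
beta = {2: 1.5, 5: -0.7 + 0.2j}       # two non-zero elements fall in this bin

y1 = sum(entries(j, r, n)[0] * v for j, v in beta.items())
y2 = sum(entries(j, r, n)[1] * v for j, v in beta.items())

# Peel off the element at coordinate 2, assumed recovered in an earlier iteration.
z1, z2 = peel(y1, y2, 2, beta[2], r, n)

# The residual pair is a singleton: its phase reveals the remaining location.
recovered_loc = round(cmath.phase(z2 / z1) * n / (2 * math.pi)) % n
recovered_val = z1 / r[recovered_loc]
```

After the subtraction only the element at coordinate 5 contributes, so the residual pair can be indexed exactly like a singleton.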

Guess-and-check: The ratio test and peeling steps can be combined to detect that two non-zero elements are from the same parameter vector. In the previous example (3), suppose instead that we recovered two elements and in previous iterations via ratio-testing another two consistent pairs that are singletons, but the values of their labels and are unknown. We can still try to peel off from ; if the updated measurements pass the ratio test and recover a non-zero element with location and value , then we know that with probability one the non-zero elements and must come from the same parameter vector (the one that generates ), i.e., . In this case the peeling step is valid.
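Guess-and-check can be sketched by combining the two previous primitives: peel one candidate element from a consistent pair and test whether the residual is a singleton that matches the other candidate. The query encoding is the same assumption as before, and `same_label`, `entries`, and `ratio_test` are illustrative names.

```python
import cmath
import math
import random

def entries(loc, r, n):
    """Assumed entries of the two index queries at coordinate loc."""
    return r[loc], r[loc] * cmath.exp(1j * 2 * math.pi * loc / n)

def ratio_test(y1, y2, r, n, tol=1e-6):
    """Return (location, value) for a singleton pair, else None."""
    if abs(y1) <= tol:
        return None
    ratio = y2 / y1
    k = cmath.phase(ratio) * n / (2 * math.pi)
    loc = round(k)
    if abs(abs(ratio) - 1.0) > tol or abs(k - loc) > tol or loc % n not in r:
        return None
    return loc % n, y1 / r[loc % n]

def same_label(ball_a, ball_b, pair, r, n, tol=1e-6):
    """Guess that ball_a belongs to the pair, peel it, and check for ball_b."""
    (loc_a, val_a), (loc_b, val_b) = ball_a, ball_b
    a1, a2 = entries(loc_a, r, n)
    found = ratio_test(pair[0] - a1 * val_a, pair[1] - a2 * val_a, r, n, tol)
    return found is not None and found[0] == loc_b and abs(found[1] - val_b) <= tol

rng = random.Random(3)
n = 16
r = {j: cmath.exp(1j * rng.uniform(0, 2 * math.pi)) for j in (3, 9)}
ball_a, ball_b = (3, 1 + 1j), (9, -2.0)   # both from the first parameter vector
ball_c = (9, 0.5j)                         # same location, second parameter vector

# Consistent pair generated by the first parameter vector on this bin.
pair = tuple(sum(entries(loc, r, n)[i] * val for loc, val in (ball_a, ball_b))
             for i in (0, 1))

ab_match = same_label(ball_a, ball_b, pair, r, n)
ac_match = same_label(ball_a, ball_c, pair, r, n)
```

Here `ball_c` shares a location with `ball_b` but has a different value, which is the situation Assumption 1 guarantees for overlapping supports, so the check separates the two labels.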

The continuing execution of these four primitives is made possible by the design of the query vectors using sparse-graph codes, which we describe next.

3.2 Design of Query Vectors

Figure 3: query vectors.

As illustrated in Figure 3, we construct sets of query vectors (called bins). The query vectors in each bin are associated with some coordinates of the parameter vectors (i.e., the queries are non-zero only on those coordinates). The association between the coordinates and bins is determined by a -left regular bipartite graph with left nodes (coordinates) and right nodes (bins), where each left node is connected to right nodes chosen independently uniformly at random. Each bin consists of three query vectors. The values of the non-zero elements of the first two query vectors are in the form of (2), enabling the ratio test. The third query vector equals the sum of the first two and is used for the summation check.

If the query vectors in each bin were used only once, then we would have very few bins passing the summation check and hence few consistent pairs. Instead, we use the first two query vectors repeatedly for times, obtaining two sets of measurements, each of size and called type-I and type-II index measurements. We use the third query vector times to obtain a set of verification measurements. We therefore have measurements associated with each of the bins, hence a total of measurements, as shown in Figure 3. Using density evolution methods [12], we can find proper values of , , , and such that successful recovery is guaranteed.
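The bin construction can be sketched as below. The sketch assumes that each coordinate is attached to `d` distinct bins drawn uniformly at random (whether repeats are allowed is not specified above), and it reuses the assumed phase encoding for the first two queries. The names `build_bins` and `bin_queries` and the parameter values are hypothetical; the repetition of each query is omitted.

```python
import cmath
import math
import random

def build_bins(n, num_bins, d, rng):
    """Left-regular bipartite graph: each coordinate joins d distinct random bins."""
    bins = [set() for _ in range(num_bins)]
    for coord in range(n):
        for b in rng.sample(range(num_bins), d):
            bins[b].add(coord)
    return bins

def bin_queries(support, n, rng):
    """Three sparse queries for one bin, stored as {coordinate: entry} dicts."""
    x1, x2, x3 = {}, {}, {}
    for loc in support:
        r = cmath.exp(1j * rng.uniform(0, 2 * math.pi))      # random unit-modulus entry
        x1[loc] = r
        x2[loc] = r * cmath.exp(1j * 2 * math.pi * loc / n)  # location encoded in the phase
        x3[loc] = x1[loc] + x2[loc]                           # used for the summation check
    return x1, x2, x3

rng = random.Random(4)
n, num_bins, d = 20, 6, 3
bins = build_bins(n, num_bins, d, rng)
x1, x2, x3 = bin_queries(bins[0], n, rng)
```

Storing each query as a dictionary keeps the cost proportional to the bin's support rather than to the ambient dimension.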

3.3 Decoding Algorithm

The decoding algorithm first finds consistent pairs (by summation check) in each bin, within which singletons are identified (by the ratio test). The ratio test also recovers the location and values of several non-zero elements, some of which can then be associated with the same by guess-and-check. At this point, for each , we have recovered some of its non-zero elements (including their locations, values and labels). These steps are then repeated iteratively via peeling until no more non-zero elements can be found. Below we elaborate on these steps.

Finding Consistent Pairs:

The decoding procedure starts by finding all the consistent pairs. In each bin, we perform summation checks on all triplets in which , , and are the type-I index measurement, type-II index measurement and verification measurement, respectively. If a triplet passes the summation check, then a consistent pair is found. Note that in each bin the number of triplets of the above form is a constant, so this step can be done in time. The subsequent steps of the algorithm are based on the consistent pairs found in this step.
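A minimal sketch of this per-bin search, assuming the three measurement sets of a bin are given as lists of complex numbers. The function name `consistent_pairs`, the tolerance, and the toy measurement values are illustrative assumptions.

```python
from itertools import product

def consistent_pairs(type1, type2, verification, tol=1e-9):
    """Return the set of (y1, y2) pairs whose sum matches a verification measurement."""
    pairs = set()
    for y1, y2 in product(type1, type2):
        if any(abs(y3 - (y1 + y2)) <= tol for y3 in verification):
            pairs.add((y1, y2))
    return pairs

# Toy bin: measurements generated by two different parameter vectors, A and B.
a1, a2, a3 = 1 + 2j, 0.5 - 1j, 1.5 + 1j   # a3 = a1 + a2
b1, b2, b3 = -3 + 0j, 2j, -3 + 2j         # b3 = b1 + b2
type1 = [a1, b1, a1]                       # repeated use of the first query
type2 = [b2, a2]                           # repeated use of the second query
verification = [a3, b3, b3]                # repeated use of the third query

found = consistent_pairs(type1, type2, verification)
```

Since each set has constant size, the triple search costs constant time per bin.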

Recovering a Subset of Non-zero Elements:

Each non-zero element of the parameter vectors can be identified by its label-location-value triplet . We visualize these triplets (i.e., non-zero elements) as balls, as shown in Figure 1(a), and initially their labels, locations and values are unknown. As before, a consistent pair associated with only one non-zero element is called a singleton, and we call this non-zero element a singleton ball. We run the ratio test on the consistent pairs to identify singletons and their associated singleton balls. The singleton balls found are illustrated in Figure 1(b) as shaded balls. The ratio test also recovers the locations and values of these singleton balls, although at this point we do not know the label of the balls.

The next step is crucial: For two singleton balls and a consistent measurement pair associated with the locations of these two balls, we run the guess-and-check operations to detect if these two singleton balls indeed have the same label (or equivalently, if the two non-zero elements are in the same parameter vector). If so, we connect these two balls with an edge, as shown in Figure 1(b). Doing so creates a graph over the balls (i.e., non-zero elements), and each connected component of the graph is from a single parameter vector. Since each non-zero element is associated with a constant number of consistent pairs (due to using a -left regular bipartite graph with constant ), this step can in fact be done efficiently in time without enumerating all the combinations of singleton ball pairs.

By carefully choosing the parameters , , , and , and using tools from random graph theory, we can ensure that with high probability the largest connected components (called giant components) correspond to the parameter vectors, and each of these components has size . The labels of the balls in these components are then identified. This is illustrated in Figure 1(c) for , where colors represent the labels. In summary, at this point we have recovered the labels, locations and values of a constant fraction of the non-zero elements (i.e., balls) of each parameter vector.
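Extracting the largest connected components of the singleton graph can be sketched with a union-find structure. Balls are assumed to be numbered from zero and edges to be the pairs accepted by guess-and-check; the names `find` and `largest_components` and the toy graph are hypothetical.

```python
from collections import Counter

def find(parent, v):
    """Return the representative of v's component, with path halving."""
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v

def largest_components(num_balls, edges, num_labels):
    """Return the num_labels largest connected components as sorted lists."""
    parent = list(range(num_balls))
    for a, b in edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb                      # merge the two components
    sizes = Counter(find(parent, v) for v in range(num_balls))
    roots = [root for root, _ in sizes.most_common(num_labels)]
    return [sorted(v for v in range(num_balls) if find(parent, v) == root)
            for root in roots]

# Eight singleton balls; ball 7 has no same-label edge and stays unlabeled for now.
edges = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6)]
giants = largest_components(8, edges, num_labels=2)
```

Each returned component receives one label; balls outside the giant components are left for the iterative decoding stage.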

Figure 4: Iterative decoding. If a ball is peeled off, the edges connected to it are shown in dashed lines. The colored balls in (b) are found by the giant component method. In (c) and (d), more balls are colored by iterative decoding.

Iterative Decoding:

The decoding procedure proceeds by identifying the labels of the remaining balls via iteratively applying the peeling and guess-and-check primitives. The connected components in Figure 1(c) are therefore expanded, until no more changes can be made, as illustrated in Figure 1(d).

We provide an example of this iterative procedure in Figure 4. Recall that the association between the coordinates of the parameter vectors and the bins (or consistent pairs) is determined by a bipartite graph. Here, we only show one consistent pair for each bin and omit the zero elements. The non-zero elements and the consistent pairs are shown as balls and squares, respectively, as in Figure 4(a). The steps described in the last part recover a subset of these balls, which are shown in colors in Figure 4(b). Now consider the measurement pair 1, which is associated with the balls , and . As and are recovered, we can peel them off from the measurement pair 1 to recover (by the ratio test) the label, location and value of the non-zero element represented by ball . Similarly, peeling off the recovered ball from the measurement pair , recovers ball , as illustrated in Figure 4(c). We continue this process iteratively, peeling off balls recovered in the previous iterations to recover more balls. For example, we peel off the balls and from the measurement pair to recover the ball , and the ball from pair to recover ball , resulting in Figure 4(d). So far we have described the Mixed-Coloring algorithm in the noiseless case, and we refer readers to Section B of the appendices for the analysis of this algorithm.

4 Robust Mixed-Coloring Algorithm for Noisy Recovery

The overall structure of the Robust Mixed-Coloring algorithm is the same as its noiseless counterpart. In the presence of noise, the ratio test method for indexing and the summation check primitive need to be made robust, which is done by modifying the query design. In particular, we design three types of query vectors. The first type, called binary indexing vectors, encodes the location information using binary representations with bits (as opposed to using the relative phases in the noiseless case). A similar approach is considered in [13] for compressive phase retrieval. The second type is called singleton verification vectors, which are used for singleton detection. Using these two types of vectors we can modify the ratio test to achieve the same performance with noise. The third type of query vectors is used for consecutive summation check, which finds consistent sets of measurements.

In addition to the new query design, we also employ a noise reduction scheme. This is done by using each designed query vector (say ) repeatedly for times and averaging the corresponding measurements from the same . In particular, these measurements are sampled i.i.d. from a mixture of two Gaussians with centers and , so we use an EM algorithm initialized by moment methods to estimate the two centers. Using the result in [14], we prove that the EM-based noise reduction scheme succeeds under the conditions in Theorem 2, namely and . We refer the readers to Section D of the appendices for the details of the Robust Mixed-Coloring algorithm, and Section F for more details of the EM algorithm that we use.
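The noise reduction step can be illustrated with a minimal one-dimensional sketch: EM for an equal-weight mixture of two Gaussians with known standard deviation, initialized from the first two sample moments. Real-valued measurements, the specific initialization, and the names `em_two_centers` and `sigma` are assumptions for illustration, not the exact procedure analyzed in the appendices.

```python
import math
import random

def em_two_centers(samples, sigma, iters=50):
    """Estimate the two centers of an equal-weight Gaussian mixture with known sigma."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    # Moment-based start: the mixture variance is sigma^2 + (half the gap)^2.
    spread = math.sqrt(max(var - sigma ** 2, 0.0))
    mu1, mu2 = mean - spread, mean + spread
    for _ in range(iters):
        # E-step: posterior probability that each sample belongs to the first center.
        w = []
        for s in samples:
            a = -((s - mu1) ** 2) / (2 * sigma ** 2)
            b = -((s - mu2) ** 2) / (2 * sigma ** 2)
            m = max(a, b)                        # subtract the max for stability
            pa, pb = math.exp(a - m), math.exp(b - m)
            w.append(pa / (pa + pb))
        s1 = sum(w)
        s2 = len(samples) - s1
        if s1 == 0 or s2 == 0:
            break
        # M-step: weighted means of the samples.
        mu1 = sum(wi * s for wi, s in zip(w, samples)) / s1
        mu2 = sum((1 - wi) * s for wi, s in zip(w, samples)) / s2
    return mu1, mu2

rng = random.Random(5)
sigma = 0.5
samples = [rng.gauss(-2.0 if rng.random() < 0.5 else 3.0, sigma) for _ in range(400)]
mu1, mu2 = em_two_centers(samples, sigma)
```

The two estimated centers play the role of the denoised measurements that the two parameter vectors would produce for the repeated query.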

5 Experimental Results

In this section, we test the sample and time complexities of the Mixed-Coloring algorithm in both noiseless and noisy cases to verify our theoretical results. We refer the readers to the appendices for more details of the experiments.

For the noiseless case, we use the optimal parameters from numerical calculations of the density evolution. For different values of , we record the empirical success probability and running time averaged over trials. Here, we use a sufficiently small so that the success event is equivalent to recovery of all the non-zero elements. The results are shown in Figure 5(a). The phase transition occurs at some that matches the values in Table 1 predicted by our theory. Moreover, the running time is linear in and does not depend on , as shown in Figure 5(b).

(a) Probability of success
(b) Time complexity
Figure 5: Success probability and running time in the noiseless case.

Similar experiments are performed for the noisy case using the Robust Mixed-Coloring algorithm, under the quantization assumption. Figure 6(a) shows the minimum number of queries required for 100 consecutive successes, for different and . We observe that the sample complexity is linear in and sublinear in . The running time exhibits a similar behavior, as shown in Figure 6(b). Both observations agree with the prediction of our theory.

(a) Sample complexity
(b) Time complexity
Figure 6: Sample and time complexities of Robust Mixed-Coloring algorithm.

We also compare the Mixed-Coloring algorithm with a state-of-the-art EM-style algorithm (equivalent to alternating minimization in the noiseless setting) from [15]. These comparisons are not entirely fair, since our algorithm is based on carefully designed query vectors, while the algorithm in [15] uses random design, i.e., the entries of ’s are i.i.d. Gaussian. However, this is exactly where the intellectual value of our work lies: we expose the gains available by careful design. We consider four test cases with , with the first two cases being sparse problems and the last two being relatively dense problems. We find the minimum number of queries that leads to a 100% success rate in 100 trials, and the average running time. As shown in Table 2, in both sparse and dense problems, our Mixed-Coloring algorithm is several orders of magnitude faster. As for the sample complexity, our algorithm requires a smaller number of samples in the sparse cases, while in dense problems, the sample complexity of our algorithm is within a constant factor (about 3) of that of the alternating minimization algorithm. For the noisy setting, our algorithm is most powerful in the high dimensional setting, i.e., large , due to the factors. However, in this setting, it takes an extremely long time for state-of-the-art algorithms such as [16] to converge, and thus we do not present the comparison in the noisy setting.

Table 2: Comparison of two algorithms
Figure 7: Performance of Robust Mixed-Coloring algorithm with quantization assumption violated.

We further test the Robust Mixed-Coloring algorithm when the quantization assumption is violated. For any , we define , where denotes the indicator function. This means that is the element in which is the closest one to , when . For a vector , we define . We define the perturbation of a vector as .
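The quantization map described above can be sketched as follows. The exact quantized set of Assumption 2 is not reproduced here, so the symmetric grid built from a hypothetical step size `delta` and level count `b` is only a stand-in, and the function name `quantize` is illustrative.

```python
def quantize(x, alphabet):
    """Map a non-zero x to its nearest alphabet element; keep zero at zero."""
    if x == 0:
        return 0.0
    return min(alphabet, key=lambda a: abs(a - x))

delta, b = 1.0, 3   # hypothetical step size and number of levels
alphabet = [s * k * delta for k in range(1, b + 1) for s in (1, -1)]

beta = [0.0, 1.1, -2.8, 0.0, 0.4]
quantized = [quantize(v, alphabet) for v in beta]
```

Applying `quantize` entry by entry gives the quantized version of a vector, and the gap between a vector and its quantized version is what the perturbation level controls in the experiment.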

In this experiment, we generate sparse parameter vectors , with a total number of non-zero elements. These non-zero elements are generated randomly while keeping the perturbation of the parameter vectors under a certain level by adding bounded noise to the quantized non-zero elements. We record the probability of success for different numbers of bins and different perturbation levels. Here the success event is defined as recovery of for all . The result is shown in Figure 7. We see that the Robust Mixed-Coloring algorithm works without the quantization assumption as long as the perturbations are not too large.

6 Related Work

6.1 Mixtures of Regressions

Parameter estimation using the expectation-maximization (EM) algorithm is studied empirically in [17]. In [16], an -penalized EM algorithm is proposed for the sparse setting. Theoretical analysis of the EM algorithm is difficult due to non-convexity. Progress was made in [15] and [14] under stylized Gaussian settings with dense , for which a sample complexity of is proved given a suitable initialization of EM. The algorithm uses a grid search initialization step to guarantee that the EM algorithm can find the global optimal solution, with the assumption that the query vectors are i.i.d. Gaussian distributed. The computational complexity is polynomial in . An alternative algorithm is proposed in [18], which achieves optimal sample complexity, but has high computational cost due to the use of semidefinite lifting. The algorithm in [19] makes use of tensor decomposition techniques, but suffers from a high sample complexity of . In comparison, our approach has order-optimal sample and time complexities by utilizing the potential design freedom.

6.2 Coding-theoretic Methods

Many modern error-correcting codes, such as LDPC codes and polar codes [20], with their roots in communication problems, exploit redundancy to achieve robustness, and use structural design to allow for fast decoding. These properties of codes have recently found applications in statistical problems, including graph sketching [21], sparse covariance estimation [22], low-rank approximation [23], and discrete inference [24]. Most related to our approach is the work in [25, 26, 13], which applies sparse graph codes with peeling-style decoding algorithms to compressive sensing and phase retrieval problems. In our setting we need to handle a mixture distribution, which requires more sophisticated query design and novel unmixing algorithms that go beyond the standard peeling-style decoding.

6.3 Combinatorial and Dimension Reduction Techniques

Our results demonstrate the power of strategic queries and coding-theoretic tools in mixture problems, and can be viewed as efficient linear sketching of a mixture of sparse vectors. In this sense, our work is in line with recent work that makes use of combinatorial and dimension reduction techniques in high-dimensional and large-scale statistical problems. These techniques, such as locality-sensitive hashing [27], sketching of convex optimization [28], and coding-theoretic methods [29], allow one to design highly efficient and robust algorithms applicable to computationally challenging datasets without compromising statistical accuracy.

7 Conclusions

We propose the Mixed-Coloring algorithm as a query-based learning algorithm for mixtures of sparse linear regressions. The design of the query vectors and the recovery algorithm are based on sparse graph codes, and our scheme achieves order-optimal sample and computational complexities in the noiseless case, and sublinear sample and time complexities in the presence of noise. Our experiments corroborate the theoretical results. In the noisy scenario, studying the Robust Mixed-Coloring algorithm with more than two parameter vectors and obtaining theoretical results for the continuous alphabet are two important future directions.


  • [1] M. Harville, “A framework for high-level feedback to adaptive, per-pixel, mixture-of-gaussian background models,” in Computer Vision ECCV 2002. Springer, 2002, pp. 543–560.
  • [2] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted gaussian mixture models,” Digital signal processing, vol. 10, no. 1, pp. 19–41, 2000.
  • [3] A. Zhang, N. Fawaz, S. Ioannidis, and A. Montanari, “Guess who rated this movie: Identifying users through subspace clustering,” arXiv preprint arXiv:1208.1544, 2012.
  • [4] R. De Veaux, “Mixtures of linear regressions,” Comp. Statistics & Data Analysis, vol. 8, no. 3, 1989.
  • [5] E. Blackwell, C. F. M. de Leon, and G. E. Miller, “Applying mixed regression models to the analysis of repeated-measures data in psychosomatic medicine,” Psychosomatic Medicine, vol. 68, no. 6, 2006.
  • [6] P. Deb and M. Holmes, “Estimates of use and costs of behavioural health care: a comparison of standard and finite mixture models,” Econometric Analysis of Health Data, pp. 87–99, 2002.
  • [7] K. Viele and B. Tong, “Modeling with mixtures of linear regressions,” Statistics and Computing, vol. 12, no. 4, pp. 315–330, 2002.
  • [8] R. Gallager, “Low-density parity-check codes,” IRE Transactions on information theory, vol. 8, no. 1, pp. 21–28, 1962.
  • [9] M. S. Lewicki, “A review of methods for spike sorting: the detection and classification of neural action potentials,” Network: Computation in Neural Systems, vol. 9, no. 4, pp. R53–R78, 1998.
  • [10] R. Jansen, “A general mixture model for mapping quantitative trait loci by using molecular markers,” Theoretical and Applied Genetics, vol. 85, no. 2-3, pp. 252–260, 1992.
  • [11] D. Yin, R. Pedarsani, X. Li, and K. Ramchandran, “Compressed sensing using sparse-graph codes for the continuous-alphabet setting,” 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2016.
  • [12] T. Richardson and R. Urbanke, “The capacity of low-density parity-check codes under message-passing decoding,” IEEE Transactions on Information Theory, vol. 47, pp. 599–618, February 2001.
  • [13] D. Yin, K. Lee, R. Pedarsani, and K. Ramchandran, “Fast and robust compressive phase retrieval with sparse-graph codes,” in IEEE International Symposium on Information Theory, 2015, pp. 2583–2587.
  • [14] S. Balakrishnan, M. J. Wainwright, and B. Yu, “Statistical guarantees for the EM algorithm: From population to sample-based analysis,” arXiv preprint arXiv:1408.2156, 2014.
  • [15] X. Yi, C. Caramanis, and S. Sanghavi, “Alternating minimization for mixed linear regression,” in Proceedings of The 31st International Conference on Machine Learning, 2014, pp. 613–621.
  • [16] N. Städler, P. Bühlmann, and S. Van De Geer, “$\ell_1$-penalization for mixture regression models,” Test, vol. 19, no. 2, pp. 209–256, 2010.
  • [17] S. Faria and G. Soromenho, “Fitting mixtures of linear regressions,” Journal of Statistical Computation and Simulation, vol. 80, no. 2, pp. 201–225, 2010.
  • [18] Y. Chen, X. Yi, and C. Caramanis, “A convex formulation for mixed regression with two components: Minimax optimal rates,” arXiv preprint arXiv:1312.7006, 2013.
  • [19] A. T. Chaganty and P. Liang, “Spectral experts for estimating mixtures of linear regressions,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1040–1048.
  • [20] E. Arikan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” IEEE Transactions on Information Theory, vol. 55, no. 7, 2009.
  • [21] X. Li and K. Ramchandran, “An active learning framework using sparse-graph codes for sparse polynomials and graph sketching,” in Advances in Neural Information Processing Systems, 2015, pp. 2161–2169.
  • [22] R. Pedarsani, K. Lee, and K. Ramchandran, “Sparse covariance estimation based on sparse-graph codes,” in Annual Allerton Conference on Communication, Control, and Computing, 2015.
  • [23] S. Ubaru, A. Mazumdar, and Y. Saad, “Low rank approximation using error correcting coding matrices,” in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 702–710.
  • [24] S. Ermon, C. Gomes, A. Sabharwal, and B. Selman, “Low-density parity constraints for hashing-based discrete integration,” in Proc. 31st International Conference on Machine Learning, 2014, pp. 271–279.
  • [25] X. Li, S. Pawar, and K. Ramchandran, “Sub-linear time support recovery for compressed sensing using sparse-graph codes,” arXiv preprint arXiv:1412.7646, 2014.
  • [26] R. Pedarsani, K. Lee, and K. Ramchandran, “Phasecode: Fast and efficient compressive phase retrieval based on sparse-graph-codes,” arXiv preprint arXiv:1408.0034, 2014.
  • [27] I. S. Dhillon, P. K. Ravikumar, and A. Tewari, “Nearest neighbor based greedy coordinate descent,” in Advances in Neural Information Processing Systems, 2011, pp. 2160–2168.
  • [28] M. Pilanci and M. J. Wainwright, “Iterative hessian sketch: Fast and accurate solution approximation for constrained least-squares,” arXiv preprint arXiv:1411.0347, 2014.
  • [29] D. Achlioptas and P. Jiang, “Stochastic integration via error-correcting codes,” in Proc. Uncertainty in Artificial Intelligence, 2015.
  • [30] W. B. Johnson and J. Lindenstrauss, “Extensions of lipschitz mappings into a hilbert space,” Contemporary mathematics, vol. 26, no. 189-206, p. 1, 1984.


Appendix A Details of Experiments

In this section, we provide more details of the experiments that we conducted. All simulations are done in Python on a laptop with a 2.8 GHz Intel Core i7 CPU and 16 GB of memory.

In Figure 5, we test the success probability and running time in the noiseless case. In both Figure 5(a) and Figure 5(b), we use for , for , for . In Figure 5(b), we use for , for , for .

In Table 2, we compare the sample and time complexities of the Mixed-Coloring algorithm and the alternating minimization algorithm. We use , and . The parameters for alternating minimization are chosen as suggested in the original paper [15].

In Figure 6, we test the sample and time complexities of the Robust Mixed-Coloring algorithm. In both Figure 6(a) and Figure 6(b), we choose quantization level , standard deviation of noise , algorithm parameters: , , number of singleton verification query vectors: . In Figure 6(a), we vary to find the minimum number of query vectors needed for successful recovery. In Figure 6(b), we fix and test the time cost.

In Figure 7, we test the performance of the Robust Mixed-Coloring algorithm when the quantization assumption is violated. We vary the number of bins to test the empirical probability of success, and also keep . Other parameters: , , quantization level , standard deviation of noise , number of singleton verification query vectors: , and .

Appendix B Proof of Theorem 1

B.1 Proof Outline

We prove Theorem 1 in this section. The proof consists of two major steps: (i) show that the expected fraction of non-zero elements which are not recovered can be made arbitrarily small; (ii) show that this fraction concentrates around its mean with high probability. The first part mainly uses density evolution techniques, which are commonly used in coding theory, and the second part uses a Doob martingale argument.

B.2 Notations

We briefly recall the Mixed-Coloring algorithm in the noiseless case and introduce some notation that we will use for the rest of the proof.

Recall that the parameter vector has non-zero elements. We call these non-zero elements balls in color . We design a -left regular bipartite graph with left nodes and right nodes, representing the coordinates and the bins, respectively. We denote the -th bin by . We use the matrix to represent the biadjacency matrix of the bipartite graph, i.e., if and only if the -th bin is associated with the -th coordinate. Recall that we design three query vectors for in the form of (2), for the purpose of the ratio test. The third query vector is the sum of the first two and is used for the summation check. We repeat each of the first two query vectors times, and get type-I and type-II index measurements. We repeat the third query vector times and get verification measurements. For the -th verification measurement of the -th bin, we define a sub-bin . If we can find one type-I index measurement and one type-II index measurement such that the sum of the two measurements is equal to the -th verification measurement, we know that these three measurements are generated by the same parameter vector, say . The two index measurements are called a consistent pair. Then, we say that the sub-bin has color . We define the color set of . If we can find a consistent pair corresponding to the -th verification measurement, we let , otherwise . We further define the color set of bin as .
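The search for consistent pairs described above can be sketched as follows. This is a minimal brute-force illustration under the assumption that measurement values are generic, so that a type-I and a type-II measurement from different parameter vectors do not sum to a verification measurement by coincidence. The function name `find_consistent_pairs` and the tolerance `tol` are hypothetical.

```python
def find_consistent_pairs(type1, type2, verification, tol=1e-9):
    """For each verification measurement, look for a consistent pair.

    Returns one entry per verification measurement: the indices (i, j)
    of a type-I and a type-II measurement whose sum matches it,
    or None when no such pair exists (an empty sub-bin color set).
    """
    result = []
    for v in verification:
        match = None
        for i, a in enumerate(type1):
            for j, b in enumerate(type2):
                if abs(a + b - v) <= tol:
                    match = (i, j)
                    break
            if match is not None:
                break
        result.append(match)
    return result

# Toy bin: one parameter vector yields (2.0, 3.0, sum 5.0),
# another yields (-1.5, 4.0, sum 2.5); 7.0 matches nothing.
pairs = find_consistent_pairs(
    type1=[2.0, -1.5, 2.0],
    type2=[4.0, 3.0],
    verification=[5.0, 2.5, 7.0],
)
```

Because the numbers of repetitions are constants, this exhaustive search costs a constant number of operations per bin.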

B.3 Number of Singleton Balls

In this section, we analyze the number of singleton balls in color found in the first stage of the algorithm. We can show that this number is concentrated around a constant fraction of with high probability.

Lemma 1.

Let be the number of singleton balls in color found in the first stage. Then, there exists a constant (recall that in our paper, constants are defined as quantities which do not depend on and ) such that for any constant ,


We first specify some terminology. For a bin , we say that this bin has color when . One should notice that if there is more than one sub-bin in color in bin , these sub-bins are identical. Therefore, we can say that a bin contains balls in color when has at least one sub-bin in color and the sub-bin is associated with non-zero elements in , or equivalently, the coded parameter vector satisfies , .

First, we analyze the probability that a particular bin has color . According to our model, the measurements are generated independently; therefore, we have

Then, we use to denote the probability of the event that a particular bin contains balls in color . Since each ball is associated with bins among the bins independently and uniformly at random, the number of balls in color that a bin contains is binomially distributed with parameters and , and we have

In addition, we can use the Poisson distribution to approximate the binomial distribution when is a constant and approaches infinity. In the following analysis, we will use the approximation

Consider the bipartite graph representing the association between the balls in color and the bins. We know that there are edges connected to the balls in color , and we use to denote the expected fraction of these edges which are connected to a bin which contains balls in color , . Then, we have

and equivalently, is also the probability that an edge, which is chosen from the edges uniformly at random, is connected to a bin containing balls in color .
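The Poisson approximation used above can be checked numerically. The sketch below compares the two probability mass functions for the number of balls that land in a fixed bin, assuming each ball is placed in a given bin with probability equal to the left degree divided by the number of bins. The variable names and the specific sizes are illustrative assumptions, and the sketch ignores the additional factor coming from the color of the bin.

```python
import math

def binomial_pmf(k, n, p):
    """P[Binomial(n, p) = k]."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P[Poisson(lam) = k]."""
    return math.exp(-lam) * lam**k / math.factorial(k)

num_balls = 10_000    # non-zero elements of one color (assumed value)
left_degree = 3       # bins per ball (assumed value)
num_bins = 15_000     # total number of bins (assumed value)

p = left_degree / num_bins   # chance that a given ball lands in a given bin
lam = num_balls * p          # mean number of balls per bin

# Largest pointwise difference between the two distributions.
max_gap = max(
    abs(binomial_pmf(k, num_balls, p) - poisson_pmf(k, lam))
    for k in range(10)
)
```

The gap shrinks as the number of balls grows with the mean held fixed, which is the regime considered in the analysis.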

Let be the probability that a ball in color is a singleton ball. The event that this ball is a singleton ball is equivalent to the event that at least one of its associated bins contains exactly one ball in color . Then, when approaches infinity, we have

and this is because in the limit , the correlations between the edges connected to a ball become negligible; this technique is often used in the theoretical analysis of density evolution in coding theory, and we will use this type of asymptotic argument several times in the proofs. Let be the number of singleton balls in color , then we have . Using the asymptotic argument and by Hoeffding’s inequality, we also have for any constant ,

and this means that the number of singleton balls in color is highly concentrated around . ∎
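A small Monte Carlo experiment can illustrate this concentration. The sketch below is a simplified single-color version of the setting: it ignores sub-bins and the probability that a bin carries the color, and only checks that the fraction of balls that sit alone in at least one of their bins is close to the asymptotic prediction obtained by treating the edges of a ball as independent. All names and parameter values are assumptions for illustration.

```python
import math
import random

def singleton_ball_fraction(num_balls, num_bins, left_degree, rng):
    """Fraction of balls that are alone in at least one of their bins."""
    # Each ball picks `left_degree` distinct bins uniformly at random.
    bins_of_ball = [
        rng.sample(range(num_bins), left_degree) for _ in range(num_balls)
    ]
    load = [0] * num_bins
    for bins in bins_of_ball:
        for b in bins:
            load[b] += 1
    singles = sum(
        1 for bins in bins_of_ball if any(load[b] == 1 for b in bins)
    )
    return singles / num_balls

rng = random.Random(1)
K, M, d = 4000, 6000, 3
lam = K * d / M                                  # mean number of balls per bin
empirical = singleton_ball_fraction(K, M, d, rng)
predicted = 1 - (1 - math.exp(-lam)) ** d        # independent-edge approximation
```

Repeating the run with different seeds should give values of `empirical` that stay within a few percent of `predicted`.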

B.4 Initial Fractions

We construct the graph whose nodes correspond to the singleton balls in color found in the previous stage, and analyze the number of edges in , which is equal to the number of strong doubletons in color . Then, we can show that the number of strong doubletons is concentrated around a constant fraction of with high probability.

Lemma 2.

Let be the number of strong doubletons in color found in the second stage. Then, there exists a constant such that for any constant ,


We know that the expected number of doubletons in color is . Then, we analyze the probability that a doubleton is a strong doubleton. Similar to the analysis in [26], for a particular ball in color , we let denote the event that this ball is in a singleton, and denote the event that this ball is in a doubleton. We have the conditional probability that a ball in a doubleton is also a singleton ball:

Then we know the probability that a doubleton is a strong doubleton is , and the expected number of strong doubletons in color is . Let and be the number of edges in graph . The expectation of is , and according to Hoeffding’s inequality, we have for any

meaning that the number of edges is highly concentrated around . ∎

Then, we get the following result on the size of the giant component of , using the asymptotic behavior of Erdős-Rényi random graphs.

Lemma 3.

Let be the size of the largest connected component (giant component) of . If the parameters of the Mixed-Coloring algorithm satisfy


then, for any constant , with probability , the initial fraction of the balls in color which are recovered after the second stage satisfies


where the constant is the unique solution of the equation

and other connected components in are of sizes .


This result is a direct corollary of the asymptotic behavior of Erdős-Rényi random graphs, and we only give a brief proof here. First, we condition on the number of singleton balls that we find in the first stage, i.e., and the number of edges in , i.e., . By symmetry, we know that the edges are uniformly chosen from the possible edges. Therefore, the graph is an Erdős-Rényi random graph. According to the results on the giant component of Erdős-Rényi random graphs, we know that if the limit

then with probability , the size of the giant component of graph is linear in , and other connected components have sizes . By (4) and (5), we know that for any constant , the limit lies in the interval , with probability , for some constant . Then, we can get rid of the conditioning and complete the proof of Lemma 3. ∎
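The giant-component behavior invoked in this proof can be illustrated numerically. The sketch below samples a graph with a fixed number of uniformly chosen edges, measures its largest connected component with a union-find structure, and compares the result with the classical prediction for a random graph with mean degree c > 1, namely the positive solution of beta = 1 - exp(-c * beta). This textbook form and all names are assumptions for illustration and are not copied from the statement of Lemma 3.

```python
import math
import random

def largest_component_fraction(num_nodes, num_edges, rng):
    """Sample num_edges distinct uniform edges; return the largest component size / num_nodes."""
    parent = list(range(num_nodes))
    size = [1] * num_nodes

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    edges = set()
    while len(edges) < num_edges:
        u, v = rng.sample(range(num_nodes), 2)
        edges.add((min(u, v), max(u, v)))
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru               # union by size
            size[ru] += size[rv]
    return max(size[find(u)] for u in range(num_nodes)) / num_nodes

def giant_fraction(mean_degree, iters=200):
    """Fixed-point iteration for beta = 1 - exp(-mean_degree * beta)."""
    beta = 1.0
    for _ in range(iters):
        beta = 1 - math.exp(-mean_degree * beta)
    return beta

rng = random.Random(2)
n, m = 5000, 5000                 # nodes (singleton balls), edges (strong doubletons)
simulated = largest_component_fraction(n, m, rng)
predicted = giant_fraction(2 * m / n)
```

Here the mean degree is 2, above the threshold of 1, so a component of linear size is expected.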

B.5 Tree-like Assumption

By Lemma 3, we know that we can recover a constant fraction of the non-zero elements with probability . Then, we study the iterative decoding process. The analysis is based on density evolution, which is a common and powerful technique in coding theory. Similar to other coding-theoretic analyses, our derivation of density evolution is based on a tree-like assumption. Here, we state the tree-like assumption first and provide the results on the probability that the tree-like assumption holds.



Figure 8: Level-2 neighborhood of edge .

As we have mentioned, the association between the balls in color (non-zero elements in ) and the bins can be represented by a -left regular bipartite graph. We label the edges by an ordered pair of a ball and a bin , denoted by . We define the level- neighborhood of , denoted by , as the subgraph of all the edges and nodes on paths of length less than or equal to that start from and whose first edge is not  [26]. We have the following results on the probability that is a tree, or equivalently, cycle-free, for a constant .

Lemma 4.

[26] For a fixed constant , is a tree with probability at least .

We conduct the density evolution analysis conditioned on the event that is a tree for an edge which is chosen from the edges uniformly at random. Then, we will take the complementary event into consideration and complete the analysis.

B.6 Analysis on the Density Evolution

Recall that in the first iteration, we find all the singletons, and in the second iteration, we find the strong doubletons and form the giant component. Let be the probability that at the th iteration of the learning algorithm, a ball in color , which is chosen from the balls uniformly at random, is not recovered, . Here, corresponds to the probability that after the second iteration, a randomly chosen ball in color is not in the giant component. According to the previous section, we know that by choosing parameters which satisfy (6), we have with probability . Now we analyze the relationship between and for .

Consider the iterative decoding process as a message passing process. First, we know that at iteration , a ball in color passes a message to a bin through an edge claiming that it is colored, if and only if at least one of its other neighboring bins contains a resolvable multiton in color . Second, a sub-bin in color becomes a resolvable multiton if and only if all the other balls in this sub-bin are colored. This message passing process is illustrated in Figure 8. Under the tree-like assumption, the messages passed among the balls and bins are independent, and we have

which gives us


As we can see, the major difference between the density evolution of the Mixed-Coloring algorithm and the PhaseCode algorithm is that there is a constant probability that a bin has a sub-bin in color .

Next, we will show that after a constant number of iterations, can be arbitrarily small.

Lemma 5.

If we choose parameters satisfying


then for any constant , there exists a constant , such that .


Let , then we have . It is easy to see that , , and is a monotonically increasing function. We also have

We know that if


then there exists at least one fixed point such that . We use to represent the largest fixed point of in . Now we argue that the fixed point can be made arbitrarily small by choosing proper parameters. Suppose that for a certain set of parameters and , the fixed point is , then if we keep and increase to , where is a constant, then we can see that the new fixed point is upper bounded by , and in this way, the fixed point can be made an arbitrarily small constant. As shown in [26], as long as we can choose parameters to make the fixed point , then, there exists a constant number of iterations , depending on , such that . ∎
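The fixed-point behavior discussed in this proof can be explored with a short iteration. The recursion in the sketch is an assumed illustrative form, modeled on the verbal description of the message passing: a ball stays uncolored only if none of its other bins is a colored bin in which every other ball is already colored. It is not the exact density evolution equation of the paper, and the names `lam` (mean number of balls per bin), `q` (probability that a bin has a sub-bin in the color), and `left_degree` are hypothetical.

```python
import math

def density_evolution(p0, lam, q, left_degree, num_iters):
    """Iterate p <- (1 - q * exp(-lam * p)) ** (left_degree - 1).

    Returns the list [p0, p1, ..., p_num_iters] of uncolored fractions
    under the assumed recursion.
    """
    history = [p0]
    p = p0
    for _ in range(num_iters):
        p = (1 - q * math.exp(-lam * p)) ** (left_degree - 1)
        history.append(p)
    return history

history = density_evolution(p0=0.3, lam=1.0, q=0.95, left_degree=5, num_iters=20)
error_floor = history[-1]
```

In this toy recursion the limit is strictly positive whenever `q` is below 1, which mirrors the error floor caused by bins that have no sub-bin in the color; raising `left_degree` with the other parameters fixed pushes the floor down.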

Then, we can prove the following lemma showing that the number of uncolored balls in color is concentrated around with high probability.

Lemma 6.

Let be the number of uncolored balls in color after iterations. Then for any , there exists a constant , such that when conditioned on the event that , and is large enough,


The proof of Lemma 6 is the same as in [26], and uses a Doob martingale argument and Azuma’s concentration bound. We should also notice that the event that the tree-like assumption does not hold is already considered in (11). Now combining Lemmas 3, 5, and 6, we have shown that for a specific color , there exist suitable parameters of the algorithm such that after a constant number of iterations, the Mixed-Coloring algorithm can recover an arbitrarily large fraction of the balls in color with probability . Using a union bound over all the colors ( is a constant), we have proved the results in Theorem 1 on the error probability.

B.7 Computational Complexity

In this part, we analyze the computational complexity of the algorithm. First, there are bins, each bin has a constant number of sub-bins, and refining the measurements of each bin takes operations, so the computational complexity of refining measurements is . Next, to find all the singletons, we need to check all the colored sub-bins; since checking each sub-bin takes operations, the computational complexity of this stage is . In the third stage, we find all the strong doubletons. We know that there are singleton balls and for each singleton ball, there are bins connected to it. For each of the bins, we subtract the measurements contributed by the singleton ball from the refined measurements in the sub-bins, and do the ratio test to see if it is a strong doubleton. Therefore, processing each bin takes operations and since is also a constant, the computational complexity of finding strong doubletons is also . Then, we get the graph with nodes and edges, corresponding to the singleton balls and strong doubletons, respectively. Using the breadth-first search algorithm, the computational complexity of finding the connected components is . In the last stage, we iteratively find other uncolored balls. For each unprocessed sub-bin, since we do not know the color of the sub-bin, there are possible remaining measurements. Each time we find a new ball, we update at most remaining measurements and do the ratio test. Therefore, it takes operations to color a new ball. Since there are uncolored balls after finding the giant components, the computational complexity of the last stage is also . So far, we have shown that the computational complexity of the Mixed-Coloring algorithm is , which completes the proof of Theorem 1.
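The connected-component step can be sketched with a standard breadth-first search, which runs in time linear in the number of nodes plus the number of edges. In the sketch, nodes stand for singleton balls and edges for strong doubletons; the function name and the toy input are assumptions for illustration.

```python
from collections import deque

def connected_components(num_nodes, edges):
    """Return the connected components of an undirected graph via BFS."""
    adjacency = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    seen = [False] * num_nodes
    components = []
    for start in range(num_nodes):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:
            u = queue.popleft()
            component.append(u)
            for w in adjacency[u]:
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
        components.append(component)
    return components

# Six singleton balls, three strong doubletons.
components = connected_components(6, [(0, 1), (1, 2), (3, 4)])
largest = max(components, key=len)
```

Each node is enqueued once and each edge is examined twice, which gives the linear running time used in the complexity count.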

Appendix C Computing the Constants in the Sample Complexity

In this section, we give exact constants in the sample complexity results. For simplicity, we assume that and for all . We define a new notation , and then there is . We will analyze the minimum number of measurements that we need to reach a certain reliability target. More precisely, we set the maximum error floor to be , and numerically calculate the error floor for different values of , , , and . Then, we minimize the number of total measurements, which is proportional to with the constraint that the error floor . As we have shown in previous parts, the parameters should also satisfy (6) and (9). We know that if (6) is satisfied, when is large enough, there should be a giant component with size linear in for each color. where is a threshold that we can choose. Therefore, we select parameters with three constraints, which are (9), (6), and .

11      12      13      14      15      16      17      18
6.7     8.7     1.9     3.1     5.1     1.6     0.5     7.4
2.95    3.17    3.23    3.46    3.71    3.78    3.86    4.37
4       4       4       3       3       3       3       3
4       3       3       4       3       3       3       2
35.4    34.87   35.53   34.6    33.39   34.02   34.74   34.96

11      12      13      14      15      16      17      18
4.4     5.2     2.7     9.2     8.8     2.8     6.2     2.3
1.94    2.08    2.17    2.39    2.52    2.56    2.76    2.81
7       6       6       5       5       5       5       5
7       7       6       6       5       5       4       4
40.74   39.52   39.06   38.24   37.80   38.4    38.64   39.34

11      12      13      14      15      16      17      18
7.8     8.7     8.1     5.6     4.2     3.3     4.0     5.0
1.48    1.59    1.68    1.76    1.85    1.93    2.04    2.16
9       9       8       8       7       7       7       6
11      8       8       7       8       7       6       7
42.92   41.34   40.32   40.48   40.7    40.53   40.8    41.04
Table 3: Constants in the results of sample complexity.

The results of the numerical calculation are shown in Table 3. In these experiments, we set , , and we fix the left degree and choose different values of , , and to minimize the number of measurements subject to the three constraints. Then we compare the optimal number of measurements over different choices of and find the optimal . As we can see, to reach the same reliability level, for , the optimal number of measurements we need is , , and , respectively. The number of measurements we need increases only slightly with , and the optimal lies between 13 and 15.
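The selection of the best left degree can be reproduced directly from the numbers in Table 3. The sketch below assumes that the first row of each block lists the left degree and that the last row of each block is the constant in the number of measurements; the row labels are not available here, so this reading is an assumption.

```python
# Left degrees (first row of each block in Table 3).
left_degrees = [11, 12, 13, 14, 15, 16, 17, 18]

# Last row of each block in Table 3, read here as the measurement constant.
measurement_rows = [
    [35.4, 34.87, 35.53, 34.6, 33.39, 34.02, 34.74, 34.96],
    [40.74, 39.52, 39.06, 38.24, 37.80, 38.4, 38.64, 39.34],
    [42.92, 41.34, 40.32, 40.48, 40.7, 40.53, 40.8, 41.04],
]

def best_left_degree(degrees, costs):
    """Return (degree, cost) for the smallest cost in the row."""
    best = min(range(len(costs)), key=costs.__getitem__)
    return degrees[best], costs[best]

optima = [best_left_degree(left_degrees, row) for row in measurement_rows]
```

Under this reading, the minimizing left degree is 15 for the first two blocks and 13 for the third.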

Appendix D Details of Noisy Recovery Algorithm

In this section, we provide more details on how we robustify the Mixed-Coloring algorithm in the presence of noise. The overall structure of the algorithm is the same as in the noiseless case. However, the ratio test that we use for indexing in the noiseless case and the summation check are both fragile to noise. Therefore, we need a different design of the query vectors. The main idea is to encode the location information using binary representations, i.e., binary bits, rather than relative phases. Similar methods have been used in problems such as compressive phase retrieval [13]. Further, instead of consistent pairs, we find consistent sets of measurements using consecutive summation checks.
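To make the idea of binary location encoding concrete, the sketch below shows a generic scheme in which each query row carries one bit of every coordinate index, together with an all-ones reference query. It is a simplified stand-in chosen for illustration, not the query design of the Robust Mixed-Coloring algorithm, and the helper names `bit_rows` and `locate_singleton` are hypothetical.

```python
import math

def bit_rows(n):
    """Row b holds bit b of each coordinate index 0..n-1."""
    num_bits = max(1, math.ceil(math.log2(n)))
    return [[(j >> b) & 1 for j in range(n)] for b in range(num_bits)]

def locate_singleton(reference, bit_measurements):
    """Decode the index of a single non-zero entry.

    Each ratio y_b / reference is close to 1 when bit b of the index
    is set and close to 0 otherwise, so thresholding at 0.5 tolerates
    small additive noise.
    """
    index = 0
    for b, y in enumerate(bit_measurements):
        if y / reference > 0.5:
            index |= 1 << b
    return index

n = 16
x = [0.0] * n
x[11] = 2.5                                  # single non-zero entry
noise = [0.05, -0.03, 0.02, -0.04, 0.01]     # small fixed perturbations

reference = sum(x) + noise[0]                # all-ones query
measurements = [
    sum(r * v for r, v in zip(row, x)) + e
    for row, e in zip(bit_rows(n), noise[1:])
]
decoded = locate_singleton(reference, measurements)
```

A thresholded bit pattern degrades gracefully under noise, whereas a location read off from a ratio of two real-valued measurements can change with any perturbation.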

Design of Queries.

We still design the query vectors according to the -left regular bipartite graph. For a particular bin, let denote the association between this bin and the coordinates. We design query vectors , for this bin as follows: