Learning Mixtures of Sparse Linear Regressions Using Sparse Graph Codes
Abstract
In this paper, we consider the mixture of sparse linear regressions model. Let be unknown sparse parameter vectors with a total of nonzero coefficients. Noisy linear measurements are obtained in the form , each of which is generated randomly from one of the sparse vectors with the label unknown. The goal is to estimate the parameter vectors efficiently with low sample and computational costs. This problem presents significant challenges as one needs to simultaneously solve the demixing problem of recovering the labels as well as the estimation problem of recovering the sparse vectors .
Our solution to the problem leverages the connection between modern coding theory and statistical inference. We introduce a new algorithm, MixedColoring, which samples the mixture strategically using query vectors constructed based on ideas from sparse graph codes. Our novel code design allows for both efficient demixing and parameter estimation. The algorithm achieves the order-optimal sample and time complexities of in the noiseless setting, and near-optimal complexities in the noisy setting. In one of our experiments, to recover a mixture of two regressions with dimension and sparsity , our algorithm is more than times faster than the EM algorithm, with about of its sample cost.
1 Introduction
Mixture and latent variable models, such as Gaussian mixtures and subspace clustering, are expressive, flexible, and widely used in a broad range of problems including background modeling [1], speaker identification [2] and recommender systems [3]. However, parameter estimation in mixture models is notoriously difficult due to the nonconvexity of the likelihood functions and the existence of local optima. In particular, it often requires a large sample size and many reinitializations of the algorithms to achieve an acceptable accuracy.
Our goal is to develop provably fast and efficient algorithms for mixture models, with sample and time complexities sublinear in the problem’s ambient dimension when the parameter vectors of interest are sparse, by leveraging the underlying low-dimensional structures.
In this paper we focus on a powerful class of models called mixtures of linear regressions [4]. We consider the sparse setting with a query-based algorithmic framework. In particular, we assume that each query-measurement pair $(x_i, y_i)$ is generated from a sparse linear model chosen randomly from $L$ possible models (we use $x^H$ to denote the conjugate transpose of $x$, and $[L]$ the set of integers $\{1, 2, \ldots, L\}$):
$$y_i = x_i^H \beta^{(\ell)} + w_i \quad \text{with probability } q_\ell, \quad \ell \in [L], \qquad (1)$$
where $w_i$ is noise. The total number of nonzero elements in the parameter vectors is assumed to be $K$. The goal is to estimate the $\beta^{(\ell)}$’s, without knowing which $\beta^{(\ell)}$ generates each query-measurement pair.
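As an illustration of the measurement model (1), the following minimal sketch draws one measurement from a mixture with a hidden label. The function name `sample_measurement`, the toy parameter vectors, and the use of real Gaussian noise are assumptions made for this example only.

```python
import random

def sample_measurement(x, betas, weights, noise_std=0.0, rng=random):
    """Return (y, label) with y = x^H beta^(label) + w.

    The label is returned only for inspection; a learner never observes it.
    """
    label = rng.choices(range(len(betas)), weights=weights, k=1)[0]
    # x^H beta: conjugate the query entries before multiplying.
    y = sum(complex(xj).conjugate() * bj for xj, bj in zip(x, betas[label]))
    if noise_std > 0:
        y += rng.gauss(0.0, noise_std)  # real Gaussian noise, for simplicity
    return y, label

# Two sparse parameter vectors in dimension 6 (toy values).
betas = [[0, 2.0, 0, 0, 0, 0], [0, 0, 0, -1.0, 0, 0.5]]
x = [1, 1, 1, 1, 1, 1]
y, label = sample_measurement(x, betas, weights=[0.5, 0.5])
```

With the all-ones query, the noiseless measurement is simply the sum of the entries of whichever parameter vector was drawn, which is why a single measurement cannot reveal the label on its own.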
A mixture of regressions provides a flexible model for various heterogeneous settings where the regression coefficients differ for different subsets of observations. This model has been applied to a broad range of tasks including medicine measurement design [5], behavioral health care [6] and music perception modeling [7]. Here, we study the problem when the query vectors can be designed by the user; in Section 1.2 we discuss several practical applications that motivate the study of this query-based setting. Our results show that by appropriately exploiting this design freedom, one can achieve a significant reduction in the sample and computational costs.
To recover unknown nonzero elements, it is clear that the number of measurements and the time required scale at least as . We introduce a new algorithm, called the MixedColoring algorithm, that matches these sublinear sample and time complexity lower bounds. The design of the query vectors and the decoding algorithm leverages ideas from sparse graph codes such as low-density parity-check (LDPC) codes [8]. Our algorithm recovers the parameter vectors with optimal sample and time complexities in the noiseless setting, both in theory and empirically, and is stable under noise with near-optimal sample and time complexities. Prior work on this problem that does not utilize the design freedom typically has sample/time complexities that are at least polynomial in ; we provide a survey of prior work and a more detailed comparison in Section 6. Empirically, we find that our algorithm is orders of magnitude faster than standard Expectation-Maximization (EM) algorithms for mixtures of regressions. For example, in one of our experiments, detailed in Section 5, we consider recovering a mixture of two regressions with dimension and sparsity ; our algorithm is more than times faster than the EM algorithm, with about of its sample cost.
1.1 Algorithm Overview
Our MixedColoring algorithm solves two problems simultaneously: (i) rapid demixing, namely identifying the label of the vector that generates each measurement ; (ii) efficient identification of the location and value of the nonzero elements of the ’s. The main idea is to use a divide-and-conquer approach that iteratively reduces the original problem into simpler ones with much sparser parameter vectors. More specifically, we design sets of sparse query vectors, with each set only associated with a subset of all the nonzero elements. The design of the query vectors ensures that we can first identify the sets which are associated with a single nonzero element (called singletons), and recover the location and value of that element (we call them singleton balls, shown as shaded balls in Figure 1(b)). We further identify the pairs of singleton balls which have the same (but unknown) label, indicated by the edges in Figure 1(b). Results from random graph theory guarantee that, with high probability, the largest connected components (giant components) of the singleton graph have different labels, and thus we recover a fraction of the nonzero elements in each , as shown in Figure 1(c). We can then iteratively enlarge the recovered fraction with a guess-and-check method until finding all the nonzero elements. We revisit Figure 1 when describing the details of our algorithm in Section 3.
1.2 Motivation
Our problem is a natural extension of the setting of compressive sensing, in which one often has full freedom of designing query vectors in order to estimate a sparse parameter vector (compressive sensing is the special case of our problem with a single parameter vector). In many applications, the unknown sparse parameter vector can be affected by latent variables, leading to a mixture of sparse linear regressions, and these scenarios have been observed in neuroscience [9], genetics [10], psychology [5], etc. Here, we provide a concrete example motivated by neuroscience applications [9]. In neural signal processing, sensors are used to measure the brain activities, represented by an unknown sparse vector . The sensors can be modeled as digital filters, and one can design the linear filter weights (’s) when measuring the neural signal. Multiple sensors are usually placed in a particular area of the brain in order to acquire enough compressed measurements. However, there may be more than one neuron affecting a particular area of the brain, as shown in Figure 2, and each neuron may have different activities, corresponding to a different . Consequently, each sensor may be measuring one of several different sparse signals, which can be formulated as a mixture-of-sparse-linear-regressions problem. Variants of this problem, such as neural spike sorting [9], have been studied in neuroscience. While the common solution is to use clustering algorithms on the spike signals, we believe that our algorithm has the potential to improve sensor design and reduce sample and time complexities.
In addition, our work demonstrates the power of design freedom in tackling sparse mixture problems by highlighting the large performance gap between algorithms that can exploit the design freedom and those that cannot. We also believe that our ideas are applicable more broadly to other latent-variable problems that require experimental designs, such as survey designs in psychology with mixed types of respondents and biology experiments with mixed cell interior environments.
2 Main Results
In this section, we present the recovery guarantees for the MixedColoring algorithm, and provide bounds on its sample and time complexities. We assume there are unknown dimensional parameter vectors . Each has nonzero elements, i.e., . Let be the total number of nonzero elements. Using the query vectors , the MixedColoring algorithm obtains measurements , generated independently according to the model (1), and outputs an estimate , of the unknown parameter vectors. We defer more details to Sections 3 and 4.
Our results are stated in the asymptotic regime where and approach infinity. A constant is a quantity that does not depend on and , with the associated Big-O notations and . We assume that is a known and fixed constant, and the mixture weights satisfy for each and thus are of the same order. Similarly, the sparsity levels of the parameter vectors are also of the same order with .
2.1 Guarantees for the Noiseless Setting
In the noiseless case, i.e., , we consider for generality the complex-valued setting with (our results can be easily applied to the real case). We make a mild technical assumption, which stipulates that if any pair of parameter vectors have overlapping support, then the elements in the overlap are different.
Assumption 1.
For each pair , and each index , we have .
Under the above setting, we have the following recovery guarantees for the MixedColoring algorithm.
Theorem 1.
Consider the asymptotic regime where and approach infinity. Under Assumption 1, for any fixed constant , there exists a constant such that if the number of measurements is , then the MixedColoring algorithm satisfies the following three properties for each (up to a label permutation):

(No False Discovery) For each , equals either or ; for each , .

(Elementwise Recovery) There exists a constant such that for each .

(Support Recovery)
Moreover, the computational time of the MixedColoring algorithm is .
The theorem ensures that the MixedColoring algorithm has no false discovery, and recovers fraction of the nonzero elements with high probability. The error fraction is an input parameter to the algorithm, and can be made arbitrarily close to zero by adjusting the oversampling ratio . (By a more careful analysis, one can show that the dependence of on is . Here, since we set as a constant, is a constant.) Given the number of components , mixture weights and the target , the value of the constant can be computed numerically. The table below gives some of the values for several and , under the setting . We see that the value of is quite modest.
We can in fact boost the above guarantee to recover all the nonzero elements, by running the MixedColoring algorithm times independently and aggregating the results by majority voting. By property 2 in Theorem 1 and a union bound argument, this procedure exactly recovers all the parameter vectors with probability with sample and time complexities.
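The aggregation step described above can be sketched as a coordinate-wise majority vote. This is an illustrative sketch only: it assumes the estimates from different runs have already been aligned to the same label, that recovered values can be compared exactly (in floating point one would round first), and the helper name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(estimates):
    """Combine several estimates of one parameter vector, coordinate by coordinate.

    Each run has no false discovery, so a coordinate holds either the true
    value or 0; the most frequent value across runs is kept.
    """
    n = len(estimates[0])
    combined = []
    for j in range(n):
        counts = Counter(run[j] for run in estimates)
        combined.append(counts.most_common(1)[0][0])
    return combined

runs = [
    [0, 2.0, 0, -1.0],   # this run recovered both nonzero elements
    [0, 2.0, 0, 0],      # this run missed coordinate 3
    [0, 0, 0, -1.0],     # this run missed coordinate 1
]
combined = majority_vote(runs)
```

In this toy example each nonzero element is missed by only one of the three runs, so the vote restores both of them.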
2.2 Guarantees for the Noisy Setting
An extension of the previous algorithm, Robust MixedColoring, handles noise in the measurement model (1). Here we focus on the case with two parameter vectors which appear equally likely, i.e., and , . Many interesting applications have binary latent factors: gene mutation present/absent, gender, healthy/sick individual, child/adult, etc. The noise is assumed to be i.i.d. Gaussian with mean zero and constant variance . For the purpose of theoretical analysis, we assume that the nonzero elements in the parameter vectors take values in a finite quantized set.
Assumption 2.
The nonzero elements of the parameter vectors satisfy , where
The positive constants and are known to the algorithms.
As shown in our empirical results in Section 5, the Robust MixedColoring algorithm works even when the assumption is violated. In this case, the algorithm produces the best quantized approximation to the unknown parameter vectors, provided that they are not too far off the quantized set. Theoretical results for the continuous-alphabet setting remain an open problem, and the tools in recent work such as [11] may be applicable to our problem.
When the quantization assumption holds, exact recovery is possible, as guaranteed in the theorem below. The Robust MixedColoring algorithm maintains sublinear sample and time complexities, and recovers the parameter vectors in the presence of noise with bounded variance.
Theorem 2.
Consider the asymptotic regime where and approach infinity with for some constant . When and Assumptions 1 and 2 hold, there exists a constant , such that if and the number of measurements is , then the Robust MixedColoring algorithm satisfies the three properties in Theorem 1. Moreover, the computational time of the Robust MixedColoring algorithm is .
Similar to the noiseless case, by running the Robust MixedColoring algorithm times, one can exactly recover the two parameter vectors with probability . In this case, the sample and computational complexities are , and further, since we assume that for some constant , we can still conclude that the sample and computational complexities for full recovery are .
3 MixedColoring Algorithm for Noiseless Recovery
In this section, we provide details of the MixedColoring algorithm in the noiseless setting. We first provide some primitives that serve as important ingredients in the algorithm, and then describe the design of query vectors and decoding algorithm in detail.
3.1 Primitives
The algorithm makes use of four basic primitives: summation check, indexing, peeling, and guess-and-check, which are described below.
Summation Check: Suppose that we generate two query vectors and independently from some continuous distribution on , and a third query vector of the form . Let , , and be the corresponding measurements. We check the sum of the measurements and in the noiseless case, if , then with probability one, we know that these three measurements are generated from the same parameter vector . In this case we call a consistent pair of measurements as they are from the same (the third measurement is now redundant).
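The summation check can be sketched in a few lines. This is a minimal illustration under stated assumptions: the queries are drawn from a complex Gaussian distribution (one possible continuous distribution), measurements are noiseless and of the form y = x^H beta, and the names `measure`, `summation_check` and the tolerance `tol` are choices made for this example.

```python
import random

def measure(x, beta):
    """Noiseless measurement y = x^H beta."""
    return sum(xj.conjugate() * bj for xj, bj in zip(x, beta))

def summation_check(y1, y2, y3, tol=1e-9):
    """True if y1 + y2 matches y3 up to a floating-point tolerance."""
    return abs(y1 + y2 - y3) <= tol

rng = random.Random(0)
n = 5
x1 = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
x2 = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
x3 = [a + b for a, b in zip(x1, x2)]  # third query is the sum of the first two

beta_a = [0, 2.0, 0, 0, 0]
beta_b = [0, 0, 0, -1.0, 0]

# All three measurements come from beta_a: the check passes.
same = summation_check(measure(x1, beta_a), measure(x2, beta_a), measure(x3, beta_a))
# The second measurement comes from beta_b: the check fails with probability one.
mixed = summation_check(measure(x1, beta_a), measure(x2, beta_b), measure(x3, beta_a))
```

The tolerance replaces the exact equality of the noiseless analysis, since floating-point arithmetic does not preserve it in general.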
Indexing: The indexing procedure finds the locations and values of the nonzero elements using carefully designed query vectors. In the noiseless case, this can be done by a suitably designed ratio test. We sketch the idea of the ratio test here. Consider a consistent pair of measurements and corresponding query vectors . We design the query vectors such that the information of the locations of the nonzero elements is encoded in the relative phase between and . In particular, we generate i.i.d. random variables uniformly distributed on the unit circle. Letting where is the imaginary unit, we set the th entries of and to be either , or and . (The locations of the zeros are determined using sparse-graph codes and discussed later.) Below is an example of such a consistent pair of measurements and the corresponding linear system:
(2) 
Suppose that is sparse and of the form . There is only one nonzero element, , that contributes to the measurements and . In this case the consistent measurement pair is called a singleton. A singleton can be detected by testing the integrality of the relative phase of the ratio . In the above example, since and , we observe that and the relative phase is an integral multiple of . We therefore know that with probability one, this consistent pair is a singleton, and moreover the corresponding nonzero element is located at the rd coordinate with value . We would like to remark that the indexing step can also be done using real-valued query vectors.
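To make the ratio test concrete, here is a minimal sketch under an assumed query layout: for each coordinate j covered by the bin, the first query has entry r[j] and the second has entry r[j] * omega**j with omega = exp(2*pi*i/n), using 0-based coordinates and measurements y = x^H beta. The exact layout, the extra unit-modulus check, and the fixed values of `r` are assumptions of this example rather than details taken from the text.

```python
import cmath
import math

def ratio_test(y1, y2, r, n, tol=1e-6):
    """Return (location, value) if the consistent pair is a singleton, else None.

    r maps each coordinate covered by the bin to its unit-modulus weight.
    A singleton at coordinate j gives y2 / y1 = conj(omega**j).
    """
    if abs(y1) <= tol:
        return None                      # no nonzero element seen by this pair
    ratio = y2 / y1
    if abs(abs(ratio) - 1.0) > tol:
        return None                      # modulus is not one: not a singleton
    pos = -cmath.phase(ratio) * n / (2 * math.pi)
    if abs(pos - round(pos)) > tol:
        return None                      # phase is not an integer multiple of 2*pi/n
    j = round(pos) % n
    if j not in r:
        return None                      # coordinate is not covered by this bin
    return j, y1 / r[j].conjugate()

n = 8
omega = cmath.exp(2j * math.pi / n)
r = {2: 1 + 0j, 5: 1j}                   # fixed unit-modulus weights, for reproducibility

def pair(beta):
    """Measurement pair for a sparse beta given as {coordinate: value}."""
    y1 = sum(r[j].conjugate() * b for j, b in beta.items())
    y2 = sum((r[j] * omega ** j).conjugate() * b for j, b in beta.items())
    return y1, y2

singleton = ratio_test(*pair({2: 3.0}), r, n)
multiton = ratio_test(*pair({2: 3.0, 5: 1.0}), r, n)
```

In the singleton case the test returns coordinate 2 together with its value; with two contributing elements the ratio no longer has unit modulus and the pair is rejected.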
Peeling: The third ingredient of the decoder is peeling, i.e., iteratively reducing the problem by subtracting off recovered elements, in a Gaussian-elimination-like manner. In the example above, suppose instead that is sparse, i.e., , in which case the consistent pair
(3) 
is associated with two nonzero elements of . If in a previous iteration of the algorithm we have recovered the location and value of , then we can subtract/peel off this recovered element by , for .
The updated measurement pairs satisfy , and we have reduced the problem to a simpler form. In fact, in this case the pair becomes a singleton, to which the above ratio test can be applied to recover .
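The peeling step amounts to one subtraction per measurement. The sketch below is illustrative: the query entries and parameter values are toy numbers, measurements are assumed to be y = x^H beta, and the helper names `peel` and `measure` are hypothetical.

```python
def peel(y_pair, x_pair, j, value):
    """Subtract the contribution of a recovered element beta[j] = value
    from both measurements of a consistent pair."""
    return tuple(y - x[j].conjugate() * value for y, x in zip(y_pair, x_pair))

def measure(x, beta):
    """Noiseless measurement y = x^H beta."""
    return sum(xj.conjugate() * bj for xj, bj in zip(x, beta))

x1 = [0j, 1 + 0j, 0j, 1j]
x2 = [0j, 1j, 0j, -1 + 0j]
beta = [0, 2.0, 0, -1.0]                 # two nonzero elements seen by this pair

y_pair = (measure(x1, beta), measure(x2, beta))
residual = peel(y_pair, (x1, x2), j=1, value=2.0)
# residual now equals the pair generated by the single element beta[3] = -1.0
```

After peeling, the residual pair behaves like a singleton, so the ratio test can be applied to it directly.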
Guess-and-check: The ratio test and peeling steps can be combined to detect that two nonzero elements are from the same parameter vector. In the previous example (3), suppose instead that we recovered two elements and in previous iterations via ratio-testing another two consistent pairs that are singletons, but the values of their labels and are unknown. We can still try to peel off from ; if the updated measurements pass the ratio test and recover a nonzero element with location and value , then we know that with probability one the nonzero elements and must come from the same parameter vector (the one that generates ), i.e., . In this case the peeling step is valid.
The continuing execution of these four primitives is made possible by the design of the query vectors using sparse-graph codes, which we describe next.
3.2 Design of Query Vectors
As illustrated in Figure 3, we construct sets of query vectors (called bins). The query vectors in each bin are associated with some coordinates of the parameter vectors (i.e., the queries are nonzero only on those coordinates). The association between the coordinates and bins is determined by a left-regular bipartite graph with left nodes (coordinates) and right nodes (bins), where each left node is connected to right nodes chosen independently and uniformly at random. Each bin consists of three query vectors. The values of the nonzero elements of the first two query vectors are in the form of (2), enabling the ratio test. The third query vector equals the sum of the first two and is used for the summation check.
If the query vectors in each bin were used only once, then we would have very few bins passing the summation check and hence few consistent pairs. Instead, we use the first two query vectors repeatedly for times, obtaining two sets of measurements, each of size and called type-I and type-II index measurements. We use the third query vector times to obtain a set of verification measurements. We therefore have measurements associated with each of the bins, hence a total of measurements, as shown in Figure 3. Using density evolution methods [12], we can find proper values of , , , and such that successful recovery is guaranteed.
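A possible construction of the bins is sketched below. Several details are assumptions of this example: each coordinate is connected to `d` distinct bins, a fresh unit-modulus weight is drawn for every (coordinate, bin) pair, the second query multiplies that weight by omega**j with omega = exp(2*pi*i/n), and the name `design_bins` is hypothetical. The repetition counts and parameter values chosen by density evolution are not modeled here.

```python
import cmath
import math
import random

def design_bins(n, num_bins, d, rng):
    """Return (neighbors, bins) where bins[b] is the query triplet (x1, x2, x3)."""
    omega = cmath.exp(2j * math.pi / n)
    # Left-regular bipartite graph: coordinate j is connected to d bins.
    neighbors = [rng.sample(range(num_bins), d) for _ in range(n)]
    bins = []
    for b in range(num_bins):
        x1 = [0j] * n
        x2 = [0j] * n
        for j in range(n):
            if b in neighbors[j]:
                r = cmath.exp(2j * math.pi * rng.random())  # uniform on the unit circle
                x1[j] = r
                x2[j] = r * omega ** j   # relative phase encodes the location j
        x3 = [a + c for a, c in zip(x1, x2)]  # used for the summation check
        bins.append((x1, x2, x3))
    return neighbors, bins

neighbors, bins = design_bins(n=16, num_bins=6, d=3, rng=random.Random(1))
```

Each query is nonzero only on the coordinates connected to its bin, which is what keeps the measurements in a bin dependent on just a few nonzero elements of the parameter vectors.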
3.3 Decoding Algorithm
The decoding algorithm first finds consistent pairs (by summation check) in each bin, within which singletons are identified (by the ratio test). The ratio test also recovers the location and values of several nonzero elements, some of which can then be associated with the same by guess-and-check. At this point, for each , we have recovered some of its nonzero elements (including their locations, values and labels). These steps are then repeated iteratively via peeling until no more nonzero elements can be found. Below we elaborate on these steps.
Finding Consistent Pairs:
The decoding procedure starts by finding all the consistent pairs. In each bin, we perform summation checks on all triplets in which , , and are the type-I index measurement, type-II index measurement and verification measurement, respectively. If a triplet passes the summation check, then a consistent pair is found. Note that in each bin the number of triplets of the above form is a constant, so this step can be done in time. The subsequent steps of the algorithm are based on the consistent pairs found in this step.
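The search for consistent pairs within one bin can be sketched as a brute-force enumeration over triplets. The measurement values below are toy numbers and the name `find_consistent_pairs` is hypothetical; since the number of measurements per bin is a constant, this enumeration costs constant time per bin.

```python
from itertools import product

def find_consistent_pairs(type1, type2, verification, tol=1e-9):
    """Return all (y1, y2) for which some verification measurement y3
    satisfies y1 + y2 = y3 up to tolerance."""
    pairs = []
    for y1, y2 in product(type1, type2):
        if any(abs(y1 + y2 - y3) <= tol for y3 in verification):
            pairs.append((y1, y2))
    return pairs

# Toy bin: the first entries of type1 and type2 come from one parameter
# vector, the second entries from another; the verification measurement
# was generated by the first parameter vector.
type1 = [3 + 0j, 1 + 0j]
type2 = [-3j, 2j]
verification = [3 - 3j]
pairs = find_consistent_pairs(type1, type2, verification)
```

Only the pair whose two measurements share a parameter vector with the verification measurement survives.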
Recovering a Subset of Nonzero Elements:
Each nonzero element of the parameter vectors can be identified by its label-location-value triplet . We visualize these triplets (i.e., nonzero elements) as balls, as shown in Figure 1(a), and initially their labels, locations and values are unknown. As before, a consistent pair associated with only one nonzero element is called a singleton, and we call this nonzero element a singleton ball. We run the ratio test on the consistent pairs to identify singletons and their associated singleton balls. The singleton balls found are illustrated in Figure 1(b) as shaded balls. The ratio test also recovers the locations and values of these singleton balls, although at this point we do not know the label of the balls.
The next step is crucial: For two singleton balls and a consistent measurement pair associated with the locations of these two balls, we run the guess-and-check operations to detect if these two singleton balls indeed have the same label (or equivalently, if the two nonzero elements are in the same parameter vector). If so, we connect these two balls with an edge, as shown in Figure 1(b). Doing so creates a graph over the balls (i.e., nonzero elements), and each connected component of the graph is from a single parameter vector. Since each nonzero element is associated with a constant number of consistent pairs (due to using a left-regular bipartite graph with constant ), this step can in fact be done efficiently in time without enumerating all the combinations of singleton ball pairs.
By carefully choosing the parameters , , , and , and using tools from random graph theory, we can ensure that with high probability the largest connected components (called giant components) correspond to the parameter vectors, and each of these components has size . Then, the labels of the balls in these components are now identified. This is illustrated in Figure 1(c) for , where colors represent the labels. In summary, at this point we have recovered the labels, locations and values of a constant fraction of the nonzero elements (i.e., balls) of each parameter vector.
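Once the same-label edges between singleton balls are known, extracting the largest connected components is a standard graph computation. The sketch below uses a union-find structure; the edge list is a toy input standing in for the output of guess-and-check, and the name `giant_components` is hypothetical.

```python
from collections import defaultdict

def giant_components(num_balls, edges, num_labels):
    """Return the num_labels largest connected components of the ball graph."""
    parent = list(range(num_balls))

    def find(a):
        # Find the representative of a, compressing the path as we go.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb              # merge the two components

    groups = defaultdict(list)
    for ball in range(num_balls):
        groups[find(ball)].append(ball)
    largest = sorted(groups.values(), key=len, reverse=True)[:num_labels]
    return [sorted(component) for component in largest]

# Balls 0-2 share one label, balls 3-6 another; ball 7 is not yet connected.
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (5, 6)]
components = giant_components(8, edges, num_labels=2)
```

Balls outside the returned components, such as ball 7 here, are the ones left for the iterative decoding stage.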
Iterative Decoding:
The decoding procedure proceeds by identifying the labels of the remaining balls via iteratively applying the peeling and guess-and-check primitives. The connected components in Figure 1(c) are therefore expanded, until no more changes can be made, as illustrated in Figure 1(d).
We provide an example of this iterative procedure in Figure 4. Recall that the association between the coordinates of the parameter vectors and the bins (or consistent pairs) is determined by a bipartite graph. Here, we only show one consistent pair for each bin and omit the zero elements. The nonzero elements and the consistent pairs are shown as balls and squares, respectively, as in Figure 4(a). The steps described in the last part recover a subset of these balls, which are shown in colors in Figure 4(b). Now consider the measurement pair 1, which is associated with the balls , and . As and are recovered, we can peel them off from the measurement pair 1 to recover (by the ratio test) the label, location and value of the nonzero element represented by ball . Similarly, peeling off the recovered ball from the measurement pair , recovers ball , as illustrated in Figure 4(c). We continue this process iteratively, peeling off balls recovered in the previous iterations to recover more balls. For example, we peel off the balls and from the measurement pair to recover the ball , and the ball from pair to recover ball , resulting in Figure 4(d). So far we have described the MixedColoring algorithm in the noiseless case, and we refer readers to Section B of the appendices for the analysis of this algorithm.
4 Robust MixedColoring Algorithm for Noisy Recovery
The overall structure of the Robust MixedColoring algorithm is the same as its noiseless counterpart. In the presence of noise, the ratio test method for indexing and the summation check primitive need to be robustified, which is done by a modification of the query design. In particular, we design three types of query vectors. The first type, called binary indexing vectors, encodes the location information using binary representations with bits (as opposed to using the relative phases in the noiseless case). A similar approach is considered in [13] for compressive phase retrieval. The second type is called singleton verification vectors, which are used for singleton detection. Using these two types of vectors we can modify the ratio test to achieve the same performance with noise. The third type of query vectors is used for consecutive summation check, which finds consistent sets of measurements.
In addition to the new query design, we also employ a noise reduction scheme. This is done by using each designed query vector (say ) repeatedly for times and averaging the corresponding measurements from the same . In particular, these measurements are sampled i.i.d. from a mixture of two Gaussians with centers and , so we use an EM algorithm initialized by moment methods to estimate the two centers. Using the result in [14], we prove that the EM-based noise reduction scheme succeeds under the conditions in Theorem 2, namely and . We refer the readers to Section D of the appendices for the details of the Robust MixedColoring algorithm, and Section F for more details of the EM algorithm that we use.
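The noise reduction idea can be illustrated with a small EM routine for a balanced mixture of two one-dimensional Gaussians with known variance. This is a simplified sketch, not the procedure analyzed in the appendices: it assumes real-valued measurements, equal mixture weights, a known noise level `sigma`, and a moment-based initialization derived from the sample mean and variance; the name `em_two_centers` is hypothetical.

```python
import math
import random

def _sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def em_two_centers(samples, sigma, iters=50):
    """Estimate the two centers of a balanced two-Gaussian mixture."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    # Moment initialization: for a balanced mixture,
    # var = sigma^2 + ((mu2 - mu1) / 2)^2.
    half_gap = math.sqrt(max(var - sigma ** 2, 0.0))
    mu1, mu2 = mean - half_gap, mean + half_gap
    for _ in range(iters):
        # E-step: posterior probability that each sample belongs to center mu1.
        resp = [_sigmoid(((s - mu2) ** 2 - (s - mu1) ** 2) / (2 * sigma ** 2))
                for s in samples]
        w1 = sum(resp)
        w2 = n - w1
        if w1 == 0 or w2 == 0:
            break
        # M-step: responsibility-weighted averages.
        mu1 = sum(g * s for g, s in zip(resp, samples)) / w1
        mu2 = sum((1 - g) * s for g, s in zip(resp, samples)) / w2
    return mu1, mu2

rng = random.Random(0)
samples = [rng.gauss(-2.0 if rng.random() < 0.5 else 3.0, 0.5) for _ in range(2000)]
mu1, mu2 = em_two_centers(samples, sigma=0.5)
```

The two estimated centers play the role of denoised measurements for the two parameter vectors, which the rest of the decoder can then treat much like noiseless values.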
5 Experimental Results
In this section, we test the sample and time complexities of the MixedColoring algorithm in both noiseless and noisy cases to verify our theoretical results. We refer the readers to the appendices for more details of the experiments.
For the noiseless case, we use the optimal parameters from numerical calculations of the density evolution. For different values of , we record the empirical success probability and running time averaged over trials. Here, we use a sufficiently small so that the success event is equivalent to recovery of all the nonzero elements. The results are shown in Figure 5(a). The phase transition occurs at some that matches the values in Table 1 predicted by our theory. Moreover, the running time is linear in and does not depend on , as shown in Figure 5(b).
Similar experiments are performed for the noisy case using the Robust MixedColoring algorithm, under the quantization assumption. Figure 6(a) shows the minimum number of queries required for 100 consecutive successes, for different and . We observe that the sample complexity is linear in and sublinear in . The running time exhibits a similar behavior, as shown in Figure 6(b). Both observations agree with the prediction of our theory.
We also compare the MixedColoring algorithm with a state-of-the-art EM-style algorithm (equivalent to alternating minimization in the noiseless setting) from [15]. These comparisons are not entirely fair, since our algorithm is based on carefully designed query vectors, while the algorithm in [15] uses random design, i.e., the entries of ’s are i.i.d. Gaussian. However, this is exactly where the intellectual value of our work lies: we expose the gains available by careful design. We consider four test cases with , with the first two cases being sparse problems and the last two being relatively dense problems. We find the minimum number of queries that leads to a 100% success rate in 100 trials, and the average running time. As shown in Table 2, in both sparse and dense problems, our MixedColoring algorithm is several orders of magnitude faster. As for the sample complexity, our algorithm requires a smaller number of samples in the sparse cases, while in dense problems, the sample complexity of our algorithm is within a constant factor (about 3) of that of the alternating minimization algorithm. For the noisy setting, our algorithm is most powerful in the high-dimensional setting, i.e., large , due to the factors. However, in this setting, it takes an extremely long time for state-of-the-art algorithms such as [16] to converge, and thus we do not present the comparison in the noisy setting.
(In Table 2, MC denotes MixedColoring.)
We further test the Robust MixedColoring algorithm when the quantization assumption is violated. For any , we define , where denotes the indicator function. This means that is the element in which is the closest one to , when . For a vector , we define . We define the perturbation of a vector as .
In this experiment, we generate sparse parameter vectors , with a total number of nonzero elements. These nonzero elements are generated randomly while keeping the perturbation of the parameter vectors under a certain level by adding bounded noise to the quantized nonzero elements. We record the probability of success for different number of bins and different perturbation level. Here the success event is defined as recovery of for all . The result is shown in Figure 7. We see that the Robust MixedColoring algorithm works without the quantization assumption as long as the perturbations are not too large.
6 Related Work
6.1 Mixtures of Regressions
Parameter estimation using the expectation-maximization (EM) algorithm is studied empirically in [17]. In [16], an $\ell_1$-penalized EM algorithm is proposed for the sparse setting. Theoretical analysis of the EM algorithm is difficult due to nonconvexity. Progress was made in [15] and [14] under stylized Gaussian settings with dense , for which a sample complexity of is proved given a suitable initialization of EM. The algorithm uses a grid search initialization step to guarantee that the EM algorithm can find the globally optimal solution, with the assumption that the query vectors are i.i.d. Gaussian distributed. The computational complexity is polynomial in . An alternative algorithm is proposed in [18], which achieves optimal sample complexity, but has high computational cost due to the use of semidefinite lifting. The algorithm in [19] makes use of tensor decomposition techniques, but suffers from a high sample complexity of . In comparison, our approach has order-optimal sample and time complexities by utilizing the potential design freedom.
6.2 Coding-theoretic Methods
Many modern error-correcting codes, such as LDPC codes and polar codes [20], with their roots in communication problems, exploit redundancy to achieve robustness, and use structural design to allow for fast decoding. These properties of codes have recently found applications in statistical problems, including graph sketching [21], sparse covariance estimation [22], low-rank approximation [23], and discrete inference [24]. Most related to our approach is the work in [25, 26, 13], which applies sparse graph codes with peeling-style decoding algorithms to compressive sensing and phase retrieval problems. In our setting we need to handle a mixture distribution, which requires more sophisticated query design and novel unmixing algorithms that go beyond the standard peeling-style decoding.
6.3 Combinatorial and Dimension Reduction Techniques
Our results demonstrate the power of strategic queries and coding-theoretic tools in mixture problems, and can be considered as efficient linear sketching of a mixture of sparse vectors. In this sense, our work is in line with recent work that makes use of combinatorial and dimension reduction techniques in high-dimensional and large-scale statistical problems. These techniques, such as locality-sensitive hashing [27], sketching of convex optimization [28], and coding-theoretic methods [29], allow one to design highly efficient and robust algorithms applicable to computationally challenging datasets without compromising statistical accuracy.
7 Conclusions
We propose the MixedColoring algorithm as a query-based learning algorithm for mixtures of sparse linear regressions. The design of the query vectors and the recovery algorithm are based on sparse graph codes, and our scheme achieves order-optimal sample and computational complexities in the noiseless case, and sublinear sample and time complexities in the presence of noise. Our experiments support the theoretical results. In the noisy scenario, studying the Robust MixedColoring algorithm with more than two parameter vectors and obtaining theoretical results for the continuous alphabet are two important future directions.
References
 [1] M. Harville, “A framework for high-level feedback to adaptive, per-pixel, mixture-of-Gaussian background models,” in Computer Vision ECCV 2002. Springer, 2002, pp. 543–560.
 [2] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, no. 1, pp. 19–41, 2000.
 [3] A. Zhang, N. Fawaz, S. Ioannidis, and A. Montanari, “Guess who rated this movie: Identifying users through subspace clustering,” arXiv preprint arXiv:1208.1544, 2012.
 [4] R. De Veaux, “Mixtures of linear regressions,” Comp. Statistics & Data Analysis, vol. 8, no. 3, 1989.
 [5] E. Blackwell, C. F. M. de Leon, and G. E. Miller, “Applying mixed regression models to the analysis of repeated-measures data in psychosomatic medicine,” Psychosomatic Medicine, vol. 68, no. 6, 2006.
 [6] P. Deb and M. Holmes, “Estimates of use and costs of behavioural health care: a comparison of standard and finite mixture models,” Econometric Analysis of Health Data, pp. 87–99, 2002.
 [7] K. Viele and B. Tong, “Modeling with mixtures of linear regressions,” Statistics and Computing, vol. 12, no. 4, pp. 315–330, 2002.
 [8] R. Gallager, “Low-density parity-check codes,” IRE Transactions on Information Theory, vol. 8, no. 1, pp. 21–28, 1962.
 [9] M. S. Lewicki, “A review of methods for spike sorting: the detection and classification of neural action potentials,” Network: Computation in Neural Systems, vol. 9, no. 4, pp. R53–R78, 1998.
 [10] R. Jansen, “A general mixture model for mapping quantitative trait loci by using molecular markers,” Theoretical and Applied Genetics, vol. 85, no. 2-3, pp. 252–260, 1992.
 [11] D. Yin, R. Pedarsani, X. Li, and K. Ramchandran, “Compressed sensing using sparse-graph codes for the continuous-alphabet setting,” in 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2016.
 [12] T. Richardson and R. Urbanke, “The capacity of low-density parity-check codes under message-passing decoding,” IEEE Transactions on Information Theory, vol. 47, pp. 599–618, February 2001.
 [13] D. Yin, K. Lee, R. Pedarsani, and K. Ramchandran, “Fast and robust compressive phase retrieval with sparse-graph codes,” in IEEE International Symposium on Information Theory, 2015, pp. 2583–2587.
 [14] S. Balakrishnan, M. J. Wainwright, and B. Yu, “Statistical guarantees for the EM algorithm: From population to sample-based analysis,” arXiv preprint arXiv:1408.2156, 2014.
 [15] X. Yi, C. Caramanis, and S. Sanghavi, “Alternating minimization for mixed linear regression,” in Proceedings of the 31st International Conference on Machine Learning, 2014, pp. 613–621.
 [16] N. Städler, P. Bühlmann, and S. Van De Geer, “$\ell_1$-penalization for mixture regression models,” Test, vol. 19, no. 2, pp. 209–256, 2010.
 [17] S. Faria and G. Soromenho, “Fitting mixtures of linear regressions,” Journal of Statistical Computation and Simulation, vol. 80, no. 2, pp. 201–225, 2010.
 [18] Y. Chen, X. Yi, and C. Caramanis, “A convex formulation for mixed regression with two components: Minimax optimal rates,” arXiv preprint arXiv:1312.7006, 2013.
 [19] A. T. Chaganty and P. Liang, “Spectral experts for estimating mixtures of linear regressions,” in Proceedings of the 30th International Conference on Machine Learning (ICML13), 2013, pp. 1040–1048.
 [20] E. Arikan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” IEEE Transactions on Information Theory, vol. 55, no. 7, 2009.
 [21] X. Li and K. Ramchandran, “An active learning framework using sparse-graph codes for sparse polynomials and graph sketching,” in Advances in Neural Information Processing Systems, 2015, pp. 2161–2169.
 [22] R. Pedarsani, K. Lee, and K. Ramchandran, “Sparse covariance estimation based on sparse-graph codes,” in Annual Allerton Conference on Communication, Control, and Computing, 2015.
 [23] S. Ubaru, A. Mazumdar, and Y. Saad, “Low rank approximation using error correcting coding matrices,” in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 702–710.
 [24] S. Ermon, C. Gomes, A. Sabharwal, and B. Selman, “Low-density parity constraints for hashing-based discrete integration,” in Proceedings of the 31st International Conference on Machine Learning, 2014, pp. 271–279.
 [25] X. Li, S. Pawar, and K. Ramchandran, “Sublinear time support recovery for compressed sensing using sparse-graph codes,” arXiv preprint arXiv:1412.7646, 2014.
 [26] R. Pedarsani, K. Lee, and K. Ramchandran, “PhaseCode: Fast and efficient compressive phase retrieval based on sparse-graph-codes,” arXiv preprint arXiv:1408.0034, 2014.
 [27] I. S. Dhillon, P. K. Ravikumar, and A. Tewari, “Nearest neighbor based greedy coordinate descent,” in Advances in Neural Information Processing Systems, 2011, pp. 2160–2168.
 [28] M. Pilanci and M. J. Wainwright, “Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares,” arXiv preprint arXiv:1411.0347, 2014.
 [29] D. Achlioptas and P. Jiang, “Stochastic integration via error-correcting codes,” in Proc. Uncertainty in Artificial Intelligence, 2015.
 [30] W. B. Johnson and J. Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space,” Contemporary Mathematics, vol. 26, pp. 189–206, 1984.
Appendices
Appendix A Details of Experiments
In this section, we provide more details of the experiments that we conducted. All simulations are done on a laptop with 2.8 GHz Intel Core i7 CPU and 16 GB memory using Python.
In Figure 5, we test the success probability and running time in the noiseless case. In both Figure 4(a) and Figure 4(b), we use for , for , for . In Figure 4(b), we use for , for , for .
In Table 2, we compare the sample and time complexities of the MixedColoring algorithm and the alternating minimization algorithm. We use , and . The parameters for alternating minimization are chosen as suggested in the original paper [15].
In Figure 6, we test the sample and time complexities of the Robust MixedColoring algorithm. In both Figure 5(a) and Figure 5(b), we choose quantization level , standard deviation of noise , algorithm parameters: , , number of singleton verification query vectors: . In Figure 5(a), we vary to find the minimum number of query vectors needed for successful recovery. In Figure 5(b), we fix and test the time cost.
In Figure 7, we test the performance of Robust MixedColoring algorithm with quantization assumption violated. We vary the number of bins to test the empirical probability of success, and also keep . Other parameters: , , quantization level , standard deviation of noise , number of singleton verification query vectors: , and .
Appendix B Proof of Theorem 1
B.1 Proof Outline
We prove Theorem 1 in this section. The proof includes two major steps: (i) show that the expectation of the fraction of nonzero elements which are not recovered can be arbitrarily small; (ii) show that this fraction concentrates around its mean with high probability. The first part mainly uses density evolution techniques, which are commonly used in coding theory, and the second part uses Doob’s martingale argument.
B.2 Notations
We briefly recall the MixedColoring algorithm in the noiseless case and declare some notations that we will use for the rest of the proof.
Recall that the parameter vector has nonzero elements. We call these nonzero elements balls in color . We design a left-regular bipartite graph with left nodes and right nodes, representing the coordinates and the bins, respectively. We denote the th bin by . We use the matrix to represent the biadjacency matrix of the bipartite graph, i.e., if and only if the th bin is associated with the th coordinate. Recall that we design three query vectors for in the form of (2), for the purpose of the ratio test. The third query vector is the sum of the first two and is used for the summation check. We repeat each of the first two query vectors times, and get type-I and type-II index measurements. We repeat the third query vector times and get verification measurements. For the th verification measurement of the th bin, we define a subbin . If we can find one type-I index measurement and one type-II index measurement such that the sum of the two measurements is equal to the th verification measurement, we know that these three measurements are generated by the same parameter vector, say . The two index measurements are called a consistent pair. Then, we say that the subbin has color . We define the color set of . If we can find a consistent pair corresponding to the th verification measurement, we let , otherwise . We further define the color set of bin as .
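The search for consistent pairs described above can be illustrated with a minimal Python sketch. It assumes the noiseless setting, where a type-I and a type-II index measurement generated by the same parameter vector sum exactly to the corresponding verification measurement. The function name `find_consistent_pairs`, the tolerance `tol` (added only to absorb floating-point error), and the toy values are hypothetical and are not part of the algorithm's specification.

```python
from itertools import product

def find_consistent_pairs(type1, type2, verification, tol=1e-9):
    """For each verification measurement, return the index pair (i, j) of a
    type-I and a type-II measurement whose sum matches it, or None if no
    such pair exists (the subbin then receives no color)."""
    result = []
    for v in verification:
        match = None
        # Exhaustive search over all (type-I, type-II) combinations;
        # abs() also works if the measurements are complex numbers.
        for (i, a), (j, b) in product(enumerate(type1), enumerate(type2)):
            if abs(a + b - v) <= tol:
                match = (i, j)
                break
        result.append(match)
    return result

# Toy bin (hypothetical values): two hidden vectors yield the
# (type-I, type-II) values (1.5, 0.5) and (-2.0, 3.0), respectively.
type1 = [1.5, -2.0, 1.5]
type2 = [3.0, 0.5]
verification = [2.0, 1.0, 7.0]   # 7.0 has no consistent pair
pairs = find_consistent_pairs(type1, type2, verification)
```

Since the number of repetitions per bin is a constant, this exhaustive search costs a constant number of operations per bin.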
B.3 Number of Singleton Balls
In this section, we analyze the number of singleton balls in color found in the first stage of the algorithm. We can show that this number is concentrated around a constant fraction of with high probability.
Lemma 1.
Let be the number of singleton balls in color found in the first stage. Then, there exists a constant (recall that, in our paper, constants are quantities which do not depend on and ) such that for any constant ,
(4) 
Proof.
We first fix some terminology. For a bin , we say that this bin has color when . Note that if there is more than one subbin in color in bin , these subbins are identical. Therefore, we say that a bin contains balls in color when has at least one subbin in color and the subbin is associated with nonzero elements in , or equivalently, the coded parameter vector satisfies , .
First, we analyze the probability that a particular bin has color . According to our model, the measurements are generated independently, therefore, we have
Then, we use to denote the probability of the event that a particular bin contains balls in color . Since each ball is associated with bins among the bins independently and uniformly at random, the number of balls in color that a bin contains is binomially distributed with parameters and , and we have
In addition, we can use Poisson distribution to approximate the binomial distribution when is a constant and approaches infinity. In the following analysis, we will use the approximation
Consider the bipartite graph representing the association between the balls in color and the bins. We know that there are edges connected to the balls in color , and we use to denote the expected fraction of these edges which are connected to a bin which contains balls in color , . Then, we have
and equivalently, is also the probability that an edge, which is chosen from the edges uniformly at random, is connected to a bin containing balls in color .
Let be the probability that a ball in color is a singleton ball. The event that this ball is a singleton ball is equivalent to the event that at least one of its associated bins contains one ball in color . Then, when approaches infinity, we have
and this is because in the limit , the correlations between the edges connected to a ball become negligible; this technique is often used in the theoretical analysis of density evolution in coding theory, and we will use this type of asymptotic argument several times in the proofs. Let be the number of singleton balls in color , then we have . Using the asymptotic argument and by Hoeffding’s inequality, we also have for any constant ,
and this means that the number of singleton balls in color is highly concentrated around . ∎
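The Poisson approximation used in the proof above can be checked numerically with a short sketch. It assumes that the binomial distribution in question has as parameters the number of balls in a color (called `K` here) and the ratio of the left degree `d` to the number of bins `M`, so that the Poisson rate is `K * d / M`. These names and the numerical values are illustrative assumptions, not values taken from the paper.

```python
import math

def binomial_pmf(k, n, p):
    """P[Bin(n, p) = k]."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P[Poisson(lam) = k]."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Hypothetical sizes: K balls in one color, left degree d, M bins.
K, d, M = 10_000, 3, 20_000
lam = K * d / M  # assumed Poisson rate; stays constant as K and M grow together

# Largest pointwise gap between the two distributions for small ball counts.
max_gap = max(
    abs(binomial_pmf(k, K, d / M) - poisson_pmf(k, lam)) for k in range(6)
)
```

With these values the two probability mass functions agree to within about 1e-3 for every small count, which is why the analysis can work with the Poisson form when the number of bins grows proportionally to the number of balls.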
B.4 Initial Fractions
We construct the graph whose nodes correspond to the singleton balls in color found in the previous stage, and analyze the number of edges in , which is equal to the number of strong doubletons in color . Then, we can show that the number of strong doubletons is concentrated around a constant fraction of with high probability.
Lemma 2.
Let be the number of strong doubletons in color found in the second stage. Then, there exists a constant such that for any constant ,
(5) 
Proof.
We know that the expected number of doubletons in color is . Then, we analyze the probability that a doubleton is a strong doubleton. Similar to the analysis in [26], for a particular ball in color , we let denote the event that this ball is in a singleton, and denote the event that this ball is in a doubleton. We have the conditional probability that a ball in a doubleton is also a singleton ball:
Then we know the probability that a doubleton is a strong doubleton is , and the expected number of strong doubletons in color is . Let and be the number of edges in graph . The expectation of is , and according to Hoeffding’s inequality, we have for any
meaning that the number of edges is highly concentrated around . ∎
Then, we get the following result on the size of the giant component of , using the asymptotic behavior of Erdős-Rényi random graphs.
Lemma 3.
Let be the size of the largest connected component (giant component) of . If the parameters of the MixedColoring algorithm satisfy
(6) 
then, for any constant , with probability , the initial fraction of the balls in color which are recovered after the second stage satisfies
(7) 
where the constant is the unique solution of the equation
and other connected components in are of sizes .
Proof.
This result is a direct corollary of the asymptotic behavior of Erdős-Rényi random graphs, and we only give a brief proof here. First, we condition on the number of singleton balls that we find in the first stage, i.e., and the number of edges in , i.e., . By symmetry, we know that the edges are uniformly chosen from the possible edges. Therefore, the graph is an Erdős-Rényi random graph. According to the results on the giant component of Erdős-Rényi random graphs, we know that if the limit
then with probability , the size of the giant component of graph is linear in , and other connected components have sizes . By (4) and (5), we know that for any constant , the limit lies in the interval , with probability , for some constant . Then, we can get rid of the conditioning and complete the proof of Lemma 3. ∎
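The giant-component behavior invoked in this proof can be illustrated with a small simulation. This sketch uses the classical textbook statement for a random graph with `n` nodes and `m` uniformly chosen edges: when the mean degree `c = 2m/n` exceeds 1, the largest component covers a fraction `beta` of the nodes, where `beta` is the positive solution of `beta = 1 - exp(-c * beta)`. The parameterization of the constant in Lemma 3 is not reproduced here; the function names, the seed, and the sizes are hypothetical.

```python
import math
import random

def largest_component_fraction(n, m, seed=0):
    """Sample m distinct edges uniformly at random on n nodes and return the
    fraction of nodes in the largest connected component (union-find)."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = set()
    while len(edges) < m:
        u, v = rng.randrange(n), rng.randrange(n)
        if u != v:
            edges.add((min(u, v), max(u, v)))
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru          # union by size
            size[ru] += size[rv]
    return max(size[find(x)] for x in range(n)) / n

def giant_fraction(c, iters=200):
    """Fixed-point iteration for beta = 1 - exp(-c * beta), started at 1."""
    beta = 1.0
    for _ in range(iters):
        beta = 1.0 - math.exp(-c * beta)
    return beta

n, c = 5000, 2.0                      # hypothetical graph size and mean degree
simulated = largest_component_fraction(n, int(c * n / 2))
predicted = giant_fraction(c)         # roughly 0.797 for c = 2
```

For mean degree below 1 the same iteration collapses to zero, matching the regime in which no component of linear size exists.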
B.5 Tree-like Assumption
By Lemma 3, we know that we can recover a constant fraction of the nonzero elements with probability . Then, we study the iterative decoding process. The analysis is based on density evolution, which is a common and powerful technique in coding theory. Similar to other coding-theoretic analyses, our derivation of density evolution is based on a tree-like assumption. Here, we state the tree-like assumption first and provide the results on the probability that the tree-like assumption holds.
As we have mentioned, the association between the balls in color (nonzero elements in ) and the bins can be represented by a left-regular bipartite graph. We label the edges by an ordered pair of a ball and a bin , denoted by . We define the level neighborhood of , denoted by , as the subgraph of all the edges and nodes on paths of length less than or equal to that start from and whose first edge is not [26]. We have the following result on the probability that is a tree, or equivalently, cycle-free, for a constant .
Lemma 4.
[26] For a fixed constant , is a tree with probability at least .
We conduct the density evolution analysis conditioned on the event that is a tree for an edge which is chosen from the edges uniformly at random. Then, we will take the complementary event into consideration and complete the analysis.
B.6 Analysis on the Density Evolution
Recall that in the first iteration, we find all the singletons, and in the second iteration, we find the strong doubletons and form the giant component. Let be the probability that at the th iteration of the learning algorithm, a ball in color , which is chosen from the balls uniformly at random, is not recovered, . Here, corresponds to the probability that after the second iteration, a randomly chosen ball in color is not in the giant component. According to the previous section, we know that by choosing parameters which satisfy (6), we have with probability . Now we analyze the relationship between and for .
Consider the iterative decoding process as a message passing process. First, we know that at iteration , a ball in color passes a message to a bin through an edge claiming that it is colored, if and only if at least one of its other neighboring bins contains a resolvable multiton in color . Second, a subbin in color becomes a resolvable multiton if and only if all the other balls in this subbin are colored. This message passing process is illustrated in Figure 8. Under the tree-like assumption, the messages passed among the balls and bins are independent, so we have
which gives us
(8) 
As we can see, the major difference between the density evolution of the MixedColoring algorithm and the PhaseCode algorithm is that there is a constant probability that a bin has a subbin in color .
Next, we will show that after a constant number of iterations, can be arbitrarily small.
Lemma 5.
If we choose parameters satisfying
(9) 
then for any constant , there exists a constant , such that .
Proof.
Let , then we have . It is easy to see that , , and is a monotonically increasing function. We also have
We know that if there is
(10) 
then there exists at least one fixed point such that . We use to represent the largest fixed point of in . Now we argue that the fixed point can be made arbitrarily small by choosing proper parameters. Suppose that for a certain set of parameters and , the fixed point is , then if we keep and increase to , where is a constant, then we can see that the new fixed point is upper bounded by , and in this way, the fixed point can be made an arbitrarily small constant. As shown in [26], as long as we can choose parameters to make the fixed point , then, there exists a constant number of iterations , depending on , such that . ∎
Then, we can prove the following lemma showing that the number of uncolored balls in color is concentrated around with high probability.
Lemma 6.
Let be the number of uncolored balls in color after iterations. Then for any , there exists a constant , such that when conditioned on the event that , and is large enough,
(11) 
(12) 
The proof of Lemma 6 is the same as in [26], and uses Doob’s martingale argument and Azuma’s concentration bound. Note that the event that the tree-like assumption does not hold is already accounted for in (11). Combining Lemmas 3, 5, and 6, we have shown that for a specific color , there exist proper parameters of the algorithm such that after a constant number of iterations, the MixedColoring algorithm can recover an arbitrarily large fraction of the balls in color with probability . Using a union bound over all the colors ( is a constant), we have proved the results in Theorem 1 on the error probability.
B.7 Computational Complexity
In this part, we analyze the computational complexity of the algorithm. First, there are bins and each bin has a constant number of subbins. Since refining the measurements of each bin takes operations, the computational complexity of refining measurements is . Next, to find all the singletons, we need to check all the colored subbins; since checking each subbin takes operations, the computational complexity of this stage is . In the third stage, we find all the strong doubletons. We know that there are singleton balls and for each singleton ball, there are bins connected to it. For each of the bins, we subtract the measurements contributed by the singleton ball from the refined measurements in the subbins, and do the ratio test to see if it is a strong doubleton. Therefore, processing each bin takes operations and since is also a constant, the computational complexity of finding strong doubletons is also . Then, we get the graph with nodes and edges, corresponding to the singleton balls and strong doubletons, respectively. Using the breadth-first search algorithm, the computational complexity of finding the connected components is . In the last stage, we iteratively find other uncolored balls. For each unprocessed subbin, since we do not know the color of the subbin, there are possible remaining measurements. Each time we find a new ball, we update at most remaining measurements and do the ratio test. Therefore, it takes operations to color a new ball. Since there are uncolored balls after finding the giant component, the computational complexity of the last stage is also . So far, we have shown that the computational complexity of the MixedColoring algorithm is , which completes the proof of Theorem 1.
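The breadth-first search step mentioned in the complexity analysis can be sketched as follows. This is a generic connected-components routine whose running time is linear in the number of nodes plus the number of edges. Nodes stand for singleton balls and edges for strong doubletons, and the function name and toy graph are hypothetical.

```python
from collections import deque

def connected_components(num_nodes, edges):
    """Return the connected components of an undirected graph via BFS.
    Each node and each edge is visited a constant number of times."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen = [False] * num_nodes
    components = []
    for start in range(num_nodes):
        if seen[start]:
            continue
        seen[start] = True
        queue, comp = deque([start]), []
        while queue:
            u = queue.popleft()
            comp.append(u)
            for w in adj[u]:
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
        components.append(comp)
    return components

# Toy graph: 6 singleton balls, 3 strong doubletons (hypothetical).
comps = connected_components(6, [(0, 1), (1, 2), (3, 4)])
giant = max(comps, key=len)   # the largest component
```

Taking the largest component then costs one additional pass over the list of components, so the overall cost stays linear in the size of the graph.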
Appendix C Computing the Constants in the Sample Complexity
In this section, we give exact constants in the sample complexity results. For simplicity, we assume that and for all . We define a new notation , and then there is . We will analyze the minimum number of measurements that we need to reach a certain reliability target. More precisely, we set the maximum error floor to be , and numerically calculate the error floor for different values of , , , and . Then, we minimize the number of total measurements, which is proportional to with the constraint that the error floor . As we have shown in previous parts, the parameters should also satisfy (6) and (9). We know that if (6) is satisfied, when is large enough, there should be a giant component with size linear in for each color. where is a threshold that we can choose. Therefore, we select parameters with three constraints, which are (9), (6), and .
11  12  13  14  15  16  17  18  
6.7  8.7  1.9  3.1  5.1  1.6  0.5  7.4  
2.95  3.17  3.23  3.46  3.71  3.78  3.86  4.37  
4  4  4  3  3  3  3  3  
4  3  3  4  3  3  3  2  
35.4  34.87  35.53  34.6  33.39  34.02  34.74  34.96  
11  12  13  14  15  16  17  18  
4.4  5.2  2.7  9.2  8.8  2.8  6.2  2.3  
1.94  2.08  2.17  2.39  2.52  2.56  2.76  2.81  
7  6  6  5  5  5  5  5  
7  7  6  6  5  5  4  4  
40.74  39.52  39.06  38.24  37.80  38.4  38.64  39.34  
11  12  13  14  15  16  17  18  
7.8  8.7  8.1  5.6  4.2  3.3  4.0  5.0  
1.48  1.59  1.68  1.76  1.85  1.93  2.04  2.16  
9  9  8  8  7  7  7  6  
11  8  8  7  8  7  6  7  
42.92  41.34  40.32  40.48  40.7  40.53  40.8  41.04 
The results of the numerical calculation are shown in Table 3. In these experiments, we set , , and we fix the left degree and choose different values of , , and to minimize the number of measurements with the three constraints. Then we compare the optimal number of measurements over different choices of and find the optimal . As we can see, to reach the same reliability level, for , the optimal number of measurements we need is , , and , respectively. The number of measurements we need only increases slightly with , and the optimal is around 13 and 15.
Appendix D Details of Noisy Recovery Algorithm
In this section, we provide more details on how we robustify the MixedColoring algorithm in the presence of noise. The overall structure of the algorithm is the same as in the noiseless case. However, the ratio test that we use for indexing in the noiseless case and the summation check are both fragile to noise. Therefore, we need a different design of the query vectors. The main idea for robustifying the algorithm is to encode the location information using binary representations, i.e., binary bits, rather than relative phases. Similar methods have been used in problems such as compressive phase retrieval [13]. Further, instead of consistent pairs, we find consistent sets of measurements using a consecutive summation check.
Design of Queries.
We still design the query vectors according to the left regular bipartite graph. For a particular bin, let denote the association between this bin and the coordinates. We design query vectors , for this bin as follows: