Decoding from Pooled Data: Phase Transitions of Message Passing
We consider the problem of decoding a discrete signal of categorical variables from the observation of several histograms of pooled subsets of it. We present an Approximate Message Passing (AMP) algorithm for recovering the signal in the random dense setting where each observed histogram involves a random subset of entries of size proportional to . We characterize the performance of the algorithm in the asymptotic regime where the number of observations tends to infinity proportionally to , by deriving the corresponding State Evolution (SE) equations and studying their dynamics. We initiate the analysis of the multi-dimensional SE dynamics by proving their convergence to a fixed point, along with some further properties of the iterates. The analysis reveals sharp phase transition phenomena where the behavior of AMP changes from exact recovery to weak correlation with the signal as crosses a threshold. We derive formulae for the threshold in some special cases and show that they accurately match experimental behavior.
Consider a discrete high-dimensional signal consisting of categorical variables, for example, nucleotides in a string of DNA or country of origin for a set of people. In many real-world settings, it is infeasible to observe the entire high-dimensional signal, for reasons of cost or privacy. Instead, in a manner akin to compressed sensing, observations can be obtained in the form of “histograms” or “frequency spectra”—pooled measurements counting the occurence of each category or type across subsets of the variables. Concretely, we investigate the so-called Histogram Query Problem (): a database consisting of a population of individuals, where each individual belongs to one category among , is queried. In each query, a subset of individuals is selected, and the histogram of their types, along with the individuals in that subset are revealed. Such a data acquisition model is common in applications such as the processing of genetic data, where DNA samples from multiple sources are pooled and analyzed together [[SBC02]]. This gives rise to the inferential problem of determining the category of every individual in the population. The question of interest in this paper is to determine the minimal number of observations needed for recovery, and to ascertain whether this inferential problem can be solved in an efficient manner.
1.1 The setting
Let be an assignment of variables to categories. We denote the queried subpopulations by , . Given subsets , the histogram of categories of the pooled subpopulation is denoted by , i.e., for all ,
We let denote the vector of proportions of assigned values; i.e., the empirical distribution of categories. We place ourselves in a random dense regime in which the sets are independent draws of a random set where independently for each , for some fixed . Meaning, at each query, the size of the pool is proportional to the size of the population: .
Here we adopt a linear-algebraic formulation which will be more convenient for the presentation of the algorithm. We can represent the map , which we refer to as the planted solution, as a set of vectors , for . Let represent the sensing matrix: , for all . The histogram equations (1) can be written in the form of a linear system of equations:
Our goal can thus be rephrased as that of inverting the linear system (2). Note that the problem becomes trivial if , since the square random matrix will be invertible with high probability. However, as we review in the next section, a detailed information-theoretic analysis of the problem shows that the planted solution is uniquely determined by the above linear system for , . In this paper we study the algorithmic problem in the regime , .
1.2 Prior work
The has recently been considered in [[WHLC16], [ERK16]]. Its study was initiated in [[WHLC16]] in the two settings where the sets are deterministic and random. We review the information-theoretic and algorithmic results known so far.
Under the condition that is the uniform distribution, Wang et al. [[WHLC16]] showed a lower bound on the minimum number of queries for the problem to be well-posed, namely, if then the set of collected histograms does not uniquely determine the planted solution . Further, under the condition that , they showed that with a constant independent of , suffices to uniquely determine . These results were later generalized and sharpened in [[ERK16]], where it was shown that for arbitrary and , measurements are necessary and sufficient for to be unique, where , and is “essentially” (see [[ERK16]] for the precise formula), being the Shannon entropy function.
In the deterministic setting, where one is allowed to design the sensing matrix , i.e. choose the pools at each query, Wang et al. [[WHLC16]] provided a querying strategy that recovers provided that , where is an absolute constant. Ignoring the dependence on , this almost matches the information-theoretic limit. The random setting has not been treated so far, and is the subject of the present paper.
We present an Approximate Message Passing (AMP) algorithm for the random dense setting, where each query involves a random subset of individuals of size proportional to . We characterize the exact asymptotic behavior of the algorithm in the limit of large number of individuals and a proportionally large number of queries , i.e. . This is done by heuristically deriving the corresponding State Evolution (SE) equations corresponding to the AMP algorithm. Then, a rigorous analysis of the SE dynamics reveals a rich and interesting behavior; namely the existence of phase transition phenomena in the parameters of the problem, due to which the behavior of AMP changes radically, from exact recovery to very weak correlation with the planted solution. We exactly locate these phase transitions in simple situations, such as the binary case , the symmetric case , and the general case under the condition that the SE iteration is initialized from a special point. The latter exhibits an intriguing phenomenon: the existence of not one, but an entire sequence of thresholds in the parameter that rules the behavior of the SE dynamics. These thresholds correspond to sharp changes in the structure of the covariance matrix of the estimates output by AMP. We expect this phenomenon to be generic beyond the special initialization case studied here. Beyond the precise characterization of the phase transition thresholds in these special cases, we initiate the study of State Evolution in a multivariate setting by proving the convergence of the full-dimensional SE iteration, when initialized from a “far enough” point, to a fixed point, and show further properties of the iterate sequence. This paper is intended to be a sequel to the information-theoretic study conducted in [[ERK16]].
2 Approximate Message Passing and State Evolution
In this section we present the Approximate Message Passing (AMP) algorithm and the corresponding State Evolution (SE) equations.
2.1 The AMP algorithm
The AMP algorithm [[DMM09]], known as the Thouless-Anderson-Palmer equations in the statistical physics literature [[TAP77]], can be derived from Belief Propagation (BP) on the factor graph modeling the recovery problem. The latter is a bipartite graph of vertices. The variables constitute one side of the bipartition, and the observations constitute the other side. The adjacency structure is encoded in the sensing matrix . Endowing each edge with two messages , being the probability simplex, one can write the self-consistency equations for the messages at each node by enforcing the histogram constraints at each observation (or check) node while treating the incoming messages as probabilistically independent in the marginalization operation. The iterative version of these self-consistency equations is the BP algorithm. BP is further simplified to AMP by exploiting the fact that the factor graph is random and dense, i.e. one only needs to track the average of the messages incoming to each node. This reduces the number of passed messages from to . For the present -variate problem, the algorithm we present is a special case of Hybrid-GAMP of [[RFGS12]]. We let and be the centered and rescaled data, and assume that the parameters and are known to the algorithm. The AMP algorithm reads as follows: At iteration , we update the check nodes as
and then update the variable nodes as
where is a normalization factor so that the entries of sum to one. The map plays the role of a “thresholding function” with a matrix parameter that is adaptively tuned by the algorithm. One should compare this situation to the case of sparse estimation [[DMM09]] where the soft thresholding function is used. Here, the form taken by is adapted to the structure of the signal we seek to recover. The variables and represent estimates of the histogram and their variances. The variables and are estimators of the planted solution and their variances before thresholding, while and are the posterior estimates of and its variance, i.e., after thresholding. The algorithm can be initialized in a “non-informative” way by setting for all , and and for all for example. We defer the details of the derivation to Appendix B.
2.2 State Evolution
State Evolution (SE) [[DMM09], [BLM12]], known as the cavity method in statistical physics [[MPV90]], allows us to exactly characterize the asymptotic behavior of AMP at each time step , by tracking the evolution in time of the relevant order parameters of the algorithm. More precisely, let
The matrix tracks the average alignment of the estimates with the true solution, and their average covariance structure. The SE equations relate the values of these order parameters at to those at time in the limit , . We let and denote the respective limits of and , which we assume exist in this “replica-symmetric” regime, and let . The SE equations read
with . The matrix is the covariance matrix of the error of the estimates output by AMP at time , and can be interpreted as the average covariance matrix of the estimates themselves. Note that the parameter has disappeared from the characterization by the SE equations, just as in the information theoretic study [[ERK16]].
The full derivation of these equations is relegated to Appendix C. The main hypothesis behind the derivation, which we do not rigorously verify, is that the variables are asymptotically Gaussian, centered about and with covariance : the measure converges weakly to . We refer to [[BM11], [BLM12]] for rigorous results, the assumptions of which do not apply to this setting. It is an interesting problem to prove the exactness of the SE equations in this setting.
2.3 Simplification of SE
Here we simplify the system of SE equations above to a single iteration. This crucially relies on the following Proposition:
If , then for all we have
. In particular, is a symmetric PSD matrix, and .
The proof of the above proposition is deferred to Appendix A. We pause to make a few remarks. The assumption of the Proposition could be enforced for example by setting the initial estimates of AMP as for all . This yields , and hence . The statements in the Proposition —together referred to as the Nishimori identities in the statistical physics literature [[ZK16]]— simplify the SE equations to a single iteration on . To succinctly present this simplification, for , and , we let
Then, the SE equations can be seen to boil down to the single equation
where, recalling that , we define
where equations (5) and (6) correspond to substituting the value of and into statement (ii) of the above proposition, while the last equality (7) is just a consequence of the first two, (5) and (6). Furthermore, via elementary algebra, the coordinates of the vector can written as
2.4 The mean squared & 0-1 errors
We can measure the performance of AMP by the mean squared error of the estimates :
Since , an alternative measure of performance would be the expected 0-1 distance between a random category drawn from the multinomial and the true category , then averaged over . This error would be written as
On the other hand, the MSE in the large limit reads
so the two notions of error coincide in the limit. Note that the MSE at each step can be deduced from SE iterate at time : .
3 Analysis of the State Evolution dynamics
In this section we present our main results on the convergence of the SE iteration (4) to a fixed point, and the location of the phase transition thresholds in three special cases. We start by analyzing the SE map and present some important generic results.
3.1 Analysis of the SE map
From expression (7), we see that the map sends the positive semi-definite (PSD) cone to itself. As written, is only defined for invertible matrices , but it could be extended by continuity to singular matrices: if is in the null space of , we declare that . This convention is consistent with the limiting value of a sequence where is a sequence of invertible matrices approaching . This also has an interpretation based on an analogy with electrical circuits, which we discuss shortly. This extension will be also denoted by . It is continuous over the , and we have . Now, we state an important property of , namely that it is monotone:
The map is order-preserving on ; i.e., for all , if then .
The proof of this Proposition is conceptually simple but technical, and is thus deferred to Appendix A. Next, we adopt a combinatorial view of the structure of the SE dynamics. This will help us identity subspaces of that are left invariant by . Note that the definition of involves acting on . Additionally, it is easy to verify that for all , , and for all . Therefore, without loss of generality, we can restrict the study of the state evolution iteration to the set
since it is invariant under the dynamics. The set can be seen as the set of Laplacian matrices of weighted graphs on vertices (every edge is weighted by for ). Hence can be seen as a transformation on weighted graphs. This transformation enjoys the following invariance property:
For all , preserves the connected component structure of the graph represented by ; i.e, two distinct connected components of the graph whose Laplacian matrix is remain distinct when transformed by .
The proof relies on the concept of effective resistance. One can view a graph of Laplacian as a network of resistors with resistances . The effective resistance of an edge is the resistance of the entire network when one unit of current is injected at and collected at (or vice-versa). Its expression is a simple consequence of Kirchhoff’s law, and is equal to (see e.g. [[Spi]]). It is clear that the effective resistance of an edge is finite if and only if both its endpoints belong to the same connected component of the graph, otherwise , and . This causes to “factor” across connected components, and thus acts on them independently.
Next, let us look at the limit of for large . For invertible on , we have , since almost surely. More generally, if represents a graph with connected components, only if are in the same component. Hence, , where is the orthogonal projector onto the span of the coordinates in where , and we have
Indeed, since for all , we have by monotonicity of . Letting settles the claim. In particular, with , , and for all representing a connected graph (i.e. ), we have . We are now ready to state the main result of this subsection.
Let satisfy the conditions of the Theorem. Using and observation (10), we have . By monotonicity of , we deduce that the SE iterates form a monotone sequence: for all . Since for all , then this sequence must have a limit
We expect that for large enough, , meaning that and . This situation corresponds to perfect recovery of the planted solution by AMP. We can easily show that this is the case for
If then the SE iterates converge to for every initial point. It is currently unclear to us whether this condition is also necessary. Instead, we consider three special cases and exactly locate the phase transitions thresholds.
3.2 The binary case
In this section we treat the case , which is akin to a noiseless version of the CDMA problem [[GV05]] or the problem of compressed sensing with binary prior. In this case, the SE iteration becomes one-dimensional. Indeed, we have , with . And since this space is invariant under , the latter can be parameterized by one scalar function , defined by
Next, we compute . For , we have . Then, letting , using (6) we have
Letting , for all , the SE reduces to
The function is continuous, increasing on , and bounded (since ). Moreover, . Therefore, the sequence (13) converges to zero for all initial conditions if and only if for all , i.e.
By a change of variables , one can also write this threshold as
If , then a new stable fixed point appears and the sequence converges to it for all initial conditions , and the asymptotic MSE of the AMP algorithm is .
Figure 1 demonstrates the accuracy of the above theoretical predictions — the predicted MSE by State Evolution matches the empirical MSE of AMP on a random instance with , across the whole range of and .
3.3 The symmetric case
In this section we treat the symmetric case where all types have equal proportions: , and analyze the SE dynamics. In this situation, the half-line is stable under the application of the map , and the dynamics becomes one-dimensional if initialized on this half-line.
Assume . For all , we have
Let , and with . The matrix is the orthogonal projector on . Therefore, we have
Therefore for all ,
By permutation-invariance of the Gaussian distribution, we see that is constant on its off-diagonal entries, hence on its diagonal entries as well since . Writing , we have . Hence, with
Therefore, if the SE iteration is initialized on this half-line: , with , then for all , with
Just as in the binary case, the function is continuous, increasing and bounded with . Hence, we have convergence to zero for all initial condition if and only if for all , i.e.
Otherwise, it converges to a non-zero value for all initial conditions , and the asymptotic MSE of the AMP algorithm is . Using the change of variables , one can also write this threshold as
It is not straightforward to read off the magnitude of from the above expression. We provide a table of approximate values for several small values of :
For larger , an asymptotic expression for this threshold may be desirable. We prove the following in Appendix A:
There exist two constants such that when is large enough,
Furthermore, one can take , and .
3.4 The general case initialized with a matching
Here we consider the SE iteration in arbitrary dimension and with arbitrary proportions of types , but we initialize the dynamics from a special point that corresponds to a matching of the vertices : each edge present in the matching corresponds to its own connected component. This case reveals an interesting behavior which we suspect is generic regardless of the initialization: the existence of a sequence of thresholds ruling the behavior of the SE dynamics. Let be a matching on the set of vertices (not all vertices are necessarily part of the matching), and let be its Laplacian matrix, where edges are weighted by arbitrary positive numbers. By Proposition 3, “factors” across connected components, thus each edge in the matching will follow its own dynamics independently of the other edges. The edges not initially present in the matching remain inactive forever. For , we have , and
and therefore, using expression (6) and letting ,
Therefore, the SE iteration reduces to
for all vertices . Note that this iteration is essentially the same as the one in the binary case (3.2)-(13), where becomes and becomes . For each , the above iteration converges to the fixed point zero for every initial point if and only if
Here, we symmetrized the expression just as in the binary case (14). Arranging these thresholds as from largest to smallest we see that the fixed point of the SE iteration gains one non-zero edge at each as decreases from some large value to zero. Equivalently, gains a rank one component corresponding to the connected component constituted by that edge. It is an interesting problem to determine the behavior of the SE iteration and locate these thresholds, if they exist, beyond this simple matching case.
We presented an algorithm for decoding categorical variables of a signal from randomly pooled observations of it, and characterized its performance it terms of a state evolution equation. The analysis of this evolution revealed phase transition phenomena in the parameters of the problem that happen in the linear regime . These algorithmic results, combined with information-theoretic ones [[WHLC16], [ERK16]] leave a large region in parameter space () where the signal is identifiable but AMP fails at recovering it, hinting at a possible computational hardness in this structured signal recovery problem. This could have interesting applications in privacy-related considerations. Further, we proved the convergence of the SE dynamics to a fixed point. The analysis of the properties of this fixed point as a function of the parameters in the general case, together with rigorous proof of the exactness of the state evolution equations for this problem are interesting open problems.
Part of this work was performed when FK and LZ were visiting the Simons Institute for the Theory of Computing at UC Berkeley. FK acknowledges funding from the EU (FP/2007-2013/ERC grant agreement 307087-SPARCS). MJ acknowledges the support of the Mathematical Data Science program of the Office of Naval Research under grant number N00014-15-1-2670.
Appendix A Omitted proofs
a.1 Proof of Proposition 1
We proceed by induction. Now assume that and that . We prove that and then that is symmetric and .
The first step is to show that . By assumption, . Let us define,
The th coordinate of is