This paper derives fundamental limits associated with compressive classification of Gaussian mixture source models. In particular, we offer an asymptotic characterization of the behavior of the (upper bound to the) misclassification probability associated with the optimal Maximum-A-Posteriori (MAP) classifier that depends on quantities that are dual to the concepts of diversity gain and coding gain in multi-antenna communications. The diversity, which is shown to determine the rate at which the probability of misclassification decays in the low noise regime, is shown to depend on the geometry of the source, the geometry of the measurement system and their interplay. The measurement gain, which represents the counterpart of the coding gain, is also shown to depend on geometrical quantities. It is argued that the diversity order and the measurement gain also offer an optimization criterion to perform dictionary learning for compressive classification applications.
Classification of high dimensional signals is fundamental to the broad fields of signal processing and machine learning. The aim is to increase speed and reliability while reducing the complexity of discrimination. An approach that has attracted a great deal of current interest is Compressed Sensing (CS) which seeks to capture important attributes of high-dimensional sparse signals from a small set of linear projections. The observation [1, 2] that captured the imagination of the signal processing community is that it is possible to guarantee fidelity of reconstruction from random linear projections when the source signal exhibits sparsity with respect to some dictionary.
Within CS the challenge of signal reconstruction has attracted the greatest attention, but our focus is different. We are interested in detection rather than estimation, in problems such as hypothesis testing, pattern recognition and anomaly detection that can be viewed as instances of signal classification. It is also natural to employ compressive measurement here since it may be possible to discriminate between signal classes using only partial information about the source signal. The challenge now becomes that of designing measurements that ignore signal features with little discriminative power. In fact, we would argue that the compressive nature of CS makes the paradigm a better fit to classification than to reconstruction.
Compressive classification appears in the machine learning literature as feature extraction or supervised dimensionality reduction. Approaches based on geometrical characterizations of the source have been developed; some, such as linear discriminant analysis (LDA) and principal component analysis (PCA), depend only on second-order statistics. Approaches based on higher-order statistics of the source have also been developed [3, 4, 5, 6, 7, 8, 9].
In this paper we derive fundamental limits on compressive classification by drawing on a measure of operational relevance, the probability of misclassification. We assume that the source signal is described by a Gaussian Mixture Model (GMM) that has already been learned. This assumption is motivated in part by image processing, where GMMs have been used very successfully to describe patches extracted from natural images [10]. Our main contribution is a characterization of the probability of misclassification as a function of the geometry of the individual classes, their interplay and the number of measurements. We show that the fundamental limits of signal classification are determined by quantities that can be interpreted as the duals of quantities that determine the fundamental limits of multi-antenna communication systems. These quantities include the diversity order and the coding gain, which characterize the error probability in multiple input multiple output (MIMO) systems in the regime of high signal-to-noise ratio (SNR) [11, 12]. We note that wireless communication involves classification rather than reconstruction, since the aim is to discriminate the transmitted codeword from the alternative codewords.
We use the following notation: boldface upper-case letters denote matrices ($\mathbf{X}$), boldface lower-case letters denote column vectors ($\mathbf{x}$) and italics denote scalars ($x$); the context defines whether the quantities are deterministic or random. The symbol $\mathbf{I}$ represents the identity matrix. The operators $(\cdot)^T$, $\mathrm{rank}(\cdot)$, $\det(\cdot)$ and $\mathrm{pdet}(\cdot)$ represent the transpose operator, the rank operator, the determinant operator and the pseudo-determinant operator, respectively. The symbol $\log(\cdot)$ denotes the natural logarithm. For reasons of space, we relegate the mathematical proofs of our results to an upcoming journal paper [13].
II The Compressive Classification Problem
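We consider a classification problem in the presence of compressive and noisy measurements. In particular, we use the standard measurement model given by:
$$\mathbf{y} = \mathbf{\Phi}\mathbf{x} + \mathbf{n}, \tag{1}$$
where $\mathbf{y} \in \mathbb{R}^M$ represents the measurement vector, $\mathbf{x} \in \mathbb{R}^N$ represents the source vector, $\mathbf{\Phi} \in \mathbb{R}^{M \times N}$ represents the measurement matrix and $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$ represents standard white Gaussian noise.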
We take the measurement matrix to be such that its elements are drawn independently from a zero-mean Gaussian distribution with a certain fixed variance, which is common in various CS problems [1, 2]. We also take the source signal to follow the well-known GMM, which has been shown to lead to state-of-the-art results in various classification applications including hyper-spectral imaging and digit recognition. This model assumes that the source signal is drawn from one out of $L$ classes $C_1, \ldots, C_L$, with probability $P_i$, $i = 1, \ldots, L$, and that the distribution of the source conditioned on $C_i$ is Gaussian with mean $\boldsymbol{\mu}_i$ and (possibly rank-deficient) covariance matrix $\boldsymbol{\Sigma}_i$.
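The generative model described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the dimensions, the noise variance, the `1/M` entry variance of the measurement matrix and the two zero-mean classes are hypothetical choices, and each rank-deficient covariance is represented through a factor matrix `B_k` so that the class covariance is `B_k B_k^T`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, sigma2 = 8, 4, 1e-2            # hypothetical source dimension, measurements, noise variance

# Two hypothetical classes; covariance B_k B_k^T has rank 2 and 3 respectively
factors = [rng.standard_normal((N, 2)), rng.standard_normal((N, 3))]
means = [np.zeros(N), np.zeros(N)]
priors = np.array([0.5, 0.5])

# Measurement matrix with i.i.d. zero-mean Gaussian entries (variance 1/M is an assumption)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)

def sample(rng):
    """Draw a class label, a source vector x and its noisy compressive measurement y."""
    k = int(rng.choice(len(priors), p=priors))
    z = rng.standard_normal(factors[k].shape[1])
    x = means[k] + factors[k] @ z                      # x | class k ~ N(mu_k, B_k B_k^T)
    y = Phi @ x + np.sqrt(sigma2) * rng.standard_normal(M)
    return k, x, y

k, x, y = sample(rng)
```

Representing each class through a factor matrix keeps the low-rank structure explicit, which is the structure the rank-based quantities in the analysis depend on.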
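The objective is to produce an estimate of the true signal class given the measurement vector. The Maximum-A-Posteriori (MAP) classifier, which minimizes the probability of misclassification, produces the estimate given by:
$$\hat{C} = \arg\max_{C_i} P(C_i \mid \mathbf{y}) = \arg\max_{C_i} p(\mathbf{y} \mid C_i)\, P_i, \tag{2}$$
where $P(C_i \mid \mathbf{y})$ represents the a posteriori probability of class $C_i$ given the measurement vector $\mathbf{y}$, $p(\mathbf{y} \mid C_i)$ represents the probability density function of the measurement vector $\mathbf{y}$ given the class $C_i$, and $P_i$ is the prior probability of class $C_i$.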
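As a minimal sketch of the MAP rule, assume the linear Gaussian model above, so that the measurement conditioned on a class is itself Gaussian with the projected mean and the projected covariance plus the noise covariance. The function name `map_classify` and the toy parameters are hypothetical.

```python
import numpy as np

def map_classify(y, Phi, means, covs, priors, sigma2):
    """Return the index of the class maximizing log prior + Gaussian log-likelihood of y."""
    M = Phi.shape[0]
    scores = []
    for mu, Sigma, p in zip(means, covs, priors):
        m = Phi @ mu                                     # projected class mean
        S = Phi @ Sigma @ Phi.T + sigma2 * np.eye(M)     # projected covariance plus noise
        r = y - m
        _, logdet = np.linalg.slogdet(S)
        quad = r @ np.linalg.solve(S, r)
        # The constant -(M/2) log(2*pi) is common to all classes and is dropped
        scores.append(np.log(p) - 0.5 * (logdet + quad))
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
N, M = 5, 3
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
means = [np.zeros(N), 10.0 * np.ones(N)]
covs = [np.eye(N), np.eye(N)]
priors = [0.5, 0.5]
y = Phi @ means[1]                                       # noiseless measurement of the second class mean
label = map_classify(y, Phi, means, covs, priors, sigma2=1e-2)
```

Working with log-determinants and linear solves rather than explicit densities avoids underflow when the noise variance is small.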
We base the analysis, in line with the standard practice in multiple-antenna communications systems [11, 12], on an upper bound $\bar{P}_{\mathrm{err}}$ to the probability of misclassification of the MAP classifier, rather than the exact probability of misclassification $P_{\mathrm{err}}$. We also base the analysis on two fundamental metrics that characterize the asymptotic performance of the upper bound to the probability of misclassification in the low noise regime, which is relevant to various emerging classification tasks. In particular, we define the diversity order of the measurement model in (1) as:
$$d = \lim_{\sigma^2 \to 0} \frac{\log \bar{P}_{\mathrm{err}}(\sigma^2)}{\log \sigma^2}, \tag{3}$$
that determines how fast (the upper bound to) the misclassification probability decays at low noise levels, and the measurement gain of the measurement model in (1) as:
$$g_m = \lim_{\sigma^2 \to 0} \sigma^2 \cdot \frac{1}{\sqrt[d]{\bar{P}_{\mathrm{err}}(\sigma^2)}}, \tag{4}$$
that determines the offset of (the upper bound to) the misclassification error probability at low noise levels. These quantities admit a counterpart in multiple-antenna communications: for example, the measurement gain corresponds to the standard coding gain. It turns out that the behavior of the upper bound to the misclassification probability closely mimics the behavior of the exact misclassification probability, as shown in the sequel, bearing witness to the value of the approach.
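The characterization of the performance measures in (3) and (4) will be expressed via quantities that relate to the geometry of the measurement model, namely, the rank and the pseudo-determinant of certain matrices. In particular, we define the behavior of (3) and (4) via the geometry of the linear transformation of the source signal effected by the measurement “channel”, by using the quantities:
$r_i = \mathrm{rank}(\mathbf{\Phi}\boldsymbol{\Sigma}_i\mathbf{\Phi}^T)$ and $v_i = \mathrm{pdet}(\mathbf{\Phi}\boldsymbol{\Sigma}_i\mathbf{\Phi}^T)$, which measure the dimension and volume, respectively, of the sub-space spanned by the linear transformation of the signals in class $C_i$;
$r_{ij} = \mathrm{rank}(\mathbf{\Phi}(\boldsymbol{\Sigma}_i + \boldsymbol{\Sigma}_j)\mathbf{\Phi}^T)$ and $v_{ij} = \mathrm{pdet}(\mathbf{\Phi}(\boldsymbol{\Sigma}_i + \boldsymbol{\Sigma}_j)\mathbf{\Phi}^T)$, which measure the dimension and volume, respectively, of the union of sub-spaces spanned by the linear transformation of the signals in classes $C_i$ or $C_j$.
We also define the behavior of (3) and (4) via the geometry of the source signal, by using the quantities:
$r_{\Sigma_i} = \mathrm{rank}(\boldsymbol{\Sigma}_i)$, which relates to the dimension of the sub-space spanned by input signals in $C_i$;
$r_{\Sigma_{ij}} = \mathrm{rank}(\boldsymbol{\Sigma}_i + \boldsymbol{\Sigma}_j)$, which relates to the dimension of the union of sub-spaces spanned by input signals in $C_i$ or $C_j$.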
We argue that this two-step approach casts further insight into the characteristics of the compressive classification problem, by allowing us to untangle in a systematic manner the effect of the measurement matrix and the effect of the source geometry.
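The rank and pseudo-determinant quantities above are straightforward to compute numerically. The sketch below assumes that the relevant matrices are the projected class covariances and their sum, and that the pseudo-determinant is the product of the non-zero eigenvalues; the helper names, sizes and relative tolerance are hypothetical choices.

```python
import numpy as np

def rank_and_pdet(A, rel_tol=1e-9):
    """Rank and pseudo-determinant (product of non-zero eigenvalues) of a symmetric PSD matrix."""
    eig = np.linalg.eigvalsh(A)
    nonzero = eig[eig > rel_tol * max(eig.max(), 1e-300)]
    return int(nonzero.size), float(np.prod(nonzero))

rng = np.random.default_rng(0)
N = 10
B1, B2 = rng.standard_normal((N, 2)), rng.standard_normal((N, 3))
Sigma1, Sigma2 = B1 @ B1.T, B2 @ B2.T          # class covariances of rank 2 and 3

def projected_geometry(M):
    """Ranks and pseudo-determinants after projection by a random M x N Gaussian matrix."""
    Phi = rng.standard_normal((M, N)) / np.sqrt(M)
    r1, v1 = rank_and_pdet(Phi @ Sigma1 @ Phi.T)
    r2, v2 = rank_and_pdet(Phi @ Sigma2 @ Phi.T)
    r12, v12 = rank_and_pdet(Phi @ (Sigma1 + Sigma2) @ Phi.T)
    return (r1, r2, r12), (v1, v2, v12)

for M in range(1, 8):
    print(M, projected_geometry(M)[0])
```

With a random Gaussian measurement matrix, each projected rank equals the smaller of the number of measurements and the corresponding source rank with probability 1, which is what allows the measurement-side quantities to be rewritten in terms of the source geometry.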
III The Case of Two Classes
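We now consider a two-class compressive classification problem. The Bhattacharyya bound, which represents a specialization of the Chernoff bound, leads to an upper bound to the probability of misclassification given by:
$$\bar{P}_{\mathrm{err}} = \sqrt{P_1 P_2}\; e^{-K_{12}}, \tag{5}$$
where
$$K_{ij} = \frac{1}{8} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T \mathbf{\Phi}^T \left[ \frac{\mathbf{\Phi}(\boldsymbol{\Sigma}_i + \boldsymbol{\Sigma}_j)\mathbf{\Phi}^T + 2\sigma^2 \mathbf{I}}{2} \right]^{-1} \mathbf{\Phi} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) + \frac{1}{2} \log \frac{\det\left( \frac{\mathbf{\Phi}(\boldsymbol{\Sigma}_i + \boldsymbol{\Sigma}_j)\mathbf{\Phi}^T + 2\sigma^2 \mathbf{I}}{2} \right)}{\sqrt{\det(\mathbf{\Phi}\boldsymbol{\Sigma}_i\mathbf{\Phi}^T + \sigma^2 \mathbf{I}) \det(\mathbf{\Phi}\boldsymbol{\Sigma}_j\mathbf{\Phi}^T + \sigma^2 \mathbf{I})}}. \tag{6}$$
The Bhattacharyya-based upper bound to the probability of misclassification encapsulated in (5) and (6) is the basis of the ensuing analysis. This analysis treats separately the case where the classes are zero-mean, i.e. $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \mathbf{0}$, and the case where the classes are nonzero-mean, i.e. $\boldsymbol{\mu}_1 \neq \mathbf{0}$ or $\boldsymbol{\mu}_2 \neq \mathbf{0}$. The zero-mean case exhibits the main operational features of the compressive classification problem; the nonzero-mean case occasionally exhibits additional operational features, e.g. infinite diversity order.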
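For concreteness, the two-class bound can be evaluated numerically. The sketch below assumes the textbook Bhattacharyya bound for two Gaussian densities, applied to the measurement distributions induced by the linear model with additive white Gaussian noise; the function name and argument order are hypothetical.

```python
import numpy as np

def bhattacharyya_bound(Phi, mu1, Sigma1, mu2, Sigma2, p1, sigma2):
    """Upper bound sqrt(p1 p2) exp(-K) on the two-class misclassification probability."""
    M = Phi.shape[0]
    S1 = Phi @ Sigma1 @ Phi.T + sigma2 * np.eye(M)   # covariance of y given class 1
    S2 = Phi @ Sigma2 @ Phi.T + sigma2 * np.eye(M)   # covariance of y given class 2
    S = 0.5 * (S1 + S2)
    d = Phi @ (mu1 - mu2)                            # projected mean difference
    logdet = lambda A: np.linalg.slogdet(A)[1]
    K = 0.125 * d @ np.linalg.solve(S, d) + 0.5 * logdet(S) - 0.25 * (logdet(S1) + logdet(S2))
    return float(np.sqrt(p1 * (1.0 - p1)) * np.exp(-K))

rng = np.random.default_rng(0)
N, M = 6, 4
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
B1, B2 = rng.standard_normal((N, 2)), rng.standard_normal((N, 2))
zero = np.zeros(N)
same = bhattacharyya_bound(Phi, zero, B1 @ B1.T, zero, B1 @ B1.T, 0.5, 1e-3)
diff = bhattacharyya_bound(Phi, zero, B1 @ B1.T, zero, B2 @ B2.T, 0.5, 1e-3)
```

When the two classes are identical the exponent vanishes and the bound reduces to the square root of the product of the priors; any difference in mean or covariance makes the exponent positive and tightens the bound.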
III-A Zero-Mean Classes
The following Theorem offers a view of the asymptotic behavior of the probability of misclassification for the two-class compressive classification problem with zero-mean classes, by leveraging directly the geometry of the linear transformation of the source signal effected by the measurement “channel”.
The following Theorem now describes the asymptotic behavior of the probability of misclassification for the two-class compressive classification problem with zero-mean classes, by leveraging instead the geometry of the source signals. The result uses the fact that and, with probability 1, , and . The result also assumes, without loss of generality, that .
The characterization encapsulated in Theorem 1 admits a very simple interpretation:
if , then the sub-spaces spanned by the signals in classes 1 and 2 overlap completely – the upper bound to the misclassification probability exhibits an error floor because it is not possible to distinguish the classes perfectly as the noise level approaches zero;
if then the sub-spaces spanned by the signals in classes 1 and 2 do not overlap completely – the upper bound to the misclassification error probability (and the true error probability) then does not exhibit an error floor as it is possible to distinguish the classes perfectly as the noise level approaches zero. The lower the degree of overlap, the higher the diversity order – this is measured via the interplay of the various ranks, , and ;
the scenario is not possible in view of the geometry of the two-class problem.
On the other hand, the characterization encapsulated in Theorem 2 offers a means to articulate the interplay between the number of measurements and the source geometry. Of particular importance:
if , the upper bound will exhibit an error floor at low noise levels; conversely, if and the upper bound will not exhibit such an error floor at low noise levels;
in addition, if additional measurements will have no impact on diversity order.
Overall, it is possible to argue that the diversity order is a function of the difference between the sub-spaces associated with the two classes, which is given by (9): by gradually increasing the number of measurements from 1 up to it is possible to extract the highest diversity level equal to (16); however, increasing the number of measurements past does not offer a higher diversity level – instead, it only translates into a higher measurement gain. One then understands the role of measurement as a way to probe the differences between the classes.
In contrast, the measurement gain is a function of the exact geometry of the classes in the Gaussian mixture model. It increases with the ratio of the product of the non-zero eigenvalues of to the product of the singular values of and .
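The dependence of the diversity order on the number of measurements can be checked numerically. The sketch below takes the diversity order to be the slope of the logarithm of the bound against the logarithm of the noise variance, and uses the zero-mean Bhattacharyya expression for two Gaussians; both are assumptions of this illustration. The comparison value `predicted` follows from counting the eigenvalues of each determinant that vanish with the noise, and is a derivation made here rather than a restatement of the Theorems. Source ranks 2 and 3 (rank 5 for the sum) are hypothetical.

```python
import numpy as np

def log_bound(S1, S2, p1, sigma2):
    """Log of the zero-mean Bhattacharyya bound for projected covariances S1 and S2."""
    M = S1.shape[0]
    logdet = lambda A: np.linalg.slogdet(A + sigma2 * np.eye(M))[1]
    K = 0.5 * logdet(0.5 * (S1 + S2)) - 0.25 * (logdet(S1) + logdet(S2))
    return 0.5 * np.log(p1 * (1.0 - p1)) - K

def diversity_estimate(M, N=10, r1=2, r2=3, seed=0):
    """Finite-difference slope of log(bound) against log(noise variance) at low noise."""
    rng = np.random.default_rng(seed)
    B1, B2 = rng.standard_normal((N, r1)), rng.standard_normal((N, r2))
    Phi = rng.standard_normal((M, N)) / np.sqrt(M)
    S1, S2 = Phi @ B1 @ B1.T @ Phi.T, Phi @ B2 @ B2.T @ Phi.T
    lo, hi = 1e-10, 1e-8
    return (log_bound(S1, S2, 0.5, hi) - log_bound(S1, S2, 0.5, lo)) / (np.log(hi) - np.log(lo))

def predicted(M, r1=2, r2=3, r12=5):
    """Slope obtained by counting vanishing eigenvalues (derivation made for this sketch)."""
    return 0.5 * min(M, r12) - 0.25 * (min(M, r1) + min(M, r2))

for M in range(1, 8):
    print(M, round(diversity_estimate(M), 3), predicted(M))
```

In this toy configuration the slope is zero while the number of measurements does not exceed the smaller source rank (an error floor), grows with each extra measurement, and stops growing once the number of measurements reaches the rank of the sum of the covariances, in line with the discussion above.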
We note that there is often flexibility in the definition of the properties of the signal classes of a GMM, i.e. the dictionary. The measurement gain, and to a certain extent the diversity gain, can then provide an optimization criterion for dictionary design for compressive classification applications.
III-B Nonzero-Mean Classes
The following Theorem generalizes the description of the asymptotic behavior of the probability of misclassification from the zero-mean to the nonzero-mean two-class compressive classification problem.
Consider the measurement model in (1) where with probability and with probability and . If
then the upper bound to the probability of misclassification in (5) decays exponentially with as ; otherwise,
The characterization embodied in Theorem 3 illustrates that the asymptotic behavior of the upper bound of the error probability for classes with non-zero mean can be different from that for classes with zero mean. The differences in behavior trace back to the fact that represents a necessary condition for condition (17) to hold. In the non-zero mean case, choosing leads to a diversity order ; in contrast, in the zero-mean case choosing does not affect the diversity order. Letting induces the same diversity order both for nonzero-mean and zero-mean classes. The presence of the nonzero-mean here then impacts only the measurement gain.
One concludes that in the non-zero mean case, increasing the number of measurements past can have a dramatic effect on the performance. Geometrically, this result reflects the fact that, when embedded in a higher dimensional space ( in our cases), the affine spaces corresponding to the classes are separated when .
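The geometric mechanism behind this separation can be illustrated numerically. The sketch assumes that the relevant event is the projected mean difference having a component outside the range of the projected sum of covariances; the exact form of condition (17) is not reproduced here, and all names and sizes are hypothetical. In that event the quadratic (mean) term of the Bhattacharyya exponent grows like the inverse of the noise variance, so the bound decays exponentially.

```python
import numpy as np

def mean_geometry(M, N=8, rank=3, seed=0):
    """Projected covariance sum A, projected mean difference delta, and the squared
    norm of the component of delta lying outside the range of A."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((N, rank))               # Sigma1 + Sigma2 = B B^T (hypothetical)
    Phi = rng.standard_normal((M, N)) / np.sqrt(M)
    delta = Phi @ rng.standard_normal(N)             # Phi (mu1 - mu2) for a random mean difference
    A = Phi @ B @ B.T @ Phi.T
    U = np.linalg.svd(Phi @ B, full_matrices=False)[0]   # orthonormal basis of range(A)
    resid = delta - U @ (U.T @ delta)
    return A, delta, float(resid @ resid)

A, delta, outside = mean_geometry(M=6)               # more measurements than the rank of B B^T
for sigma2 in (1e-2, 1e-4, 1e-6):
    quad = delta @ np.linalg.solve(A + sigma2 * np.eye(A.shape[0]), delta)
    print(sigma2, sigma2 * quad, outside)            # sigma2 * quad approaches `outside`
```

With as many measurements as the rank of the covariance sum, or fewer, the projected covariance generically fills the whole measurement space, the outside component vanishes, and the mean term stays bounded as the noise vanishes.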
IV The Case of Multiple Classes
We now consider a multiple class compressive classification problem, where . The generalization of the two-class results to the multiple-class case is possible by using the union bound in conjunction with the two-class Bhattacharyya bound.
The combination of the union bound with the Bhattacharyya bound leads immediately to an upper bound to the probability of misclassification given by:
and $K_{ij}$ is given by (6).
The upper bound in (19) has the same form as the upper bound in (5), with additional pair-wise misclassification terms that capture the interaction between the different classes. The results encapsulated in the previous Theorems therefore generalize immediately.
In particular, we can argue that the upper bound to the misclassification probability will exhibit an error floor if at least one of the pair-wise misclassification probabilities also exhibits an error floor. Conversely, the misclassification probability will tend to zero as the noise variance $\sigma^2$ tends to zero if all the pair-wise misclassification probabilities also tend to zero.
The diversity order of the multiple-class misclassification probability now corresponds to the lowest diversity order of the pair-wise misclassification probabilities. Similarly, the measurement gain of the multiple-class misclassification probability corresponds to the measurement gain of the pair-wise misclassification probability associated with the lowest diversity order. Therefore, it is possible to capitalize immediately on the results embodied in Theorems 1, 2 and 3 to understand the behavior of the multiple-class compressive classification problem.
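A minimal sketch of a multiple-class bound built from pair-wise terms is given below. It assumes the standard combination of the union bound with the Gaussian Bhattacharyya bound, summing one term per unordered pair of classes; the exact constants of (19) may differ, and the function names are hypothetical. Returning the individual terms makes it easy to see which pair dominates at a given noise level.

```python
import numpy as np
from itertools import combinations

def bhattacharyya_exponent(m_i, S_i, m_j, S_j):
    """Bhattacharyya distance between N(m_i, S_i) and N(m_j, S_j)."""
    S = 0.5 * (S_i + S_j)
    d = m_i - m_j
    logdet = lambda A: np.linalg.slogdet(A)[1]
    return 0.125 * d @ np.linalg.solve(S, d) + 0.5 * logdet(S) - 0.25 * (logdet(S_i) + logdet(S_j))

def union_bhattacharyya_bound(Phi, means, covs, priors, sigma2):
    """Sum of pair-wise Bhattacharyya bounds, plus the individual pair-wise terms."""
    M = Phi.shape[0]
    proj = [(Phi @ mu, Phi @ Sig @ Phi.T + sigma2 * np.eye(M)) for mu, Sig in zip(means, covs)]
    terms = {}
    for i, j in combinations(range(len(priors)), 2):
        K = bhattacharyya_exponent(*proj[i], *proj[j])
        terms[(i, j)] = float(np.sqrt(priors[i] * priors[j]) * np.exp(-K))
    return sum(terms.values()), terms

rng = np.random.default_rng(0)
N, M = 8, 5
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
covs = [B @ B.T for B in (rng.standard_normal((N, 2)) for _ in range(3))]
means = [np.zeros(N)] * 3
total, terms = union_bhattacharyya_bound(Phi, means, covs, [1 / 3] * 3, sigma2=1e-4)
worst_pair = max(terms, key=terms.get)               # pair that dominates the sum
```

Because the total is a sum of pair-wise terms, its low-noise slope is set by the slowest-decaying term, which is the sense in which the worst pair of classes determines the diversity order.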
V Numerical Results
We now present a series of results that illustrate the main operational features of compressive classification of Gaussian mixture models. In particular, we consider three experiments where the exact covariance matrices and/or mean vectors of the classes have been generated randomly:
Figure 1 shows that, for , the upper bound exhibits an error floor, as expected. As increases () we can observe that i) it becomes possible to drive the upper bound to the misclassification probability to zero at low noise levels, and therefore perform classification without errors; and ii) the diversity gain also increases with . We can also verify that if , having more measurements does not increase the diversity gain, but yields a higher measurement gain.
In Figure 3 we can observe that the upper bound to the misclassification probability is, indeed, dominated by the behavior of the worst pair of classes (which in such scenario corresponds to the pair (), yielding a maximum diversity of , for ), by comparing with any other pair-wise upper bound, (e.g. the behavior of the pair () also depicted in the figure).
This paper studies fundamental limits in compressive classification of Gaussian mixture models. In particular, it is shown that the asymptotic behavior of (the upper bound to) the misclassification probability, which is intimately linked to the geometrical properties of the source and the measurement system, also captures well the behavior of the true misclassification probability. Moreover, it is recognized that the key quantities that determine the asymptotic behavior of the misclassification probability are akin to standard quantities used to characterize the behavior of the error probability in multiple-antenna communications: diversity and coding gain.
The practical relevance of the results, beyond the theoretical insight, relates to the possibility of integrating the asymptotic characterizations with dictionary learning methods for compressive classification. The diversity order and the measurement gain, in view of their links to the geometry of the measurement system and the geometry of the source, offer a means to pose optimization problems for constructing dictionaries with good discriminative power.
[1] E. Candes, J. K. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, Aug. 2006.
[2] D. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[3] J. Wright, A. Y. Yang, A. Ganesh, S. Sankar Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, Feb. 2009.
[4] L. Liu and P. Fieguth, “Texture classification from random features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 574–586, Mar. 2012.
[5] R. Calderbank and S. Jafarpour, “Finding needles in compressed haystacks,” in Proceedings of 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2012, pp. 3441–3444.
[6] D. Erdogmus and J. C. Principe, “Lower and upper bounds for misclassification probability based on Renyi’s information,” Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 37, no. 2–3, pp. 305–317, 2004.
[7] Z. Nenadic, “Information discriminant analysis: Feature extraction with an information-theoretic objective,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 8, pp. 1394–1407, Aug. 2007.
[8] M. Chen, W. Carson, M. R. D. Rodrigues, R. Calderbank, and L. Carin, “Communication inspired linear discriminant analysis,” in Proceedings of the 29th International Conference on Machine Learning, Jun. 2012.
[9] W. R. Carson, M. R. D. Rodrigues, M. Chen, L. Carin, and R. Calderbank, “Communications-inspired projection design with application to compressive sensing,” SIAM Journal on Imaging Sciences, vol. 5, no. 4, pp. 1185–1212, 2012.
[10] M. Chen, J. Silva, J. Paisley, C. Wang, D. Dunson, and L. Carin, “Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds,” IEEE Transactions on Signal Processing, vol. 58, no. 12, pp. 6140–6155, Dec. 2010.
[11] V. Tarokh, N. Seshadri, and A. Calderbank, “Space-time codes for high data rate wireless communication: performance criterion and code construction,” IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 744–765, Mar. 1998.
[12] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge, U.K.: Cambridge University Press, 2005.
[13] H. Reboredo, F. Renna, R. Calderbank, and M. R. D. Rodrigues, “Compressive classification: Fundamental limits and applications,” in preparation.
[14] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd Edition). New York, NY: Wiley-Interscience, 2000.
[15] L. Zheng and D. Tse, “Diversity and multiplexing: a fundamental tradeoff in multiple-antenna channels,” IEEE Transactions on Information Theory, vol. 49, no. 5, pp. 1073–1096, May 2003.
[16] A. Paulraj, R. Nabar, and D. Gore, Introduction to Space-Time Wireless Communications. Cambridge, U.K.: Cambridge University Press, 2008.
[17] A. Bhattacharyya, “On a measure of divergence between two statistical populations defined by their probability distributions,” Bulletin of the Calcutta Mathematical Society, vol. 45, pp. 99–109, 1943.
[18] T. Wimalajeewa, H. Chen, and P. Varshney, “Performance limits of compressive sensing-based signal classification,” IEEE Transactions on Signal Processing, vol. 60, no. 6, pp. 2758–2770, Jun. 2012.