Canonical correlation analysis of high-dimensional data with very small sample support
This paper is concerned with the analysis of correlation between two high-dimensional data sets when there are only few correlated signal components but the number of samples is very small, possibly much smaller than the dimensions of the data. In such a scenario, a principal component analysis (PCA) rank-reduction preprocessing step is commonly performed before applying canonical correlation analysis (CCA). We present simple, yet very effective approaches to the joint model-order selection of the number of dimensions that should be retained through the PCA step and the number of correlated signals. These approaches are based on reduced-rank versions of the Bartlett-Lawley hypothesis test and the minimum description length information-theoretic criterion. Simulation results show that the techniques perform well for very small sample sizes even in colored noise.
Bartlett-Lawley statistic, canonical correlation analysis, model-order selection, principal component analysis, small sample support.
Correlation analysis based on only small sample support is a challenging task yet with important applications in areas as diverse as biomedicine (e.g. [1, 2]), climate science (e.g. [3, 4]), array processing (e.g. ), and others. In this paper, we look at the scenario where the data sets have large dimensions but there are only few correlated signal components. Probably the most common way of analyzing correlation between two data sets is canonical correlation analysis (CCA) . In CCA, the observed data and are transformed into -dimensional internal (latent) representations and , where , using linear transformations described by the matrices and . The key idea is to determine and such that most of the correlation between and is captured in a low-dimensional subspace.
CCA proceeds as follows. First two vectors (“projectors”) and are determined such that the absolute value of the scalar correlation coefficient between the internal variables and is maximized. The internal variables constitute the first pair of canonical variables, and is called the first canonical correlation (coefficient). The next pair of canonical variables maximizes the absolute value of the scalar correlation coefficient (the second canonical correlation) between and , subject to the constraint that they are to be uncorrelated with the first pair. A total of correlations is determined in this manner, and , . CCA can be performed via the singular value decomposition of the coherence matrix 
where is the cross-covariance matrix between and , and and are the auto-covariance matrices of and . The canonical correlations are the singular values, which are the diagonal elements of the diagonal matrix . The transformations that generate the latent representations and are then described by and .
In practice, we do not know the covariance matrices and must estimate them from samples. If CCA is performed based on sample covariance matrices, it leads to sample canonical correlations . If the number of samples is not significantly larger than the dimensions and , these ’s can be extremely misleading as they are generally substantially overestimated. Indeed, if then sample canonical correlations are always identically one, which means that they do not carry any information at all about the true population canonical correlations . In order to avoid this, we perform a dimension-reduction preprocessing step before applying CCA. The most common type of preprocessing is principal component analysis (PCA). That is, instead of applying (1) directly to the sample covariance matrices, we first extract a reduced number of components from that account for a large fraction of the total variance in . Similarly, we extract components from that account for a large fraction of the total variance in . CCA is then performed on the components extracted from and . The necessity of a PCA step preceding CCA for small sample sizes was shown in  using random matrix theory tools. The paper , however, did not answer the critical question of how to determine and such that the estimated ’s best reflect the true population canonical correlations .
At the same time, a key question in any correlation analysis is how many correlated signals there are. If we had access to the population canonical correlations, we could simply count the number of nonzero ’s. Since we don’t, we need to estimate the number of correlated signals from the estimated ’s. This is a model-order selection problem. In this paper, we present approaches to jointly determine, for a PCA-CCA setup, the ranks and of the PCA step and the number of correlated signals based on extremely small sample support, with possibly less or even substantially less than . These approaches rely on the fact that, while and may be very large, the number of correlated signals is often small. However, a complicating factor of the PCA-CCA setup is that PCA is designed to extract components that account for most of the variance within one data set, but these components are not necessarily the ones that account for most of the correlation between two data sets.
In the literature, most of the work on model-order selection deals with either (i) determining the number of signals in a single data set [10, 11, 12] or (ii) the number of correlated signals between two data sets, but without a PCA step [13, 14, 15, 16, 17, 18, 19]. There is only little work on the joint model-order selection in a PCA-CCA setup, most of which is rather ad hoc [20, 21] and only  presents a systematic approach. However, none of these joint PCA-CCA techniques works in the sample-poor case. In the absence of any methodical approach in the sample-poor regime, it is common to use very simple rules of thumb such as “choose the PCA ranks such that a certain percentage (e.g., 70%) of the total variance/energy in each data set is retained” (see, e.g., ). Needless to say, such rules based on experience only work for specific scenarios.
In general, there are two main approaches to model-order selection: hypothesis tests and information-theoretic criteria. Hypothesis tests [13, 14] are usually series of binary generalized likelihood ratio tests (GLRTs). Starting at , they test whether the model has order (the null hypothesis) or order greater than (the alternative). If the null hypothesis is rejected, is incremented and a new test is run. This proceeds until the null hypothesis is not rejected or the maximum model order is reached. The disadvantage of hypothesis tests is that they require the subjective selection of a probability of false alarm. This can be avoided by using information-theoretic criteria (ICs) (e.g., ), which compute a score as a function of model order. This score is the difference between the likelihood for the observed data, which measures how well the model fits the observed data, and a penalty function. With increasing order there is an increasing number of free parameters, and so the model fit becomes better. In order to avoid overfitting, complex models are penalized by the penalty function, which increases with model order. The best trade-off is achieved when the difference of likelihood and penalty function is maximized. It should be noted that the GLRT and IC methods for model-order selection are actually closely linked —a fact that we will exploit, as well.
In this paper, we present approaches to the joint model-order selection in a PCA-CCA setup based on reduced-rank versions of both the Bartlett-Lawley hypothesis test and the minimum description length (MDL) IC . As far as we know, these are currently the only techniques capable of handling the combined PCA-CCA approach in the sample-poor regime. An early version of this paper was presented at ICASSP 2015 .
We would also like to contrast our work with so-called sparse CCA (e.g., [25, 26]). In sparse CCA, a sparsity constraint is placed on the projectors and , which means that each canonical variable or is a linear combination of only a few components in and , respectively. While sparse CCA was not proposed to deal with the sample-poor scenario, in principle it can be used as an alternative to PCA-CCA if there is a priori information that the projectors are sparse. However, in many scenarios of interest (e.g., the applications in biomedicine, climate science, and array processing cited above) there is no justification to assume sparse projectors. When applied to non-sparse problems, sparse CCA will not work well.
Our program for this paper is as follows. In Section II, we formulate the problem and illustrate the issues that arise when performing CCA based on very small sample sizes and how a combined PCA-CCA approach can address these. We present our approaches based on the hypothesis test in Section III and based on the MDL-IC in Section IV. Extensive simulation results are shown in Section V.
Ii Problem Formulation
We observe independent and identically distributed (i.i.d.) sample pairs , that are drawn from the two-channel measurement model
The signals and are jointly Gaussian with zero means and cross-covariance matrix
where is the unknown standard deviation of signal component , the unknown standard deviation of signal component , and the unknown correlation coefficient between and . Hence, the first components of and are correlated, whereas the next (, ) components are independent between and . The correlated components may be stronger or weaker than the independent components. Without loss of generality, we assume the auto-covariance matrices and to be diagonal. The matrices and as well as the dimensions , , and are deterministic but unknown. Without loss of generality, and are assumed to have full column-rank. With all these assumptions, is the th canonical correlation coefficient between and . The noise vectors and are independent of each other, independent of the signals, zero-mean Gaussian, and with unknown (arbitrary) covariance matrices.
Compared to the dimensions and (which may be very large), we assume that there are only few correlated signals and only few independent signals with variance larger than the correlated signals (but there can be many independent signals with variance smaller than the correlated signals). However, because we do not assume that the mixing matrices and in (2) are sparse, the cross-covariance matrix between the observed vectors and is not sparse and sparse CCA is generally not suitable for this scenario.
We collect the sample pairs in data matrices and , from which we compute the sample covariance matrices , , and . In the case of small sample support, the sample canonical correlations , , , computed from the sample covariance matrices can be extremely misleading. It has been shown in  that when , at least sample canonical correlations will be identically one regardless of the two-channel model that generates the data samples. In such a small sample scenario, the ’s cannot be used to infer the number of correlated signals. But even in the case with greater (but not substantially greater) than the sample ’s are generally significantly overestimated. This is shown in Fig. 1, which displays the sample canonical correlations for a model of dimension , with correlated components and independent components for different sample sizes . Even for , where the number of samples is ten times the dimension of the system, the ’s for are quite wrong, and it is impossible from visual inspection to determine the number of correlated components.
This motivates the use of a rank-reduction preprocessing step. The most common type of preprocessing is PCA, and a combined PCA-CCA approach is the setup that we consider in our paper. So let us investigate what effect rank reduction has on the estimated canonical correlations. The PCA step retains those and components in and , respectively, that account for most of their total variance. These components can be computed as follows. We first determine the singular value decompositions (SVDs) of the data matrices and . Then the reduced-rank PCA descriptions of and are
where denotes the matrix containing the first columns of , which are associated with the largest singular values of , and denotes the matrix containing the first columns of , which are associated with the largest singular values of . Now let , , and be the sample covariance matrices from the reduced-dimensional PCA descriptions. The corresponding estimated canonical correlations may be computed as the singular values of the reduced-dimensional sample coherence matrix, which is 
The thus computed canonical correlations , , , depend on the ranks and . As seen in (4), the th estimated canonical correlation can be found as the th largest singular value of , where and are the matrices of right singular vectors of and , respectively. To avoid defective unit canonical correlations, we must choose and . This, however, does not tell us what the optimum choices for and are such that the ’s are as close to the true canonical correlations as possible.
Intuitively, it seems that and should be chosen large enough to capture as much of the correlated signal components as possible without including too much noise. If the correlated components are weaker than some of the independent components, this will inevitably mean that the PCA preprocessing step must also keep those stronger independent components. On the other hand, if the correlated components are also the strongest components, it would be better if the PCA step got rid of the independent components. Hence, without noise, would ideally be chosen between and , and between and . With noise, the ranks and may also fall outside of these ranges, depending on the properties of the noise and the relative strengths of the signals.
It can be shown using Cauchy’s interlacing theorem (the result is presented as Lemma 1 in Appendix A) that increasing the ranks of the PCA steps increases every estimated canonical correlation coefficient. Hence, choosing too large an or will lead to estimated canonical correlations that are greater, possibly significantly greater, than the true canonical correlations. On the other hand, if and are not large enough, then the rank-reduced representation does not contain all of the correlated components, and thus the estimated canonical correlations can be too small.
These considerations can be illustrated by the following example, where and . There are correlated signals (each with variance 1.5) and independent signals (each with variance 5). Since the independent signals are stronger than the correlated signals (and the numbers and of independent signals in and are identical), we would expect to be the optimum rank for the PCA step. Indeed, Fig. 2 shows that choosing greater than 5 leads to ’s that are too large, whereas less than 5 leads to ’s that are too small. While the exact relationships depend on the variances of signal and noise components and the correlation coefficients, the principle observed here generalizes to other settings.
Iii Order selection based on hypothesis test
Iii-a Traditional test
In the case of sufficient number of samples, the traditional hypothesis test [13, 14] for determining the number of correlated components between and is a series of binary hypothesis tests. Starting with , it tests the null hypothesis : versus the alternative hypothesis : . If is rejected, is incremented and a new test is run. This proceeds until is not rejected or is reached.
The binary test in [13, 14] is a generalized likelihood ratio test (GLRT) of vs. . For a given number of correlated signals, let denote the parameter space of the model, which consists of the auto- and cross-covariance matrices. The maximum value of the log-likelihood function for a given number of correlated signals, maximized over the parameter space , is 
Canonical correlation coefficients close to 1 are strong evidence of correlation between and and thus lead to large . Now let denote the parameter space of all models where the assumed number of correlated signals is greater than . The generalized log-likelihood ratio for testing vs. is 
where the second identity follows from the fact that the maximum of the likelihood function, under the constraint , occurs when the model has the most degrees of freedom, i.e., for . The cross-covariance matrix has degrees of freedom . Wilks’ theorem  says that is asymptotically (as ) -distributed with degrees of freedom equal to the difference of the dimensions of the parameter spaces and :111In this difference, only the degrees of freedom associated with the cross-covariance matrix matter.
This correction makes the moments of the test statistic equal to the moments of the -distribution. As long as is large compared to and , the statistic is generally very close to a -distribution. Note that this is independent of the covariance matrix of the noise, since it is not used anywhere in the derivation. This allows computation of a test threshold for a given probability of false alarm.
Iii-B Test with PCA preprocessing
Instead of running the test directly on and , we would like to apply the test to the reduced-rank PCA descriptions and obtained in (3). By performing PCA on and , we create a new reduced-rank two-channel model:
In this model, the new matrices and have full rank because and are assumed to have full rank. With the PCA preprocessing the GLRT statistic is
and the Bartlett-Lawley statistic is
for with . The challenge in the reduced-rank version of the hypothesis test is thus to jointly determine the best ranks of the PCA steps and the number of correlated signals. As long as the number of samples is large compared to the minimum PCA dimension but and are not too small (which we will explain in the next paragraph), the new test statistic under is still approximately -distributed with degrees of freedom. We denote by the largest for which the -distribution holds well enough. Of course, requiring to be large with respect to is a much more relaxed condition than requiring to be large with respect to the dimensions and . This is because and do not have to be chosen greater (unless there are strong noise components) than and , respectively, which are usually much smaller than and .
There is, however, a complication. By applying PCA to and , we might eliminate some of the correlated components if the PCA ranks and are not chosen large enough. If this is the case, then the number of correlated components in the reduced-rank descriptions and will be smaller than the number of correlated components between and . As a consequence, will no longer resemble a -distribution. Instead, with will now be approximately . By choosing and not large enough it thus becomes likely that the null hypothesis “there are correlated signals” is not rejected, thus deciding for a smaller number than the true .
We are now getting closer to writing down a rule for jointly selecting , , and . In order to motivate this rule, we summarize the preceding discussion: Provided the PCA ranks and are chosen sufficiently large to capture all correlated components while is still small compared to , i.e., , the statistic in (11) is approximately (again irrespective of the noise covariance matrix). This means that in a series of binary tests of vs. (testing all values of starting from until is not rejected or the maximum is reached) would generally not be overestimated. It is likely, however, to be underestimated, if and are not chosen large enough. If and are too small, then the reduced-rank PCA descriptions do not capture all of the correlated components and thus the series of binary tests would decide for too small a . This reasoning motivates the following decision rule.
Detector 1 (“max-min detector”): Choose
and choose the and that lead to as the PCA ranks. In (12) the -operator chooses the smallest such that the statistic falls below the threshold , which ensures a given probability of false alarm. If there is no such , then it chooses . This step is similar to the traditional test, except that depends on and . The rule (12) is based on the fact that if and are not chosen optimally, the min-step might return a number smaller than . Hence, the min-step is performed for all and from 1 up to , and the maximum result is chosen as .
We will use an example to illustrate both the closeness of the -approximation and the idea of the max-min detector. We consider a scenario with , correlated signals, stronger interfering signals, and samples. The noise variance is chosen small compared to the signal variances. For and , Fig. 3 compares histograms of the statistic with the probability density function of a -distribution with degrees of freedom. As long as is large enough to capture all correlated components (which is the case for since the independent signals are stronger than the correlated signals) but small compared to , the statistic is very well approximated by the -distribution. This can be seen in subplots (b) and (c) (where we start to notice some divergence between the statistic and its approximation). Subplot (d) shows , which is not small enough with respect to . Here the -distribution is no longer a good approximation of the test statistic.
On the other hand, if then the PCA step eliminates some correlated components. This can be observed in subplot (a) for , where the histogram of does not approximate a -distribution. Because the PCA steps with keep the two stronger independent signals and only two of the three weaker correlated signals, the reduced-rank PCA descriptions and only have correlated signals rather than . It can be observed in Fig. 4 (b) that indeed well approximates a -distribution with degrees of freedom.
So let us look at how the max-min detector would proceed in this example. To illustrate this, we again consider Figs. 3 and 4, which compare histograms of with -distributions with degrees of freedom for (Fig. 3) and (Fig. 4). Also shown in these figures are the thresholds for a probability of false alarm . According to (12), for given and , the detector needs to find the minimum (between and ) such that the statistic falls below the threshold . Consider first , which is too small because the PCA steps eliminate one correlated component. From Fig. 4 (b), we see that it is likely that falls below , which means that for , the min-step of the detector would likely return too small a number of correlated signals ().222If we plotted the test statistics and thresholds also for and we would see that it is unlikely that a value would be chosen.
Now consider , which is large enough so that the PCA steps capture all correlated components. It can now be observed in Fig. 4 (c) that for , will likely not fall below .333As before, if we plotted the test statistics also for and , we would see that it is even less likely that falls below if . On the other hand, Fig. 3 (b) shows that it is likely that falls below , hence returning in the min-step of the detector.
Finally, consider , which is larger than needed to capture all correlated components. If and are too large then it becomes increasingly difficult, as can be observed in Fig. 2, to distinguish between the sample correlation coefficients that are associated with the correlated signals and those that are not. The min-step of the detector would still not generally overestimate (because the -approximation remains valid under ) but it might underestimate it. This becomes clear from looking at Fig. 4 (d), which shows that there is a rather high chance that the min-step would select . However, an underestimating min-step is not a problem for the max-min detector because it selects the maximum of all min-step results.
Iv Order selection based on information-theoretic criterion
A disadvantage of the hypothesis testing approach to order selection is the requirement of selecting a probability of false alarm . Setting too high will lead to a detector that tends to overfit, setting it too low will generally underfit. Achieving the best performance thus requires the right trade-off. In this section, we present two alternative approaches that do not require the manual selection of a threshold and are based on the minimum description length (MDL)-IC. The first approach will remain a hypothesis test but with automatic -selection exploiting a link between the GLRT and the IC for model-order selection. The second approach will be a max-min detector based directly on the MDL-IC.
Iv-a Setting the threshold based on the MDL-IC
The MDL-IC for selecting the number of correlated signals in two data sets (without PCA steps) is 
In this expression, the second term is the penalty term that depends on the degrees of freedom of the model444As before, only the degrees of freedom associated with the cross-covariance matrix are considered because the degrees of freedom associated with the auto-covariance matrices do not depend on . Hence, they do not matter in the following optimization problems. and thus penalizes overly complex models. The model order chosen is the value of for which is minimized. The reduced-rank version of (13), which accounts for the PCA steps, is
As has been noted in , there is the following connection between the MDL-IC and the log-likelihood ratio of the reduced-rank GLRT vs. :
with . When choosing between model orders and based on the MDL-IC, we decide for model order if . Because of (15) we can implement this decision rule also based on the GLRT. We decide for model order rather than a model order greater than if
The term on the right-hand side of this inequality is thus the threshold for the GLRT, which is determined based on the MDL-IC. Note that it is unnecessary to apply the Bartlett-Lawley correction because this would amount to multiplying both sides of the inequality (16) with the same factor. Thus, we obtain the following max-min decision rule in terms of rather than .
Detector 2 (max-min detector with threshold set by MDL-IC): Choose
Iv-B Min-MDL detector
Another approach that does not require the selection of applies the max-min idea directly to the MDL-IC. Let us first write down the decision rule and interpret it afterwards.
Detector 3 (“max-min MDL-IC detector”): Choose
and choose the and that lead to as the PCA ranks.
In order to understand this detector, we note, based on the discussion in the preceding subsection, that
The min-step in Detector 3 thus chooses the value of that maximizes the difference between the GLRT statistic and the MDL test threshold. This is different than the min-step in Detector 2, which picks the smallest for which the test statistic exceeds the threshold. Therefore, Detector 3 will never pick a smaller, but possibly larger, than Detector 2.
V Performance evaluation
In this section, we compare the performance of our three model-order selection schemes among each other and with competing approaches. In the absence of a competing systematic approach to the joint model-order selection in PCA-CCA, we determined the PCA ranks and separately from the number of correlated signals . We used the sample eigenvalue-based (SEV) technique  for selecting and because it is one of the few techniques that can handle the sample-poor case for a single channel. For the selection of we used the canonical correlation test (CCT)  with , the Akaike information criterion (AIC) , and the MDL criterion .
Figures 5–12 show the probability of selecting the correct for different setups. In the first setup, shown in Figs. 5–7, we consider a system with correlated signals (each with variance and correlation coefficients and ), and and independent signals (each with variance ). The matrices and are randomly generated unitary matrices. For each data point, we ran 1000 independent Monte Carlo trials.
We first consider a system with fixed dimension . In Fig. 5, we show the performance as a function of the number of samples when the noise is white and each noise component has unit variance. We see that the performance of Detector 1 depends on the choice of : For smaller , should be chosen larger, whereas for larger , a smaller performs better. Detector 2 does this trade-off automatically and performs very well even for very small sample sizes. All other approaches (including Detector 3) still perform quite well but require larger sample support.
The picture completely changes when we have colored rather than white noise. We now generate the noise from a spatially varying moving average (MA) process of order with coefficients . Before the spatial averaging, the noise components have variance . It can be seen in Fig. 6 that methods that select and separately from completely fail. This is because a single-channel technique such as SEV cannot distinguish between signal and noise eigenvalues if the noise is colored. The performance of our detectors, on the other hand, is actually improved particularly for very small sample sizes.
In Fig. 7, we reconsider the white noise case but with varying dimensions and fixed sample size . When the noise is independent in space and time, increasing the ratio of the data dimensions , to the number of samples shrinks the signal-subspace , which worsens the detection performance. We note, however, that the decrease in performance affects the SEV + X techniques much more than our detectors. Indeed, Detector 2 again shows a very reliable performance even for large dimensions. The main reason behind this effect is that the SEV technique is designed to keep all the signal components (i.e., correlated and independent components) whereas our detectors aim to eliminate weaker independent components in the PCA step. The presence of independent components deteriorates the detection performance of the subsequent CCA step.
So far, we have looked at a case where the correlated signals are stronger than the independent signals. We now investigate what happens when the correlated signals are weaker than some or all of the independent signals. First consider a scenario with 2 independent signals of varying variance and 7 correlated signals of variance 10. Figure 8 shows the probability of detection as a function of the independent signals’ variance. We see that the variance has only little effect on the performance of all techniques.
Now we increase the number of independent signals to 4, leaving all other settings unchanged. The most dramatic effect that can be observed in Fig. 9 is the failure of Detector 2 once the independent signals reach a variance close to the correlated signals’ variance. This may be explained as follows. Detector 2 sets its threshold based on MDL, which generally does not overestimate the number of correlated signals, but may underestimate it if the sample size is not sufficiently large compared to the system dimension (i.e., the PCA rank). In the case shown in Fig. 9, there are 7 correlated signals and 4 independent signals. Once the independent signals become as strong as or stronger than the correlated signals, this leads to an optimum PCA rank of 11. As the number of samples is not significantly larger than 11, MDL starts to underestimate the model order. This affects Detector 2 more severely than Detector 3 because Detector 3 will always return a model as large as, but possibly larger than, Detector 2 (see the discussion in Section IV-B).
This explanation can be validated by investigating the effect of the number of samples in a scenario where there are strong independent signals. We now consider a case with correlated signals with variance 8, and independent signals, two of which have variance 12 and 5 of which have variance 3. In Fig. 10, we look at the performance as a function of the number of samples . It can be observed that, among our three detectors, Detector 2 needs the largest number of samples for satisfactory performance, followed by Detector 3. The lesson that can be learned here is that in the presence of strong independent signals, Detector 1 should be preferred if only a very small number of samples are available.
Let us now investigate the effect that the value of the correlation coefficients among the correlated signals have. Here we consider a scenario with correlated signals with variance 8, and stronger independent signals of variance 10. In Fig. 11, we plot the performance as function of . The correlation coefficients for the 5 correlated signals are drawn from a uniform distribution between . As expected, stronger correlation leads to better performance. Since the independent signals are stronger than the correlated signals, Detector 1 outperforms Detector 3, which in turn outperforms Detector 2. All of our detectors outperform the competition.
In our last example we examine an array processing toy application to see what happens if the mixing matrices and become ill-conditioned. We consider two spatially separated uniform linear arrays (ULAs) with 40 sensors (i.e., ) and inter-sensor spacing of , which take samples. There are fixed point-sources in the far-field emitting narrow-band Gaussian signals at wavelength , which impinge upon ULA 1 at angles . Similarly, such signals impinge upon ULA 2 at angles . Two of these signals are correlated between ULAs 1 and 2 (i.e., , , ) with correlation coefficients and . The correlated signals each have variance and the independent signals each have variance . The noise is colored and generated as in the setup for Fig. 6.
With these assumptions, the th column of is , , and the th column of is