Structured Sparse Principal Components Analysis with the TVElastic Net penalty
Abstract
Principal component analysis (PCA) is an exploratory tool widely used in data analysis to uncover dominant patterns of variability within a population. Despite its ability to represent a data set in a lowdimensional space, PCA’s interpretability remains limited. Indeed, the components produced by PCA are often noisy or exhibit no visually meaningful patterns. Furthermore, their high density may also impede interpretation, unless arbitrary thresholding is used. However, in neuroimaging, it is essential to uncover clinically interpretable phenotypic markers that would account for the main variability in the brain images of a population. Recently, some alternatives to the standard PCA approach, such as Sparse PCA, have been proposed, their aim being to limit the density of the components. Nonetheless, sparsity alone does not entirely solve the interpretability problem in neuroimaging, since it may yield scattered and unstable components. We hypothesized that the incorporation of prior information regarding the structure of the data may lead to improved relevance and interpretability of brain patterns. We therefore present a simple extension of the popular PCA framework that adds structured sparsity penalties on the loading vectors in order to identify the few stable regions in the brain images accounting for most of the variability. Such structured sparsity can be obtained by combining e.g., and total variation (TV) penalties, where the TV regularization encodes higher order information about the structure of the data. This paper presents the structured sparse PCA (denoted SPCATV) optimization framework and its resolution. We demonstrate SPCATV’s efficiency and versatility on three different data sets. It can be applied to any kind of structured data, such as e.g., dimensional array images or meshes of cortical surfaces. The gains of SPCATV over unstructured approaches are significant, since SPCATV reveals the variability within a data set in the form of intelligible brain patterns that are easy to interpret, and are more stable across different samples.
I Introduction
Principal components analysis (PCA) is an unsupervised statistical procedure whose aim is to capture dominant patterns of variability in order to provide an optimal representation of a data set in a lowerdimensional space defined by the principal components (PCs). Given a data set of samples and centered variables, PCA aims to find the most accurate rank approximation of the data:
(1)  
where is the Frobenius norm of a matrix, are the loading vectors (right singular vectors) that define the new coordinate system where the original features are uncorrelated, is the diagonal matrix of the singular values, and are the projections of the original samples in the new coordinate system (called principal components (PCs) or left singular vector). Using components leads to the singular value decomposition (SVD). A vast majority of neuroimaging problems involve highdimensional feature spaces ( features i.e. voxels, mesh vertices) with a relatively limited sample size ( participants). With such “large , small ” problems, the SVD formulation, based on the data matrix, is much more efficient than an eigenvalue decomposition of the large covariance matrix.
In a neuroimaging context, our goal is to discover the phenotypic markers accounting for the main variability in a population’s brain images. For example, when considering structural images of patients that will convert to Alzheimer disease (AD), we are interested in revealing the brain patterns of atrophy explaining the variability in this population. This provides indications of possible stratification of the cohort into homogeneous subgroups that may be clinically similar but with a different pattern of atrophy. This could suggest different subtypes of patients with AD or some other etiologies such as dementia with Lewy bodies. Clustering methods might be natural approaches to address such situations, however, they can not reveal subtle differences that go beyond a global and trivial pattern of atrophy. Such patterns are usually captured by the first component of PCA which, after being removed, offers the possibility to identify spatial patterns on the subsequent components. However, PCA provides dense loading vectors (patterns), that cannot be used to identify brain markers without arbitrary thresholding.
Recently, some alternatives propose to add sparsity in this matrix factorization problem. The sparse dictionary learning framework proposed by [21] provides a sparse coding (rows of ) of samples through a sparse linear combination of dense basis elements (columns of ). However, the identification of biomarkers requires a sparse dictionary (columns of ). This is precisely the objective of Sparse PCA (SPCA) proposed in [17, 28, 10, 26, 18] which adds a sparsityinducing penalty on the columns of .
However, sparse PCA is limited by the fact that it ignores the inherent spatial correlation in the data. It leads to scattered patterns that are difficult to interpret. Furthermore, constraining only the number of features included in the PCs might not always be fully relevant since most data sets are expected to have a spatial structure. For instance, MRI data is naturally encoded on a grid; some voxels are neighbors, while others are not. We hypothesize that brain patterns are organized in regions rather than scattered across the brain. Recent studies tried to overcome this limitation by encoding prior information concerning the spatial structure of the data (see [16, 14, 25]). However, they used methods that are difficult to plug into the optimization scheme (e.g., spline smoothing, wavelet smoothing) and incorporated prior information that sometimes may be difficult to define. In data classification problems, when extracting structured and sparse predictive maps, the goals are largely aligned with those of PCA. Some classification studies have revealed stable and interpretable results by adding a total variation (TV) penalty to the sparsity constraint (see [13]). TV is widely used as a tool in image denoising and reconstruction. It accounts for the spatial structure of images.
We propose to extend the PCA of Equation 1 by adding structured sparsityinducing penalties (TV and ) to the minimization problem:
(2)  
where , and are hyperparameters controlling the relative strength of each penalty. We further propose a generic optimization framework that can combine any differentiable convex (penalized) loss function with: (i) penalties whose proximal operator is known (here ) and (ii) a large range of complex, nonsmooth convex structured penalties that can be formulated as a norm defined over a set of groups . Such grouppenalties cover e.g., total variation and overlapping group lasso. This new problem aims at finding a linear combination of original variables that points in directions explaining as much variance as possible in data while enforcing sparsity and structure (smoothness for TV) of the loadings.
To achieve this, it is necessary to sacrifice some of the explained variance as well as the orthogonality of both the loading and the principal components. Most existing SPCA algorithms [28, 10, 26, 18], do not impose orthogonal loading directions either. While we forced the components to have unit norm for visualization purposes, we do not, in this formulation, enforce . Instead, the value of is controlled by the hyperparameter . This penalty on the loading, together with the unit norm constraint on the component, prevents us from obtaining trivial solutions. The optional factor acts on and conveniently normalizes the loss to account for the number of samples in order to simplify the settings of the hyperparameters: .
This paper presents an extension of the popular PCA framework by adding structured sparsityinducing penalties on the loading vectors in order to identify the few stable regions in the brain images accounting for most of the variability. The addition of a prior that reflects the data’s structure within the learning process gives the paper a scope that goes beyond Sparse PCA. To our knowledge, very few papers ([1, 14, 16, 25]) addressed the use of structural constraint in PCA. Only one study, recently used the total variation prior ([1]), in a context of multisubject dictionary learning, based on a different optimization scheme ([5]).
Section II presents our main contribution: a simple optimization algorithm that combines well known methods (deflation scheme and alternate minimization) with an original continuation algorithm based on Nesterov’s smoothing technique. Our proposed algorithm has the ability to include the TV penalty, but many other nonsmooth penalties, such as e.g. overlapping group lasso, could also be used. This versatile mathematical framework is an essential feature in neuroimaging. Indeed, it enables a straightforward application to all kinds of data with known structure such as Ndimensional images (of voxels) or meshes of (cortical) surfaces. Section III demonstrates the relevance of structured sparsity on both simulated and experimental data, for structural and fMRI acquisitions. Structured sparsity achieved a higher reconstruction accuracy and more stable solutions than classical elastic net or Sparse PCA.
Ii Method
A common approach to solve the PCA problem, see [10, 18, 26]), is to compute a rank1 approximation of the data matrix, and then repeat this on the deflated matrix [19], where the influence of the PCs are successively extracted and discarded. We first detail the notations of the problem for a single component estimation (Section IIA), and its solution using an alternating minimization pipeline (Section IIB). Then, we develop the TV regularization framework (Section IIC and Section IID). Lastly, we will discuss the algorithm used to solve the minimization problem and its ability to converge toward stable pairs of components/loading vectors (Section IIE) and (Section IIF).
Iia Single component computation
Given a a pair of loading/component vectors, , the best rank1 approximation of the problem given in Equation I is equivalent [26] to:
(3)  
where is the penalized smooth (i.e. differentiable) loss, is a sparsityinducing penalty whose proximal operator is known and is a complex penalty on the structure of the input variables with an unknown proximal operator. This problem is convex in and in but not in .
IiB Alternating minimization of the biconvex problem
The objective function to minimize is biconvex [8]. The most common approach to solve a biconvex optimization problem (which does not guarantee global optimality of the solution) is to alternatively update and by fixing one of them at the time and solving the corresponding convex optimization problem on the other parameter vector.
On the one hand, when is fixed, the problem to solve is
(4)  
with the associated explicit solution
(5) 
On the other hand, solving the equation with respect to with a fixed presents a higher level of difficulty that will be discussed in Section IIE.
IiC Reformulating TV as a linear operator
Before discussing the minimization with respect to , we provide details on the encoding of the spatial structure within the penalty. It is essential to note that the algorithm is independent of the spatial structure of the data. All the structural information is encoded in a linear operator, , that is computed outside of the algorithm. Thus the algorithm has the ability to address various structured data and, most importantly, other penalties than just the TV penalty. The algorithm requires the setting of two parameters: (i) the linear operator ; (ii) a projection function detailed in Equation 12. This section presents the formulation and the design of in the specific case of a TV penalty on the loading vector measured on a 3dimensional image or a mesh of the cortical surface.
IiC1 3D image
The brain mask is used to establish a mapping between the coordinates in the 3D grid, and an index in the collapsed image. We extract the spatial neighborhood of , of size , corresponding to voxel and its 3 neighboring voxels, within the mask, in the and directions. By definition, we have
(6) 
The first order approximation of the spatial gradient is computed by applying the linear operator to the loading vector in the spatial neighborhood of , i.e.
(7) 
where is the loading coefficient at index in the collapsed image corresponding to voxel in the 3D image. Then is extended, using zeros, to a large but very sparse matrix in order to be directly applied on the full vector . If some neighbors lie outside the mask, the corresponding rows in are removed. Noticing that for TV there is one group per voxel in the mask (), we can reformulate TV from Equation 6 using a general expression:
(8) 
Finally, with a vertical concatenation of all the matrices, we obtain the full linear operator that will be used in Section IIE.
IiC2 Mesh of cortical surface
The linear operator used to compute a first order approximation of the spatial gradient can be obtained by examining the neighboring vertices of each vertex . With common triangletessellated surfaces, the neighborhood size is (including ). In this setting, we have , which can be extended and concatenated to obtain the full linear operator .
IiD Nesterov’s smoothing of the structured penalty
We consider the convex nonsmooth minimization of Equation 3 with respect to , where thus is fixed. This problem includes a general structured penalty, , that covers the specific case of TV. A widely used approach when dealing with nonsmooth problems is to use methods based on the proximal operator of the penalties. For the penalty alone, the proximal operator is analytically known and efficient iterative algorithms such as ISTA and FISTA are available (see [4]). However, since the proximal operator of the TV penalty is not known to have an analytical solution, standard implementation of those algorithms is not suitable. In order to overcome this barrier, we used Nesterov’s smoothing technique [22]. It consists of approximating the nonsmooth penalties for which the proximal operator is unknown (e.g., TV) with a smooth function (of which the gradient is known). Nonsmooth penalties with known proximal operators (e.g., ) are not affected. Hence, as described in [27], it allows to use an exact accelerated proximal gradient algorithm. Thus we can solve the PCA problem penalized by TV and elastic net, where an exact penalty is used.
Using the dual norm of the norm (which happens to be the norm again), Equation 8 can be reformulated as
(9) 
where is a vector of auxiliary variables, in the unit ball, associated with . As with which is the vertical concatenation of all the , we concatenate all the to form the . is the Cartesian product of 3D unit balls in Euclidean space and, therefore, a compact convex set. Equation 9 can further be written as
(10) 
Given this formulation of , we can apply Nesterov’s smoothing. For a given smoothing parameter, , the function is approximated by the smooth function
(11) 
for which . Nesterov [22] demonstrates this convergence using the inequality in Equation 15. The value of that maximizes Equation 11 is the concatenation of projections of vectors to the ball (): , where
(12) 
The function , i.e. the Nesterov’s smooth transform of , is convex and differentiable. Its gradient given by [22]
(13) 
which is Lipschitzcontinuous with constant
(14) 
where is the matrix spectral norm of . Moreover, Nesterov [22] provides the following inequality relating and
(15) 
where .
Thus, a new (smoothed) optimization problem, closely related to Equation 3 (with fixed ), arises from this regularization as
(16) 
Since we are now able to explicitly compute the gradient of the smooth part (Equation IID), its Lipschitz constant (Equation 19) and also the proximal operator of the nonsmooth part, we have all the ingredients necessary to solve this minimization function using an accelerated proximal gradient methods [4]. Given a starting point and a smoothing parameters , FISTA (Algorithm 1) minimizes the smoothed problem and reaches a prescribed precision .
However, in order to control the convergence of the algorithm (presented in Section IIE1), we introduce the Fenchel dual function and the corresponding dual gap of the objective function. The Fenchel duality requires the loss to be strongly convex, which is why we further reformulate Equation 16 slightly: All penalty terms are divided by and by using the following equivalent formulation for the loss, we obtain the minimization problem
L  (17) 
This new formulation of the smoothed objective function (noted ) preserves the decomposition of into a sum of a smooth term and a nonsmooth term . Such decomposition is required for the application of FISTA with Nesterov’s smoothing. Moreover, this formulation provides a decomposition of into a sum of a smooth loss and a penalty term required for the calculation of the gap presented in Section IIE1)
IiE Minimization of the loading vectors with CONESTA
The step size computed in Line 3 of Algorithm 1, depends on the smoothing parameter (see Equation 19). Hence, there is a tradeoff between speed and precision. Indeed, high precision, with a small , will lead to a slow convergence (small ). Conversely, poor precision (large ) will lead to rapid convergence (large ). Thus we propose a continuation approach (Algorithm 2) which decreases the smoothing parameter with respect to the distance to the minimum. On the one hand, when we are far from (the minimum of Equation 17), we can use a large to rapidly decrease the objective function. On the other hand, when we are close to , we need a small in order to obtain an accurate approximation of the original objective function.
IiE1 Duality gap
The distance to the unknown is estimated using the duality gap. Duality formulations are often used to control the achieved precision level when minimizing convex functions. They provide an estimation of the error , for any , when the minimum is unknown. The duality gap is the cornerstone of the CONESTA algorithm. Indeed, it is used three times:

In the th CONESTA iteration, as a way to estimate the current error (Line 7 in Algorithm 2). The error is estimated using the gap of the smoothed problem which avoid unnecessary computation since it has already been computed during the last iteration of FISTA. The inequality in Equation 15 is used to obtain the gap to the original nonsmoothed problem. The next desired precision and the smoothing parameter, , are derived from this value.
Based on Equation 17, which decomposes the smoothed objective function as a sum of a strongly convex loss and the penalty,
we compute the duality gap that provides an upper bound estimation of the error to the optimum. At any step of the algorithm, given the current primal and the dual variables [7], we can compute the duality gap using the Fenchel duality rules [20]:
(20) 
where and are respectively the Fenchel conjugates of and . Denoting by the minimum of (solution of Equation 17), the interest of the duality gap is that it provides an upper bound for the difference with the optimal value of the function. Moreover, it vanishes at the minimum:
(21) 
The dual variable is
(22) 
the Fenchel conjugate of the squared loss is
(23) 
In [15] the authors provide the expression of the Fenchel conjugate of the penalty :
(24) 
where .
The expression of the duality gap in Equation 20 provides an estimation of the distance to the minimum. This distance is geometrically decreased by a factor at the end of each continuation, and the decreased value defines the precision that should be reached by the next iteration (Line 8 of Algorithm 2). Thus, the algorithm dynamically generates a sequence of decreasing prescribed precisions . Such a scheme ensures the convergence [15] towards a globally desired final precision, , which is the only parameter that the user needs to provide.
IiE2 Determining the optimal smoothing parameter
Given the current prescribed precision , we need to compute an optimal smoothing parameter (Line 9 in Algorithm 2) that minimizes the number of FISTA iterations needed to achieve such precision when minimizing Equation 3 (with fixed ) via Equation 17 (i.e., such that ).
In [15], the authors provide the expression of this optimal smoothing parameter:
(25) 
where (Equation 15) and is the Lipschitz constant of the gradient of as defined in Equation 17.
We call the resulting algorithm CONESTA (short for COntinuation with NEsterov smoothing in a ShrinkageThresholding Algorithm). It is presented in detail, with convergence proofs in [15].
Let be the total number of FISTA loops used in CONESTA, then we have experimentally verified that the convergence rate to the solution of Equation 16 is (which is the optimal convergence rate for firstorder methods). Also, the algorithm works even if some of the weights or are zero, which thus allows us to solve e.g., the elastic net using CONESTA. Note that it has been rigorously proved that the continuation technique improves the convergence rate compared to the simple smoothing using a single value of . Indeed, it has been demonstrated in [6] (see also [27]) that the convergence rate obtained with single value of , even optimised, is . However, it has recently been proved in [15] that the CONESTA algorithm achieves a for general convex functions.
We note that CONESTA could easily be adapted to many other penalties. For example, to add the group lasso (GL) constraint to our structure, we just have to design a specific linear operator and concatenate it to the actual linear operator .
IiF The algorithm for the SPCATV problem
The computation of a single component through SPCATV can be achieved by combining CONESTA and Equation 5 within an alternating minimization loop. Mackey [19] demonstrated that further components can be efficiently obtained by incorporating this singleunit procedure in a deflation scheme as done in e.g. [10, 18]. The stopping criterion is defined as
(26) 
All the presented building blocks were combined into Algorithm 3 to solve the SPCATV problem.
Iii Experiments
We evaluated the performance of SPCATV using three experiments: One simulation study carried out on a synthetic data set and two on neuroimaging data sets. In order to compare the performance of SPCATV with existing Sparse PCA models, we also included results obtained with Sparse PCA and ElasticNet PCA. We used the scikitlearn implementation [23] for the Sparse PCA while we used the Parsimony package (https://github.com/neurospin/pylearnparsimony) for the ElasticNet PCA and the SPCATV methods. The number of parameters to set for each method is different: ElasticNet PCA requires the setting of the and the penalties weights. Meanwhile, SPCATV requires the settings of an additional parameter; the spatial constraint penalty . We operated a reparametrization of these penalty weights in ratios. A global parameter controls the weight attributed to the whole penalty term, including the TV and the regularization. Individual constraints are expressed in terms of ratios: the ratio: and the TV ratio : . For each of these methods, we tested several ranges of parameters, and retained the combination of parameters leading to the lowest reconstruction error obtained on an independent test set. However, in order to ensure that the components extracted have a minimum amount of sparsity, we also included a criteria controlling sparsity: At least half of the features of the second and third components have to be zero.
For both real neuroimaging experiments, performance was evaluated through a resampling procedure using a 5fold crossvalidation (5CV). However, for the synthetic data set, performance was evaluated on 50 different purposelygenerated data sets. In order to assess the reconstruction accuracy, we reported the mean Frobenius norm of the reconstruction error across the folds/data sets, on independent test data sets. The hypothesis we wanted to test was whether there was a substantial decrease in the reconstruction error when using SPCATV compared to when using Sparse PCA or ElasticNet PCA. It was tested through a related two samples test.
The TV penalty has a more important purpose than just to minimize the reconstruction error: the estimation of coherent and reproducible loadings. Indeed, clinicians expect that, if images from other patients with comparable clinical conditions had been used, the extracted loading vectors would have turned out to be similar. The stability of the loading vectors obtained across various training data sets (variation in the learning samples) was assessed through a similarity measure: the pairwise Dice index between loading vectors obtained with different folds/data sets [11]. We tested whether pairwise Dice indices are significantly higher in SPCATV compared to the Sparse PCA and ElasticNet PCA methods. Testing this hypothesis is equivalent to testing the sign of the difference of pairwise Dice indices between methods. However, since the pairwise Dice indices are not independent from one another (the folds share many of their learning samples), the direct significance measures are biased. We therefore used permutation testing to estimate empirical values. The null hypothesis was tested by simulating samples from the null distribution. We generated random permutations of the sign of the difference of pairwise Dice index between the PCA methods under comparisons, and then the statistics on the true data were compared with the ones obtained on the permuted ones to obtain empirical values.
Iiia Simulation study
We generated 50 sets of synthetic data, each composed of 500 images of size pixels. Images are generated using the following noisy linear system :
(27) 
where are sparse and structured loading vectors, illustrated in Figure 1. The support of defines the two upper dots, the support of defines the two lower dots, while ’s support delineates the middle dot. The coefficients that linearly combine the components of V are generated according to a centered Gaussian distribution. The elements of the noise vector are independent and identically distributed according to a centered Gaussian distribution with a 0.1 signaltonoise ratio (SNR). This SNR was selected by a previous calibration pipeline, where we tested the efficiency of data reconstruction at multiple SNR ranges running from 0 to 0.5. We decided to work with a 0.1 SNR because it is located in the range of values where standard PCA starts being less efficient in the recovery process.
We split the 500 artificial images into a test and a training set, with 250 images in each set and with the decomposition learned on the training set. The optimal number of components was selected according to the explained variance accounted for by each component. We decided to retain the information accounted for by the first three components, since the next PCs offers very little increase in the total explained variance. This choice seems reasonable in our attempt to recover the underlying support of the three loading vectors.
Figure 2 represents the loading vectors extracted with one data set. We observe that Sparse PCA yields very scattered loading vectors. The loading vectors of SPCATV, on the other hand, are sparse; but also organized in clear regions. SPCATV provides loading vectors that closely match the ground truth. The reconstruction error is evaluated on the test sets (Table I), with its value over the 50 data sets being significantly lower in SPCATV than in both Sparse PCA (, ) and ElasticNet PCA (, ) methods. A different way of quantifying the reconstruction accuracy for each method is to evaluate how closely the extracted loadings match the known ground truth of simulated data set. We computed the mean squared error (MSE) between the ground truth and the estimated loadings. The results are presented in Table I. We note that the MSE is significantly lower with SPCATV than with both Sparse PCA (, ) ElasticNet PCA (, ) .
Moreover, when evaluating the stability of the loading vectors across resampling, we found a higher statistically significant mean Dice index when using SPCATV compared to the other sparse methods (). The results are presented in Table I. They indicate that SPCATV is more robust to variation in the learning samples than the other sparse methods. SPCATV yields reproducible loading vectors across data sets.
Scores  

Methods  Reconstruction error  MSE  Dice Index 
Sparse PCA  1577.6***  0.90***  0.28*** 
ElasticNet PCA  1577.0***  0.83***  0.26*** 
SPCATV  1575.5  0.62  0.54 
These results indicate that the loadings are not only more stable across resampling but also achieve a better recovery of the underlying variability in data than the Sparse PCA and ElasticNet PCA methods.
IiiB 3D images of functional MRI of patients with schizophrenia
We then applied the methods on an fMRI data set composed of 23 patients experiencing hallucinations while lying in the scanner. We computed activation maps from the scans preceding hallucinations by regressing for each block the signal time course on a linear ramp function: Indeed, we hypothesized that activation in some regions presents a ramplike increase during the time preceding the onset of hallucinations. The activation maps that we used as an input to the SPCATV method are the statistical parametric maps associated to the coefficients of the block regression. We obtained a data set of maps and features. We hypothesized that the principal components extracted with SPCATV from these activation maps could uncover major trends of variability within prehallucination patterns. Thus, they might reveal the existence of subgroups of patients, according to the sensory modality involved during hallucinations. We decided to retrieve the first three components, accounting for 15% of the total variance. Since the next PCs offer very little increase in the total explained variance, we stopped retrieving components. The loading vectors extracted from the activation maps of prehallucinations scans with Sparse PCA and SPCATV are presented in Figure 3. We observe a similar behavior as in the synthetic example, namely that the loading vectors of Sparse PCA tend to be scattered and produce irregular patterns. However, PCATV seems to yield structured and smooth sources of variability, which can be interpreted clinically. Furthermore, the SPCATV loading vectors are not redundant and revealed different patterns.
Indeed, the loading vectors obtained by SPCATV are of great interest because they revealed insightful patterns of variability in the data: the second loading is composed of interesting areas such as the precuneus cortex and the cingulate gyrus, but also areas related to visionprocessing areas such as the occipital fusiform gyrus and the parietal operculum cortex regions. The third loading reveals important weights in the middle temporal gyrus, the parietal operculum cortex and the frontal pole. The first loading vectors is less informative since it appears to encompass all features, but could reveal a variability affecting the whole brain. These results seem to indicate the possible existence of subgroups of patients according to the hallucination modalities involved. Indeed, we might be able to distinguish patients with visual hallucinations from those suffering mainly from auditory hallucinations by looking at the score of the second component extracted by SPCATV.
The reconstruction error is significantly lower in SPCATV than in both Sparse PCA (, ) and ElasticNet PCA (, ). Moreover, when assessing the stability of the loading vectors across the folds, we found a higher statistically significant mean Dice index in SPCATV compared to the Sparse PCA () and compared to ElasticNet PCA ’s performance () as presented in Table II.
Scores  
Methods  Reconstruction error  Dice Index 
Sparse PCA  1548**  0.42*** 
ElasticNet PCA  1518**  0.57** 
SPCATV  1474  0.70 
In conclusion, SPCATV significantly outperforms Sparse PCA and ElasticNet PCA in terms of the reconstruction error, and in the sense that loading vectors are both more clinically interpretable and more stable.
We also evaluated the convergence speed of Sparse PCA, MiniBatch Sparse PCA (a variant of Sparse PCA that is faster but less accurate), ElasticNet PCA and SPCATV for this functional MRI data set of samples and features. We compared the time of execution required for each algorithm to achieve a given level of precision in Table III. Sparse PCA and ElasticNet PCA are similar in term of convergence time, while minibatch sparse PCA is much faster but does not converge to high precision. As expected, SPCATV takes longer than other sparse methods because of the inclusion of spatial constraints, but the convergence time is still reasonable for an fMRI data set with voxels.
Time to reach a given precision in seconds  
Methods  
Minibatch Sparse PCA  5.32         
Sparse PCA  158.0  231.2  344.3  386.8  450.1 
ElasticNet  123.7  138.1  302.7  396.4  406.3 
SPCATV  427.7  2958.6  8093.0  13813.4  14459.9 
IiiC 2D meshes of cortical thickness in Alzheimer disease
Finally, SPCATV was applied to the whole brain anatomical MRI from the ADNI database, the Alzheimer’s Disease Neuroimaging Initiative, (http://adni.loni.usc.edu/). The MR scans are T1weighted MR images acquired at according to the ADNI acquisition protocol. We selected 133 patients with a diagnosis of mild cognitive impairments (MCI) from the ADNI database who converted to AD within two years during the followup period. We used PCA to reveal patterns of atrophy explaining the variability in this population. This could provide indication of possible stratification of the population into more homogeneous subgroups, that may be clinically similar, but with different brain patterns. In order to demonstrate the relevance of using SPCATV to reveal variability in any kind of imaging data, we worked on meshes of cortical thickness. The features are the cortical thickness values at each vertex of the cortical surface. Cortical thickness represents a direct index of atrophy and thus is a potentially powerful candidate to assist in the diagnosis of Alzheimer’s disease ([3], [12]). Therefore, we hypothesized that applying SPCATV to the ADNI data set would reveal important sources of variability in cortical thickness measurements. Cortical thickness measures were performed with the FreeSurfer image analysis suite (Massachusetts General Hospital, Boston, MA, USA), which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu/). The technical details of this procedure are described in [24], [9] and [2]. All the cortical thickness maps were registered onto the FreeSurfer common template (fsaverage). We decided to retrieve the first three components, accounting for 8% of the total variance. Since the next PCs offer very little increase in the total explained variance, we stopped retrieving components.
The loading vectors obtained from the data set with sparse PCA and SPCATV are presented in Figure 4. As expected, Sparse PCA loadings are not easily interpretable because the patterns are irregular and dispersed throughout the brain surface. On the contrary, SPCATV reveals structured and smooth clusters in relevant regions. The first loading vector, which maps the whole surface of the brain, can be interpreted as the variability between controls and Alzheimer converters, resulting from a global brain atrophy. The second loading vector includes variability in the entorhinal cortex, hippocampus and in temporal regions. And last, the third loading vector might be related to the atrophy of the frontal lobe and also involved variability in the precuneus. Thus, SPCATV provides a smooth map that closely matches the wellknown brain regions involved in Alzheimer’s disease.
Indeed, it is welldocumented that cortical atrophy progresses over three main stages in Alzheimer disease. The cortical structures are sequentially being affected because of the accumulation of amyloid plaques. Cortical atrophy is first observed, in the mild stage of the disease, in regions surrounding the hippocampus, as seen in the second component. Then, the disease progresses to a moderate stage; where atrophy gradually extends to the prefrontal association cortex as revealed in the third component. In the severe stage of the disease, the whole cortex is affected (first component). Therefore, SPCATV provides us with clear biomarkers, that are perfectly relevant in the scope of Alzheimer’s disease progression.
The reconstruction error is significantly lower in SPCATV than in both Sparse PCA (, ) and ElasticNet PCA (, ). The results are presented in Table IV. Moreover, when assessing the stability of the loading vectors across the folds, the mean Dice index is significantly higher in SPCATV than in both Sparse PCA () and ElasticNet PCA ().
Scores  
Methods  Reconstruction error  Dice Index 
Sparse PCA  2863***  0.57** 
ElasticNet PCA  2844***  0.64** 
SPCATV  2817  0.78 
Iv Conclusion
We proposed an extension of Sparse PCA that takes into account the spatial structure of the data. The optimization scheme is able to minimize any combination of the , , and TV penalties while preserving the exact penalty. We observe that SPCATV, in contrast to Sparse PCA and ElasticNet PCA, yields clinically interpretable results and reveals major sources of variability in data, by highlighting structured clusters of interest in the loading vectors. Furthermore, SPCATV ’s loading vectors were more stable across the learning samples compared to Sparse PCA. SPCATV was validated and its applicability was demonstrated on three distinct data sets: we may reach the conclusion that SPCATV can be used on diverse data, and is able to present structure within the data.
Supplementary Material
The ParsimonY Python library

Description: ParsimonY is Python library for structured and sparse machine learning. ParsimonY is opensource (BSD License) and compliant with scikitlearn API.
Data sets and scripts

Description: This url provides the simulation data set and the Python script used to create Fig.2 for the paper.
References
 [1] A. Abraham, E. Dohmatob, B. Thirion, D. Samaras, and G. Varoquaux. Extracting brain regions from rest fmri with totalvariation constrained dictionary learning. In Medical Image Computing and ComputerAssisted Intervention  MICCAI 2013  16th International Conference, Nagoya, Japan, 2013, Proceedings, Part II, pages 607–615, 2013.
 [2] Bruce B. Fischl, M. Sereno, and A. Dale. Cortical surfacebased analysis: Ii: Inflation, flattening, and a surfacebased coordinate system. NeuroImage, 9(2):195 – 207, 1999.
 [3] A. Bakkour, JC. Morris, and BC. Dickerson. The cortical signature of prodromal ad: regional thinning predicts mild ad dementia. Neurology, 72:1048–1055, 2009.
 [4] A. Beck and M. Teboulle. A Fast Iterative ShrinkageThresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
 [5] A. Beck and M. Teboulle. Fast gradientbased algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Processing, 18(11):2419–2434, 2009.
 [6] A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557–580, 2012.
 [7] J.M. Borwein and A.S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Springer, 2006.
 [8] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
 [9] Anders Dale, Bruce Fischl, and Martin I. Sereno. Cortical surfacebased analysis: I. segmentation and surface reconstruction. NeuroImage, 9(2):179 – 194, 1999.
 [10] A. d’Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A Direct Formulation for Sparse PCA Using Semidefinite Programming. SIAM Review, 49(3):434–448, 2007.
 [11] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26:297–302, 1945.
 [12] BC. Dickerson, E. Feczko, JC. Augustinack, J. Pacheco, JC. Morris, and B. Fischl. Differential effects of aging and alzheimer’s disease on medial temporal lobe cortical thickness and surface area. Neurobiology of aging, 30:432–440, 2009.
 [13] M. Dubois, F. HadjSelem, T. Löfstedt, M. Perrot, C. Fischer, V. Frouin, and É. Duchesnay. Predictive support recovery with TVElastic Net penalties and logistic regression: an application to structural MRI. In Proceedings of the fourth International Workshop on Pattern Recognition in Neuroimaging (PRNI 2014), 2014.
 [14] R. Guo, M. Ahn, H. Zhu, and the Alzheimerâs Disease Neuroimaging Initiative. Spatially weighted principal component analysis for imaging classification. Journal of Computational and Graphical Statistics, 24:274–296, 2015.
 [15] F. HadjSelem, T. Lofstedt, V. Frouin, V. Guillemot, and E. Duchesnay. An Iterative Smoothing Algorithm for Regression with Structured Sparsity. arXiv:1605.09658 [stat], 2016. arXiv: 1605.09658.
 [16] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
 [17] I. Jolliffe, N. Trendafilov, and M. Uddin. A Modified Principal Component Technique Based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531–547, 2003.
 [18] M. JournÃ©e, Y. Nesterov, P. RichtÃ¡rik, and R. Sepulchre. Generalized Power Method for Sparse Principal Component Analysis. J. Mach. Learn. Res., 11:517–553, 2010.
 [19] Lester W. Mackey. Deflation Methods for Sparse PCA. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1017–1024. Curran Associates, Inc., 2009.
 [20] J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, École normale supérieure, Cachan, 2010.
 [21] J. Mairal, F. Bach, Jean J. Ponce, and G. Sapiro. Online Learning for Matrix Factorization and Sparse Coding. J. Mach. Learn. Res., 11:19–60, 2010.
 [22] Y. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103(1):127–152, 2005.
 [23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 [24] J.G Sled, A.P Zijdenbos, and A.C Evans. A nonparametric method for automatic correction of intensity nonuniformity in mri data. IEEE Trans Med Imaging, 17:87–97, 1998.
 [25] W.T. Wang and H.C. Huang. Regularized Principal Component Analysis for Spatial Data. ArXiv eprints, 2015.
 [26] D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
 [27] X.Chen, L. Qihang, K. Seyoung, J. Carbonell., and E. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719–752, 2012.
 [28] H. Zou, T. Hastie, and R. Tibshirani. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.