Sparse hierarchical interaction learning with epigraphical projection
This work focuses on the regression optimization problem with hierarchical interactions between variables, which goes beyond the additive models of traditional linear regression. More specifically, we investigate two different approaches encountered in the literature to deal with this problem, “hierNet” and structural-sparsity regularization, and study their connections. We propose a primal-dual proximal algorithm based on epigraphical projection to optimize a general formulation of this learning problem. The experimental section first highlights the improvement of the proposed procedure over state-of-the-art methods based on the fast iterative shrinkage-thresholding algorithm (FISTA) or the alternating direction method of multipliers (ADMM), and second provides fair comparisons between the different hierarchical penalizations. The experiments, conducted on both synthetic and real data, clearly show that the proposed primal-dual proximal algorithm based on epigraphical projection is an efficient and effective way to solve the hierarchical interaction learning problem.
Keywords: Regression learning, hierarchical interaction, primal-dual algorithm, epigraphical projection
Learning interactions between features is of major interest in data processing, for instance in genomics [1, 2]. Graphical Lasso  is an efficient strategy to obtain a graph of interactions from the inverse covariance matrix, but its limitation is that it is not adapted to a specific task such as regression, which is the core of this contribution. To learn interactions in the context of regression, the additive sparse regression model has to be extended. Several works have been dedicated to this subject in the recent literature [4, 1, 5, 6, 7]. In this contribution we improve on these recent works by providing a common framework and an efficient minimization strategy based on epigraphical projection and primal-dual proximal algorithms.
One challenge of feature interaction learning is that the number of interactions grows quadratically with the number of features: for example, 1000 features yield about one million possible interactions, which results in an over-fitting problem due to insufficient observation samples in most situations. To overcome this weakness, the regression weights and the quadratic interactions can be assumed to be sparse, so that most features do not contribute to the decision. Sparsity-based regularization is known to be particularly suitable when the number of features is larger than the number of training samples. Various sparsity regularizations have been proposed and extensively studied in the additive model, involving for instance the -pseudo-norm , the -norm , the -norm , or structural sparsity norms . See  for an exhaustive list of sparsity-based regularizations in learning and  for structural sparsity.
This work focuses on the classical regression problem, which is still a subject of active research; for instance,  reduces the computational cost of support vector regression (SVR) by considering the effective patterns in the training data according to the distribution of nearest neighbours, and also provides a comparison between -norm SVR and sparse coding algorithms.
In this work, instead of a purely linear regression, we include a quadratic behaviour in the regression learning to improve performance. The idea of integrating quadratic interactions in a learning problem, including discrimination tasks, is not new: for example, quadratic discriminant learning [16, 17] learns the covariance matrix in the quadratic term to improve the discrimination ability. However, the idea of learning the covariance matrix combined with sparsity regularization is more recent, starting with the preliminary work by Zhao et al.  and the subsequent works [1, 5, 6, 7]. Formally, if we denote the regression training set , the criterion involved in sparse quadratic interaction learning takes the form
where models the regression weight vector, denotes the matrix of interactions, and is the regularization term including sparse regularization and other constraints such as hierarchy constraints.
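As an illustration, the prediction underlying this criterion combines a linear term in the main effects and a quadratic term in the interactions; a minimal sketch (all names are illustrative, not the paper's notation) is:

```python
def predict(x, beta, theta):
    """Prediction with main effects plus pairwise quadratic interactions:
    the linear part <x, beta> plus the quadratic part x^T Theta x."""
    linear = sum(b * v for b, v in zip(beta, x))
    quad = sum(theta[i][j] * x[i] * x[j]
               for i in range(len(x)) for j in range(len(x)))
    return linear + quad
```

The regularization term then acts jointly on the vector of main effects and on the interaction matrix.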
Our work focuses on feature selection, which aims to remove redundant features and keep only the informative ones. The relevant features are usually correlated and belong to the same group. Many different works address this problem: for instance, a multilayer perceptron (MLP) neural network with “controlled redundancy”  has been applied to feature selection. Xu et al.  proposed a semi-supervised feature selection based on a max-relevance and min-redundancy criterion; as a result, the selected features are strongly relevant to the labels in the supervised part and avoid redundancy in the unsupervised part. Feature selection by ranking with an SVM and a sparse regularization is proposed in , where a reweighted scheme is applied to handle the nonconvex regularization. In this work, we learn the interactions between the selected features through hierarchy constraints, so that interactions only occur between features of which at least one is selected (i.e. is not zero).
The hierarchy constraints relate and , as shown in Fig. 1. Two types of hierarchy constraints have been defined in “hierNet” : (i) weak hierarchy, where an interaction occurs (i.e., ) if one of the associated main features in is non-zero (i.e., or ), and (ii) strong hierarchy, where both associated features must be non-zero. Both regularizations can be written as
where the notation and denotes the indicator function
where and . The contribution of w.r.t. can be (i) reduced to a -norm with (i.e., ) or (ii) considered globally, by keeping the information contained in all the interaction coefficients with . Most state-of-the-art penalizations may be interpreted as specific cases of this general setting, for instance:
Note that  focuses on weak hierarchy using .  is encountered under the name “hierNet”, while  is named “FAMILY” and  “Type-A GRESH”. The latent overlapping group lasso formulation of  is known as “glinternet” .
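Given estimated main effects and interactions, the weak and strong hierarchy conditions can be checked directly; the following sketch (function name and tolerance are illustrative) encodes both definitions:

```python
def satisfies_hierarchy(beta, theta, mode="strong", tol=1e-12):
    """Check the hierarchy constraint: an interaction theta[i][j] != 0 is
    allowed only if at least one (weak) or both (strong) of the main
    effects beta[i], beta[j] are non-zero."""
    p = len(beta)
    for i in range(p):
        for j in range(p):
            if abs(theta[i][j]) > tol:
                bi, bj = abs(beta[i]) > tol, abs(beta[j]) > tol
                ok = (bi and bj) if mode == "strong" else (bi or bj)
                if not ok:
                    return False
    return True
```

For instance, an interaction between a selected and a non-selected feature satisfies the weak but not the strong constraint.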
From the algorithmic point of view, several strategies can be encountered. In  and , the iterations are derived from FISTA . This procedure appears to be very efficient when , while for  it requires inner iterations based on the Dykstra algorithm, which significantly slow down convergence . On the other hand,  and  resort to ADMM to deal with .
Using recent developments in non-smooth convex optimization, the objective of this work is to provide a unified framework and an efficient algorithmic strategy based on primal-dual proximal algorithms and epigraphical projection to solve (2) when , with and , i.e.
The algorithmic procedure does not resort to inner iterations, leading to a faster algorithmic scheme.
Following the arguments provided in , Section 2 derives an equivalent formulation of (4) using an epigraphical reformulation and recalls its relationship with . Section 3 presents the primal-dual proximal algorithm iterations and the associated convergence guarantees; the algorithm is then specialized to the minimization problem (1) with the regularization function (4), and new epigraphical projections are provided. Section 4 evaluates the performance of the proposed strategy compared to the FISTA and ADMM formulations proposed in [21, 1], both on synthetic and real data. Based on the new algorithmic procedure, fair comparisons between penalizations involving or can be provided. Finally, Section 5 gives our conclusions.
2 Epigraphical formulation
The minimization of (1) with the regularization term defined as in Eq. (4) is difficult, since it involves non-smooth convex functions and a symmetry constraint. In , Jenatton et al. solve the problem for in the weak hierarchy formulation by means of a proximal algorithm. However, their algorithm is hardly adaptable to strong hierarchy. In , the authors reformulate (4) when under an epigraphical formulation, which represents the problem as a set of hard constraints. This reformulation helps the interpretation: it highlights a reduction of the shrinkage of certain main effects and an increase of the shrinkage of certain interactions. Its second advantage is to help in the design of the ADMM algorithmic scheme. The merit of this formulation is to split the problem into a set of convex constraints, which allows us to deal with them by computing projections onto convex sets. The following proposition gives the epigraphical formulation of our minimization problem.
The minimization problem
where is defined in (4) can be equivalently formulated as
where denotes the positive orthant and .
The constraints involving can be reformulated as
yielding . Using , we get the expected result:
The algorithmic strategy designed in the next section focuses on the epigraphical formulation (6).
3 Primal-dual proximal algorithm
Proximal algorithms are derived from two main frameworks, namely the forward-backward scheme and the Douglas-Rachford iterations (both being deduced from the Krasnoselskii-Mann scheme) . FISTA can be presented as an accelerated version of the forward-backward iterations [22, 25], while ADMM can be viewed as Douglas-Rachford applied in the dual . In the literature dedicated to sparse regression, the most frequently encountered algorithm is FISTA, designed for a sum of two convex functions where one is differentiable with a Lipschitz gradient. When the criterion involves more than two functions, typically an additional constraint such as the constraint defined previously, most works either derive an ADMM procedure or compute the proximity operator by means of inner iterations, which is known to often lead to an approximate solution even if global convergence can be obtained in specific cases . Another class of algorithmic procedures allowing the minimization of a criterion with more than two functions, possibly including a differentiable function with a Lipschitz gradient, is the class of primal-dual proximal approaches [27, 28, 29, 30].
where denotes a bounded linear operator, where and are two Hilbert spaces, and , and denote convex, l.s.c., proper functions. We additionally assume that is differentiable with a Lipschitz-continuous gradient of constant . The iterations are summarized in Algorithm 1, involving the proximity operator defined as
and denotes the Fenchel-Rockafellar conjugate of a function . The proximity operator of the conjugate can be computed according to the Moreau identity, that is, for .
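As an illustration of the Moreau identity, take the ℓ1-norm: its proximity operator is soft-thresholding, and the identity yields the proximity operator of the conjugate (a projection onto an ℓ∞ ball) with no extra derivation. A minimal sketch (function names are ours):

```python
def prox_l1(y, gamma):
    """Proximity operator of gamma*||.||_1: componentwise soft-thresholding."""
    return [max(abs(v) - gamma, 0.0) * (1.0 if v >= 0 else -1.0) for v in y]

def prox_conj(prox, y, gamma):
    """Moreau identity: prox_{gamma f*}(y) = y - gamma * prox_{f/gamma}(y/gamma),
    here with prox(y, t) computing prox of t*f."""
    inner = prox([v / gamma for v in y], 1.0 / gamma)
    return [v - gamma * w for v, w in zip(y, inner)]
```

With f the ℓ1-norm, `prox_conj` returns the projection of y onto the unit ℓ∞ ball, for any step size gamma.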
Algorithm 1 involves two parameters and . In finite dimension, the convergence of the sequence to is ensured when
Additional technical assumptions can be found in [Theorem 3.1].
This algorithmic scheme can be extended to deal with more than three functions by setting , involving the computation of . The interest of this scheme compared to ADMM is twofold: first, it makes it possible to exploit the gradient of the differentiable function and, second, it avoids inverting .
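To fix ideas, the structure of such primal-dual iterations can be sketched on a toy instance, min_x 0.5||x − a||² + λ||x||₁, taking g = 0, h = λ||·||₁ and L the identity; the solution is then the soft-thresholding of a, which serves as a sanity check. The toy problem, function names, and step sizes are illustrative choices, not the paper's exact setting:

```python
def condat_vu(a, lam, tau=0.5, sigma=1.0, n_iter=500):
    """Primal-dual (Condat-Vu type) iterations on the toy problem
    min_x 0.5*||x - a||^2 + lam*||x||_1, with f = 0.5*||. - a||^2
    (gradient 1-Lipschitz), g = 0, h = lam*||.||_1 and L = Identity.
    The steps satisfy 1/tau - sigma*||L||^2 >= 1/2."""
    x = [0.0] * len(a)
    u = [0.0] * len(a)
    for _ in range(n_iter):
        # primal step: prox_{tau g} is the identity since g = 0
        x_new = [xi - tau * ((xi - ai) + ui) for xi, ai, ui in zip(x, a, u)]
        # dual step: prox_{sigma h*} is the projection onto the lam-radius
        # l_inf ball (Moreau identity applied to h = lam*||.||_1)
        u = [max(-lam, min(lam, ui + sigma * (2.0 * xn - xi)))
             for ui, xn, xi in zip(u, x_new, x)]
        x = x_new
    return x
```

On this instance the iterates converge geometrically to the soft-thresholded vector, without any inner loop.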
3.1 Specificity for minimizing (6)
and for strong hierarchy:
The difficulty comes from the computation of , whose expressions are provided in the next section.
3.2 New epigraphical projection
We first focus on the derivation of .
Let . The projection of onto the epigraphical set reads
where is an ordered version of sorted in ascending order, with , and such that .
According to the definition of the epigraphical projection, we solve the following minimization problem:
The inner minimization yields a simple projection of onto , i.e. ; thus Eq. (13) becomes:
where . We now focus on the computation of .
First, we sort in ascending order to obtain , such that , and ; second, can be found such that ; then , so the proximity operator of satisfies:
which leads to
The above result is obtained from arguments close to the ones derived in  [Proposition 5] for an epigraphical constraint of the form , where denotes positive weights.
The next proposition is a preliminary result used to derive .
Let . The projection of onto the epigraphical set reads
and where is an ordered version of sorted in descending order.
The Lagrangian associated with the function involved in is:
The KKT conditions are:
From the fourth condition in Eq. (21), we know that if , then and . Thus, if we denote by an ordered version of sorted in decreasing order, we can write
From the second and third equalities of the KKT conditions, we have . Finally, combining (23) with yields
Next, we obtain the solution of from the projection onto according to [Lemma 3], leading to the following proposition:
The projection of onto the epigraphical set reads
where and denote the componentwise absolute value and the componentwise sign.
This result can be directly derived from [Lemma 3]; the proof is therefore omitted.
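A classical example of such a sorting-based epigraphical projection is the projection onto the epigraph of the max function, {(v, t) : max_i v_i ≤ t}. The sketch below follows the same descending-sort pattern as the propositions above; it is a standard construction, not the paper's exact operator:

```python
def proj_epi_max(y, zeta):
    """Euclidean projection of (y, zeta) onto {(v, t): max_i v_i <= t}.
    Sort y in descending order and average the leading entries with zeta
    until the averaged level dominates the remaining entries."""
    if max(y) <= zeta:
        return list(y), zeta          # already inside the epigraph
    ys = sorted(y, reverse=True)
    csum, t = 0.0, zeta
    for k in range(1, len(ys) + 1):
        csum += ys[k - 1]
        t = (zeta + csum) / (k + 1)   # candidate level using the top k entries
        if k == len(ys) or t >= ys[k]:
            break                     # remaining entries are already below t
    return [min(v, t) for v in y], t
```

Entries above the optimal level are clipped to it, so the projected point lies exactly on the boundary of the epigraph when the input is outside.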
In this section, we compare both proposed approaches ( and ) with the two closest state-of-the-art procedures, namely “hierNet”  and Jenatton’s framework . The comparisons are of two types: convergence behaviour and estimation performance.
4.1 Simulated data
The dataset is created according to . It is initially composed of main features and interactions (due to symmetry). We denote by ( the ground truth, generated according to strong hierarchy. denotes a sparse vector whose non-zero values are associated with randomly selected indexes. The non-zero values are randomly selected from the set . Due to the strong hierarchy constraint, is non-zero only for entries whose main effects are non-zero. The values are randomly chosen from the set .
The algorithmic performance is evaluated on two simulated datasets. The first one is composed of features, where the first 10 main features are non-zero and the interaction ratio is (Dataset30-345). The second dataset is of size , where the first 30 features are non-zero and the interaction ratio is (Dataset100-030).
Each dataset is composed of a training, a validation, and a testing set with 100 samples each. is randomly generated according to the normal distribution . The observation value for each sample is set to , where is independent Gaussian noise making the signal-to-noise ratio approximately equal to dB.
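The generation protocol described above can be sketched as follows; the dimensions, value sets, and interaction ratio below are illustrative placeholders, not the paper's exact settings:

```python
import random

def make_strong_hierarchy_data(p=30, k=10, n=100, noise=0.1, seed=0):
    """Toy generator in the spirit of the protocol above: sparse main
    effects beta (first k entries non-zero), interactions theta allowed
    only between non-zero main effects (strong hierarchy by construction),
    observations = linear + quadratic part + Gaussian noise."""
    rng = random.Random(seed)
    beta = [rng.choice([-1.0, 1.0]) if i < k else 0.0 for i in range(p)]
    theta = [[0.0] * p for _ in range(p)]
    for i in range(k):
        for j in range(i + 1, k):
            if rng.random() < 0.3:                 # interaction ratio (assumed)
                theta[i][j] = theta[j][i] = rng.choice([-0.5, 0.5])
    X, Y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        y = sum(b * v for b, v in zip(beta, x))
        y += sum(theta[i][j] * x[i] * x[j]
                 for i in range(p) for j in range(p))
        X.append(x)
        Y.append(y + rng.gauss(0.0, noise))
    return X, Y, beta, theta
```

By construction, every non-zero interaction links two non-zero main effects, so the ground truth satisfies the strong hierarchy constraint exactly.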
Comparison with  – and weak hierarchy
In order to provide a fair comparison with the work in , we first investigate the performance of the proposed primal-dual proximal algorithm (Algorithm 2) when and in the weak hierarchy configuration (no symmetry constraint is considered, i.e. ): “Weak-PD-”. The algorithmic procedure designed in  is based on FISTA and will be named “Weak-FISTA-”. Both are compared in two respects: the convergence of the objective function (5) and the convergence of the iterates (i.e. ).
The comparisons are displayed in Fig. 2 (for Dataset30-345) and in Fig. 3 (for Dataset100-030). The convergence behaviours are presented both w.r.t. iterations (first row) and time (second row). It is observed that:
From the objective function convergence point of view, “Weak-FISTA-” converges faster than “Weak-PD-” w.r.t. both iteration number and time.
From the convergence of the iterates point of view (second column of Figs. 2 and 3), “Weak-PD-” converges much faster than “Weak-FISTA-” to the optimal solution, especially in terms of time (bottom figures). This shows that the proposed algorithm gets closest to the optimal solution in less time;
The conclusions are very similar for both datasets (Dataset30-345 and Dataset100-030), suggesting that the proposed method is robust to dimensionality.
Contrary to “Weak-FISTA-”, the proposed solution has a “Strong-PD-” counterpart without inner iterations.
Comparison with  – and strong hierarchy
Similar experiments are conducted for and strong hierarchy (i.e., ). The proposed algorithm (Algorithm 3 with ) is called “Strong-PD-” and is compared with “hierNet”, whose iterations are derived from an ADMM scheme, named “Strong-ADMM-” in our experiments.
The comparisons between these two schemes are displayed in Fig. 4 (for Dataset30-345) and Fig. 5 (for Dataset100-030). They are compared in three respects: the convergence of the objective function (5), the convergence of the iterates (i.e. ), and the distance to the set . The evolutions are displayed w.r.t. iteration number and time. It is observed that:
“Strong-PD-” and “Strong-ADMM-” converge to the same solution, but the convergence of the objective to the optimal value is very sensitive to the trade-off hyper-parameter for “Strong-ADMM-”.
“Strong-PD-” is always faster, both in iterations and in time, and both in terms of objective values and of iterates. The explanation mainly comes from the inner iterations required by “Strong-ADMM-” when dealing with ;
Discussion regarding the choice of
From the above algorithmic comparisons, we have observed that the proposed method, either for or , delivers an accurate solution (as the other algorithmic strategies do) but faster, and with the possibility to include the strong hierarchy constraint without inner iterations. In the following experiments, we thus focus on comparisons between weak/strong hierarchy and or using the proposed algorithm (“Weak-PD-”, “Weak-PD-”, “Strong-PD-”, “Strong-PD-”). We perform 20 simulations; the average performance and the associated variances are presented in Fig. 6 for different values of the regularization parameter . Moreover, the performance obtained by cross-validation is presented in Table 1.
The good behaviour of the strong hierarchy constraint clearly appears for both datasets. Indeed, recall that the data have been created with a strong hierarchy structure, and we clearly observe that, for both datasets, the performance with the strong hierarchy constraint is always better, either for or .
In our set of experiments, the regularization with always leads to better performance than the one with .
The MSE values associated with Dataset100-030 are larger, probably due to overfitting. This assumption can be validated by the results obtained on the training dataset.
4.2 Application to HIV Data
In this section, we apply the proposed algorithms to an HIV dataset on the susceptibility of the HIV-1 virus to six nucleoside reverse transcriptase inhibitors (NRTIs).
Following , we first work on a reduced dataset over bins of ten adjacent loci, rather than all 240 genomic locations, since nearby genomes have similar effects on drug susceptibility; this also leads to less sparse data. For this first set of experiments we thus have features. The value of each bin is set to 1 if one of the genomes in that bin undergoes a mutation.
The dataset is randomly split into two halves, for training and testing respectively, as adopted in [1, 5]. We report the average MSE on the test set over 5 splits under different for “Weak-PD-”, “Strong-PD-”, “Weak-PD-” and “Strong-PD-” in Fig. 7 (left).
It can be observed that “Strong-PD-” and “Strong-PD-” are slightly better than their weak counterparts, and the best for “Strong-PD-” and “Strong-PD-” is 12 and 10, respectively. The interactions for one realization are visualized in Fig. 8 (left). In order to better visualize the effect of the penalizations, we also plot the -norm and -norm over each row of for strong hierarchy, as shown in the middle and right of Fig. 8. Both have a strong effect for the 19th feature, which is consistent with the results observed in . We also observe an interesting fact in the visualization of the interactions: “Strong-PD-” is able to learn a sparser interaction matrix than “Strong-PD-”; however, the maximum strength (i.e. the -norm) and the sum of strengths (-norm) in Fig. 8 have similar distributions in both cases. A possible reason is that the extra interactions learned by “Strong-PD-” are subtle, which is also confirmed by the very similar performance of both in Fig. 7 (left).
We further work on the full data without the binning operation, leading to features, each feature giving the indicator of a mutation. The data splitting is the same as previously described. The average MSE on the test set in the four situations is shown in Fig. 7 (right). It clearly demonstrates that strong hierarchy obtains better performance than the weak one. The learned interactions and their feature effects for “Strong-PD-” and “Strong-PD-” with the best are shown in Fig. 9. Both algorithms detect that genomic location 184 has the strongest effect. From the maximum strength and the sum of strengths for each gene in Fig. 9, we can see the different properties of both algorithms. For “Strong-PD-”, the maximum strength for each gene has a sparser distribution than the sum of strengths, which is consistent with its objective that the maximum of absolute values of