Sparse hierarchical interaction learning with epigraphical projection

Sparse hierarchical interaction learning with epigraphical projection


This work focus on regression optimization problem with hierarchical interactions between variables, which is beyond the additive models in the traditional linear regression. We investigate more specifically two different fashions encountered in the literature to deal with this problem: “hierNet” and structural-sparsity regularization, and study their connections. We propose a primal-dual proximal algorithm based on epigraphical projection to optimize a general formulation of this learning problem. The experimental setting first highlights the improvement of the proposed procedure compared to state-of-the-art methods based on fast iterative shrinkage-thresholding algorithm (i.e. FISTA) or alternating direction method of multipliers (i.e. ADMM) and second we provide fair comparisons between the different hierarchical penalizations. The experiments are conducted both on the synthetic and real data, and they clearly show that the proposed primal-dual proximal algorithm based on epigraphical projection is efficient and effective to solve and investigate the question of the hierarchical interaction learning problem.


Regression learning, hierarchical interaction, primal-dual algorithm, epigraphical projection


1 Introduction


Learn interactions between features is of main interest in data processing such as for genomics [1, 2]. Graphical Lasso [3] is an efficient strategy to obtain a graph of interactions from the inverse covariance matrix but its limitation is to not being adapted for a specific task such as regression, which is the core of this contribution. To learn the interactions in the context of regression, additive sparse regression model has to be extended. Several work have been dedicated to this subject in a recent literature [4, 1, 5, 6, 7]. In this contribution we improve on these recent works by providing a common framework and an efficient minimization strategy based on epigraphical projection and primal-dual proximal algorithms.

One challenge for feature interaction learning is that the interaction number quadratically increases as the feature size, for example, 1000 features will have one million possible interactions, this therefore results into an over-fitting problem due to insufficient observation samples in most situations. To overcome this weakness, the regression weights and the quadratic interactions can be assumed to be sparse such that most features would not contribute to the decision. Sparsity-based regularization is known to be mainly suitable when the feature size is larger than the training samples. Various sparsity regularizations have been proposed and extensively studied in the additive model, for instance, involving -pseudo-norm [8], -norm [9], -norm [10], or structural sparsity norm [11]. See [12] for an exhaustive list of sparse-based regularization in learning and [13] to refer to structural sparsity.

This work focuses on classical regression problem, which is still a subject of active research, for instance, [14] in order to reduce the computational cost of support vector regression (SVR) by considering the effective patterns in the training data according to the distribution of nearest neighbours; but also[15] providing a comparisons between -norm SVR and sparse coding algorithm.

In this work, instead of linear regression, we include a quadratic behaviour in the regression learning to improve the performance. The idea to integrate quadratic interactions in the learning problem including discrimination task is not new, for example, discriminant quadratic learning [16, 17] learns the covariance matrix in the quadratic term to improve the discrimination ability. However, the idea to learn covariance matrix combined with sparsity regularization is more recent with the preliminary work by Zhao et al. [4] and the following works [1, 5, 6, 7]. Formally, if we denote the regression training set , the criterion involved in sparse quadratic interactions learning takes the form


where models the regression weight vector, denotes the matrix of interactions, and is the regularization term including sparse regularization and other constraints such as hierarchy constraints.

Our work focusses on feature selection, which aims to remove the redundant features and only keep the informative ones. The relevant features are usually correlated and belong to the same group. Many different works have been done in this respect, for instance, multilayer perceptron (MLP) neural network with “controlled redundancy” [18] is applied to feature selection. Xu et al. [19] proposed a semisupervised feature selection by max-relevance and min-redundancy criterion, as a result, the selected features are strong relevance to the labels under supervised manner and avoid redundancy under unsupervised manner. Feature selection by ranking with SVM with a sparse regularization is proposed in [20], and a reweighted scheme is apply to address nonconvex regularization. In this work, we learn the interactions between the selected features by making some hierarchy constraints, where the interactions only occur between the features, at least one of which is selected (i.e. is not zero).

The hierarchy constraints make the relationship between and , as shown in Fig. 1. Two types of hierarchy constraints have been defined in “hierNet” [1]: (i) weak hierarchy where the interactions happen (i.e., ) if one of the associating main features in is non-zero (i.e., or ) and (ii) strong hierarchy when both associating features are non-zero. Both regularizations can be written as


where the notation and denotes the indicator function1 of a closed convex set that can either model an unconstrained problem or impose symmetric structure for matrix of interactions using for weak or strong hierarchy constraints, respectively. In [7], a more general framework is derived taking the form


where and . The contribution on w.r.t. can be (i) reduced to a -norm with (i.e., ) or (ii) considered globally by keeping the information contained in all the interaction coefficients considering . Most of the state-of-the-art penalization may be interpreted as a specific case of this general setting, for instance:

  • [21] : , ,

  • [1, 5] : and ,

  • [7] : and .

Note that [21] focus on weak hierarchy using . [1] is encountered under the name “hierNet”, while [5] is named “FAMILY” and [7] is “Type-A GRESH”. The latent overlapping group lasso formulation of [1] is known as “glinternet” [6].

Figure 1: Hierarchical feature interaction. The first row is additive features, and the second row is the interactions between pairs of features.

From the algorithmic point of view several strategies can be encountered. In [21] and [7], the iterations are derived from FISTA [22]. This procedure appears to be very efficient when while for it requires inner iterations based on Dykstra algorithm that slower significantly the convergence [23]. On the other hand, [1] and [5] resort to ADMM to deal with .

Using recent development of non-smooth convex optimization, the objective of this work is to provide a unified framework and an efficient algorithmic strategy based on primal-dual proximal algorithms and epigraphical projection to solve (2) when , with and , i.e.


The algorithmic procedure does not resort to inner iterations leading to faster algorithmic procedure.2

Following arguments provided in [1], Section 2 derives an equivalent writing of (4) using epigraphical formulation and recall the relationship with [21]. Section 3 presents the primal-dual proximal algorithm iterations and the convergence guarantees. Its specification for solving the minimization problem (1) with regularization function (4) is specified and new epigraphical projections are provided. Section 4 evaluates the performance of the proposed strategy compared to FISTA and ADMM formulations proposed in [21, 1] both on synthetic data and real data. Based on the new algorithmic procedure, fair comparisons between a penalization involving or can be provided. Finally Section 5 gives our conclusion.

2 Epigraphical formulation

The minimization of (1) with the regularization term defined as in Eq. (4) is difficulty, since it involves non-smooth convex functions and symmetry constraint. In [21], Jenatton et al. solve the problem for in the weak hierarchy formulation by means of proximal algorithm. However, their algorithm is hardly adaptive to strong hierarchy. In [1], the authors reformulate (4) when under an epigraphical formulation, which represents the problem as a set of hard constraints. This reformulation helps in the interpretation and it highlights that a reduction in the shrinkage of certain main effects and an increase in the shrinkage of certain interactions. Its second advantage is to help in the design of the ADMM algorithmic scheme. The merit of this formulation is to split the problem as a set of convex constraints, and it allows us to deal with them by computing their projection onto convex sets. The following proposition gives the epigraphical formulation of our minimization problem.

Proposition 1.

The minimization problem


where is defined in (4) can be equivalently formulated as


where denotes the positive orthant and .

The proof follows arguments derived in [1] for the specific case . Let , (6) can be equivalently written as


The constraints involving can be reformulated as

yielding to . Using , we get the expected result:


The algorithmic strategy designed in next section focus on the epigraphical formulation (6).

3 Primal-dual proximal algorithm

Proximal algorithms are derived from two main frameworks that are forward-backward scheme and Douglas-Rachford iterations (both being deduced from Krasnoselskii-Mann scheme) [24]. FISTA can be presented as an accelerated version of forward-backward iterations [22, 25] while ADMM can be viewed as a Douglas-Rachford in the dual [26]. In the literature dedicated to sparse regression the most encountered algorithm is FISTA when dealing with a sum of two convex functions where one is differentiable with a Lipschitz gradient. When the criterion involves more than two functions, typically additional constraint such as the constraint defined previously, most of the works derive an ADMM procedure or compute the proximity operator by mean of inner iterations which is known to often lead to an approximate solution even if global convergence can be obtained in specific cases [23]. Another class of algorithmic procedures allowing to minimize a criterion with more that two functions, possibly including a differentiable function with a Lipschitz gradient, is the class of primal-dual proximal approaches [27, 28, 29, 30].

Several primal-dual proximal schemes have been derived but one of the most popular is based on forward-backward iterations [29, 30] in order to estimate


where denotes a bounded linear operator, where and being two Hilbert spaces, , and denote convex, l.s.c and proper functions. We additionally assume that is differentiable with a Lipschitz constant denoted . Iterations are summarized in Algorithm 1 involving the proximity operator defined as

and denotes the Fenchel-Rockafellar conjugate of a function . The proximity operator of the conjugate can be computed according to Moreau identity that is for .

Algorithm 1 involves two parameters and . In finite dimensions, the convergence of the sequence to is insured when

Additional technical assumptions can be found in [29][Theorem 3.1].

This algorithmic scheme can be extended for dealing with more than three functions by setting involving the computation of . The interest of this scheme compared to ADMM is twofold: it first makes the possibility to deal with the gradient of the differentiable function and second it avoids to invert .

Initialization: Set , ,

Algorithm 1 Primal-dual splitting algorithm

3.1 Specificity for minimizing (6)

We first provide the iterations of Algorithm 1 specified to the minimization of (6). We set , for weak hierarchy, we can split it as follows:


and for strong hierarchy:


The respective iterations are summarized in Algorithm 2 and 3. The projection onto the positive orthant and on have well known closed form expressions [24] that are:

The difficulty comes from the computation of whose expressions are provided in next section.

Parameter settings: Set , , such that .
Initialization: .

Algorithm 2 – Weak-PD- – Primal-dual splitting algorithm to solve (8) when

Parameter settings: Set , , such that .
Initialization: .

Algorithm 3 – Strong-PD- – Primal-dual splitting algorithm to solve (8) when

3.2 New epigraphical projection

We first focus on the derivation of .

Proposition 2.

Let . The projection of on the epigraphic set reads



where is an ordered version of in an ascending order and with , and such that .


According to the definition of the epigraphical projection, we solve the following minimization problem:


The inner minimization yields a simple projection for to , i.e. , thus Eq. (13) becomes:


and thus


where . We now focus on the computation of .
First, we sort in ascending order to be , such that , and also ; and second can be found such that , then , so the proximity operator of has:


and that leads to


The above result is obtained from arguments closed than the ones derived in [31] [Proposition 5] for an epigraphical constraint of the form where denotes positive weights.

Next proposition is a preliminary result to derive .

Proposition 3.

Let . The projection of on the epigraphic set reads





and where is an ordered version of in an descending order.


The Lagrangian associated to the function involved in is:


The KKT conditions are:


From the fourth condition in Eq. (21), we know that if then and . Thus, if we denote is an ordered version of in a decreasing order, we can write


From similar arguments than in [32][Lemma 2] , is computed as (19) and thus


From the second and third equality of the KKT conditions, we have . Finally, combining (23) with , yields to


Next we get the solution of from the ones of projection to according to [32][Lemma 3] and have the following proposition:

Proposition 4.

The projection of on the epigraphic set reads



where and denote the componentwise absolute value and the componentwise sign.

This result can be directly derived from [32][Lemma 3] and the proof is therefore omitted here.

Integrating the projection derived in Proposition 2 into Algorithm 2 or Proposition 3 and Proposition 4 into Algorithm 3 with

insure the convergence to a minimizer of (6) and, from Proposition 1, to a minimizer of (5), respectively for weak hierarchy and strong hierarchy.

4 Experiments

In this section, we provide comparison results of both and proposed approaches with the two closest state-of-the-art procedures, that are “hierNet” [1] and Jenatton’s framework [21]. Comparisons are of two types, comparisons in terms of convergence behavior and in term of performing estimation.

4.1 Simulated data


The dataset is created according to [5]. It is initially composed of main features and of interactions (due to symmetry). We denote ( the ground truth generated according to strong hierarchy. denotes a sparse vector where the non-zero values are associated to randomly selected indexes. The value for the non-zero is randomly selected from the set . Due to strong hierarchy constraint, is only non-zero for those whose main effects are not zero. The values are randomly chosen from the set .

The algorithmic performance is evaluated on two simulated datasets. The first one is composed with features, the first 10 main features are non-zeros and the interaction ratio is (Dataset30-345). The second dataset is of size , the first 30 features are non-zeros and the interaction ratio is (Dataset100-030).

The datasets are composed with a training, a validation and a testing set with 100 samples each. is randomly generated according to normal distribution . The observation value for each sample is set by , where is independent Gaussian noise to make the signal-to-noise ratio approximately equals to dB.

Comparison with [21] and weak hierarchy

In order to provide fair comparison with the work in [21], we first investigate the performance of proposed primal-dual proximal algorithm (Algorithm 2) when and in the configuration of weak hierarchy (no symmetry constraint is considered, i.e. ): “Weak-PD-”. The algorithmic procedure designed in [21] is based on “FISTA” and it will be named “Weak-FISTA-”. Both are compared in two respects: the convergence of the objective function (5) and the convergence of the iterates (i.e. ).

Figure 2: Comparison between the proposed “Weak-PD-” and “Weak-FISTA-” on Dataset30-345 for . (top-left) Objective function (5) w.r.t iterations, (bottom-left) Objective function (5) w.r.t time, (top-right) Distance w.r.t iterations, (bottom-right) Distance w.r.t time.

The comparisons are displayed in Fig. 2 (for Dataset30-345) and in Fig. 3 (for Dataset100-030). The convergence behaviour are presented both w.r.t iterations (first row) and time (second row). It is observed that:

  1. “Weak-PD-” and “Weak-FISTA-” converge to the same value as shown in the first column of both Fig. 2 and 3;

  2. From the objective function convergence point-of-view, “Weak-FISTA-” converges faster than ’Weak-PD-” w.r.t iteration numbers and also time.

  3. From the convergence of the iterates point of view (second column of Fig. 2 and 3), “Weak-PD-” converges much faster than “Weak-FISTA-” to the optimal solution, especially from the comparison in terms of time (bottom figures). It shows that the proposed algorithm enables to be closest to the optimal solution with less time;

  4. The conclusions are very close for both datasets (Dataset30-345 and Dataset100-030) leading to the conclusion that the proposed method is robust to dimensionality.

  5. Contrary to“Weak-FISTA-”, the proposed solution has a ’Strong-PD-” counterpart without inner iterations.

Figure 3: Comparison between the proposed “Weak-PD-” and “Weak-FISTA- on Dataset100-030 for . (top-left) Objective function (5) w.r.t iterations, (bottom-left) Objective function (5) w.r.t time, (top-right) Distance w.r.t iterations, (bottom-right) Distance w.r.t time.
Figure 4: Comparison between the proposed “Strong-PD-” and “Strong-ADMM-” on Dataset30-345 for . (top-left) Objective function (5) w.r.t iterations, (bottom-left) Objective function (5) w.r.t time, (top-middle) Distance w.r.t iterations, (bottom-middle) Distance w.r.t time, (top-right) Distance to the strong hierarchy constraint w.r.t iterations, (bottom-right) Distance to the strong hierarchy constraint w.r.t. time.
Figure 5: Comparison between the proposed “Strong-PD-” and “Strong-ADMM-” on Dataset100-030 for . (top-left) Objective function (5) w.r.t iterations, (bottom-left) Objective function (5) w.r.t time, (top-middle) Distance w.r.t iterations, (bottom-middle) Distance w.r.t time, (top-right) Distance to the strong hierarchy constraint w.r.t iterations, (bottom-right) Distance to the strong hierarchy constraint w.r.t. time.

Comparison with [1] and strong hierarchy

Similar experiments are conducted for and strong hierarchy (i.e., ). The proposed algorithm (Algorithm 3 with ) is called “Strong-PD-” and it is compared with “hierNet” whose iterations are derived from an ADMM scheme, named“Strong-ADMM-” in our experiments.

The comparisons between these two schemes are displayed in Fig. 4 (for Dataset30-345) and Fig. 5 (for Dataset100-030). Both are compared in three respects: the convergence of the objective function (5), the convergence of the iterates (i.e. ), and the distance to the set . The evolutions are displayed w.r.t iteration number and time. It is observed that:

  1. “Strong-PD-” and “Strong-ADMM-” converge to the same solution but the convergence of the objective to the optimum solution is very sensitive to the trade-off hyper-parameter for“Strong-ADMM-”.

  2. “Strong-PD-” is always faster either in iteration or in time and either in terms of functional or in terms of iterations. The explanation mainly comes from the inner iterations required with “Strong-ADMM-” when dealing with ;

  3. Third column of Fig. 4 and  5 highlights that the constraint violations (i.e. distance to ) with the proposed method is always smaller than with ADMM.

Figure 6: The performance in MSE on the validation set obtained with the four derived algorithms (“Weak-PD-” , “Weak-PD-” , “Strong-PD-” , “Weak-PD-” ) allowing to compare properly the different configurations of the regularization term (4). (left) Dataset30-345 (right) Dataset100-030).

Discussion regarding the choice of

From the above algorithmic comparisons we have observed that the proposed method either for or deliver an accurate solution (as other algorithmic strategies) but faster and with the possibility to include the strong-hierarchy constraint without inner iterations. In the following experiments, we will thus focus on comparisons between weak/strong or using the proposed algorithm (“Weak-PD-” , “Weak-PD-” , “Strong-PD-” , “Strong-PD-”). We perform 20 simulations and the average performance with the associated variances are presented in Fig. 6 for different values of the regularization parameter . Moreover, the performance obtained by cross-validation are presented Table 1.

  1. The good behaviour of the strong hierarchy constraint clearly appear for both datasets. Indeed, we recall that the data have been created with strong hierarchy structure and we can clearly observe that for both datasets the performance with the strong hierarchy constraint are always better either for or .

  2. In our set of experiments, the regularization with always leads to better performance than with .

  3. MSE values associated with the Dataset100-030 are larger, probably due to overfitting. This assumption can be validated by the results obtained on the training dataset.

4.2 Application to HIV Data

In this section, we apply the proposed algorithms to HIV dataset on the susceptibility of the HIV-1 virus to six nucleoside reverse transcriptase inhibitors (NRTIs). This dataset3 is collected by [33] and it is used to model HIV-1 susceptibility to the drugs because HIV-1 virus can become resistant through genome mutations for different subjects. In the dataset, there are 639 subjects and 240 genomic locations. For each observation, the mutation status at each genomic location is recorded and also give six drugs (log) susceptibility responses. In this work, we focus on the prediction model for 3TC drug.

Data Approach TR VAL TE
Dataset30-347 Weak-PD- 14 0.1660.013 0.6060.072 0.5960.080
Strong-PD- 16 0.1930.013 0.5330.059 0.5360.062
weak-PD- 8 0.2050.017 0.4760.046 0.4820.051
Strong-PD- 8 0.2030.017 0.4710.046 0.4770.052
Dataset100-030 Weak-PD- 8 0.0210.001 2.3901.216 2.2821.125
Strong-PD- 12 0.0470.002 1.0330.235 1.0280.280
weak-PD- 10 0.1090.006 1.0500.217 1.0620.251
Strong-PD- 10 0.1130.006 0.9180.179 0.9290.199
Table 1: The comparison results (MSE) on the train, validation and test set of different algorithms under best selected when both datasets are used.

Reduced dataset

Followed by [5], we first work on a reduced dataset over the bins of ten adjacent loci, rather than all 240 genomic locations, resulting from the fact that nearby genomes have similar effects to the drug susceptibility, and also leading to less sparse data. So for the first set of experiments we have features. The value for each bin is set to 1 if one of genomes in that bin occurs mutation.

The dataset is randomly split into two half sets for training and test respectively as adopted by [1, 5]. We report the average performance of MSE on the test set over 5 splits under different for “Weak-PD-”, “Strong-PD-”, “Weak-PD-” and “Strong-PD-” in Fig. 7 (left).

It can be observed that “Strong-PD-” and “Strong-PD-” are slightly better than their weak counterparts, and the best for “Strong-PD-” and “Strong-PD-” are 12 and 10 respectively. The interactions for one realization are visualized in the Fig. 8 (left). In order to better visualize the effect of the penalizations, we also plot the -norm and -norm over each row of for strong hierarchy, as shown in the middle and right of Fig. 8. Both have the strong effect for 19 feature, which is consistent with the results observed in [5]. We also observe an interesting fact from the visualization of interactions, it can been seen that “Strong-PD-” is able to learn a sparser interactions matrix than “Strong-PD-”, however, the maximum strength (i.e -norm) and the sum of strength (- norm) in Fig. 8 have similar distributions for both cases. The possible reason is that the extra interactions learned by “Strong-PD-” are subtle, it can be also confirmed both performance are very similar in the Fig. 7 (left).

Full dataset

We further work on the full data without binning operation leading to features and each feature gives the indicator of mutation. The data splitting is the same as the one previously described. The average performance of MSE on the test set under four situations are shown in Fig. 7 (right). It is clear to demonstrate that strong hierarchy obtains better performance than weak one. The learned interactions and their feature effects from “Strong-PD-” and “Strong-PD-” with best are shown in Fig. 9. It is found that both algorithms can detect that 184 genomic location has the most strong effect. From the maximum strength and sum of strength for each gene in the Fig. 9, we can see the different properties for both algorithms. For “Strong-PD-”, maximum strength for each gene has a more sparse distribution than sum of strength, which is consistent with its objective that the maximum of absolute values of