A Learning Theory Approach to a Computationally Efficient Parameter Selection for the Elastic Net


E. De Vito (Università di Genova), email: devito@dima.unige.it; Željko Kereta, email: zeljko@simula.no; Valeriya Naumova, email: valeriya@simula.no
Abstract

Despite recent advances in regularisation theory, the issue of parameter selection still remains a challenge for most applications. In a recent work, the framework of statistical learning was used to approximate the optimal Tikhonov regularisation parameter from noisy data. In this work, we improve these results and extend the analysis to elastic net regularisation, providing explicit error bounds on the accuracy of the approximated parameter and the corresponding regularised solution in a simplified case. Furthermore, in the general case we design a data-driven, automated algorithm for the computation of an approximate regularisation parameter. Our analysis combines statistical learning theory with insights from regularisation theory. We compare our approach with state-of-the-art parameter selection criteria and illustrate its superiority in terms of accuracy and computational time on simulated and real data sets.

Keywords: parameter selection, elastic net regularisation, iterative thresholding, sub-gaussian vectors, matrix concentration inequalities

1 Introduction

Many applications deal with the problem of recovering an unknown quantity of interest from corrupted observations. In most cases the relationship between the unknown and the observations can be approximately described by the following inverse problem

where A is a known matrix describing the measuring apparatus, the noise is modelled as a zero-mean isotropic random vector, and its magnitude is controlled by the noise level. Inverse problems of this type are ubiquitous in image processing tasks such as denoising, where A is an identity operator; deblurring, where A is a convolution operator; and inpainting, where A is a masking operator.
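As a concrete illustration, the following sketch builds toy versions of these three forward operators and generates observations from them; the dimensions, blur kernel, and Gaussian noise model are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128        # signal dimension (illustrative)
sigma = 0.05   # noise level (illustrative)

x_true = rng.standard_normal(n)

# Denoising: A is the identity operator.
A_denoise = np.eye(n)

# Deblurring: A is a (circular) convolution with a small blur kernel.
kernel = np.array([0.25, 0.5, 0.25])
A_deblur = np.zeros((n, n))
for i in range(n):
    for j, k in enumerate(kernel):
        A_deblur[i, (i + j - 1) % n] = k

# Inpainting: A is a masking operator keeping a random subset of entries.
mask = rng.random(n) < 0.7
A_inpaint = np.diag(mask.astype(float))

for A in (A_denoise, A_deblur, A_inpaint):
    eps = rng.standard_normal(n)      # zero-mean isotropic noise
    y = A @ x_true + sigma * eps      # corrupted observation
```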

The recovery of the original signal from the corrupted observation is an ill-posed inverse problem. Thus, the theory of inverse problems suggests the use of suitable regularization techniques [18]. Specifically, we are interested in approximating the exact solution with the minimizer of the following functional

(1)

where the first term is the Euclidean norm modelling data fidelity, the second is a penalty term encoding a priori knowledge about the nature of the true solution, and the regularization parameter determines the trade-off between these two terms. With the penalty term fixed, a central issue is the selection of the regularization parameter. The optimal parameter is the one that minimizes the discrepancy between the minimizer of (1) and the exact solution

(2)

Unfortunately, the exact solution is unknown and there is often no information regarding either the noise or the noise level. Hence, the optimal parameter needs to be approximated.
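For a synthetic problem where the exact solution is available, the optimal parameter in (2) can be found by brute force. The sketch below does this for Tikhonov regularization (chosen only because its minimizer has a closed form); all dimensions and the parameter grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, sigma = 60, 40, 0.1
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = rng.standard_normal(n)
y = A @ x_true + sigma * rng.standard_normal(m)

# Oracle search for the optimal parameter in (2): for each lambda compute
# the regularized solution (here Tikhonov, which has a closed form) and
# measure its distance to the exact solution.  This is only possible
# because x_true is known in this synthetic example.
lambdas = np.logspace(-4, 1, 50)
errors = []
for lam in lambdas:
    x_lam = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)
    errors.append(np.linalg.norm(x_lam - x_true))
lam_opt = lambdas[int(np.argmin(errors))]
print("oracle lambda:", lam_opt)
```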

Choosing a good approximation to the optimal parameter is a non-trivial, problem-dependent task that has motivated significant research on parameter selection over the last decades. However, there is still no universal framework for fast and efficient data-driven parameter selection. In this paper, we aim at (partially) closing this gap and provide a novel concept for automated parameter selection by recasting the problem into the framework of statistical learning theory. Specifically, inspired by very recent and (to our knowledge) first results in this spirit for Tikhonov regularization [11], we propose a method for learning the optimal parameter for elastic net regularization [39] from a dataset of corrupted observations.

Existing parameter selection methods.

Parameter selection rules used in theory and in practice can be broadly classified as those based on the discrepancy principle ([28, 1] and references therein), generalized cross-validation (GCV) [22], the balancing principle [26, 36], quasi-optimality [33, 24], and various estimates of the mean-squared error (MSE) (see [29, 12] and references therein). GCV is a popular parameter selection rule for linear methods since it gives a closed form of the regularization parameter and does not require tuning of any additional parameters or knowledge of the noise. Follow-up studies [38] extended GCV to nonlinear estimators, but those cover specialized cases and do not provide closed-form or implicit equations for computing the regularization parameter. The balancing principle has received a lot of attention in the inverse problems community and has also been studied in the framework of learning theory. Quasi-optimality is one of the simplest available parameter choice methods. Though it is not as stable as the balancing principle, it does not require any information about the problem. Discrepancy and MSE-based principles still remain the preferred methods for parameter selection for nonlinear estimators due to their simplicity and accuracy. We refer to [3] for a recent, rather comprehensive comparative study of existing approaches.

In order to select the regularization parameter, most existing methods require the regularized solution to be computed for various parameter values, i.e., over a predefined grid. The 'optimal' regularization parameter is then chosen according to some criterion, e.g., the loss over a validation set. Finding regularization parameters by an exhaustive search is a computationally expensive task, especially in the high-dimensional data scenario, with often no guarantees on the quality of the approximation. Moreover, selection criteria often presuppose that some a priori information is readily available, such as an accurate estimate of the noise level (e.g., in the discrepancy principle) or bounds on the noise error (e.g., in the balancing principle). Furthermore, most parameter selection methods require additional, method-specific parameters to be preselected. Considering these issues, it is fair to say that state-of-the-art methods usually require tedious manual adjustments before they can be utilized. The situation becomes even more challenging when the penalty term takes a more complex form, such as the elastic net or total variation (as commonly used for image processing tasks). The question of how to obtain an accurate regularized solution, with a near-optimal parameter, while ensuring low computational complexity and minimizing the need for manual intervention, is the main motivation of our work. In particular, we propose to recast the task of parameter selection into the framework of statistical learning, where one is interested in learning a function, from a training set of corrupted data, whose value is a good approximation of the optimal parameter.

Elastic net regularization.

Inspired by recent results in this spirit [11], in this paper we propose a method for learning the optimal parameter for elastic net regularization [39], which is defined by

(3)

where a hyperparameter controls the trade-off between the ℓ1 (Lasso) and the ℓ2 (Tikhonov) penalty terms. Our main motivation for considering the elastic net is that it has been shown to produce sparse models comparable to those of the Lasso, while often achieving superior accuracy on real-world and simulated data. The elastic net also overcomes the main limitations of ℓ1 minimization; namely, it encourages the grouping effect, which is relevant for many real-life applications such as microarray classification and face detection, see [10] and references therein.

In the following, we summarize the main algorithmic and theoretical developments related to the elastic net, which was first proposed by Zou and Hastie [39]. Therein the authors rewrite the elastic net functional as Lasso regularization with an augmented datum. Then, the authors apply LARS [16] for the effective reconstruction of the entire solution path, and use cross-validation for the selection of the optimal regularization parameter. Later work [10] studies the theoretical properties of (3) in the context of learning theory, analyses the underlying functional, and uses iterative soft-thresholding to compute the solution. Moreover, as the parameter choice rule, the authors provide an adaptive version of the balancing principle [26, 36]. The rule aims to balance the approximation and sample errors, which have opposite behaviour with respect to the tuning parameter. At the same time, the proposed rule requires the solution to be estimated for multiple parameter values, which makes it less applicable for high-dimensional tasks. We will rework some of the arguments from [10] for the computation of the solution, while keeping our focus on an efficient approach for parameter learning. In [25] the authors propose an active set algorithm for solving (3). Addressing the problem in the framework of classical regularization theory, the authors consider the discrepancy principle [28, 6] for determining the parameter. As for the balancing principle, the implementation of the discrepancy principle requires multiple estimations of the solution for different parameter values, and a pre-tuning of other, method-specific parameters. Moreover, it is assumed that the noise level is known, which is often not the case in practice.

The paper [27] proposes a hybrid method for tuning both parameters in a model fitting problem. The authors propose an alternating method which updates the solution using coordinate descent and then updates both parameters in one iteration. The main advantage is its efficiency, as one does not need to compute the solution for a fine grid of parameters, but only on a much coarser one. The method is similar in spirit to LARS, but has better scalability. At the same time, it requires a non-convex problem to be solved, and hence has inherent limitations. Moreover, this method is not directly applicable to our inverse-problem setting, where we do not observe changes in the same solution but rather assume that the design matrix A is fixed and each response is generated by a new original signal. In summary, we are not aware of any parameter choice rule for the elastic net that allows one to select a parameter without a priori assumptions and manual adjustments.

This paper leverages the work [11], where parameter selection is considered in the context of nonparametric regression with random design. In particular, the authors propose a fully data-driven method for the determination of the optimal parameter for Tikhonov regularization under the assumption that a training set of independent observations is available, each of them associated with an (unknown) signal. The goal is to find a prediction function, based on the given training set, which provides a good approximation to the optimal parameter for a new datum. The starting point of the method is to find an empirical proxy for the real solution by assuming that the signals are distributed over a lower-dimensional linear subspace. The authors show that the proxy obtained by applying the pseudo-inverse of A to the observation projected onto a suitable eigenspace of the empirical covariance matrix is a good proxy for the true solution, provided the training set is large enough. Then, using the proxy, one can select the regularization parameter as the minimizer of

(4)

where is the minimizer of the Tikhonov functional

The analysis and the techniques related to the construction of the proxy are independent of the choice of the optimisation scheme, whereas the selection of the parameter is defined by the regularization scheme. However, it is worth mentioning that if A is not injective, the proxy is no longer a good approximation of the true solution. For Tikhonov regularization this is not an issue since, without loss of generality, we can always assume that A is injective. Specifically, one can replace the exact solution in (2) with its projection and recall that the Tikhonov minimizer belongs to the orthogonal complement of the null space of A for all parameter values. Therefore, for a wider applicability of the suggested framework, it is important to address the construction of the proxy, and the selection of the parameter, for a wider class of problems.
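A minimal sketch of this pipeline for Tikhonov regularization is given below, assuming the proxy is formed by projecting the new observation onto the top eigenspace of the empirical covariance and applying the pseudo-inverse of A, and that the subspace dimension is known; these modelling choices are assumptions for illustration and do not reproduce the exact construction of [11].

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, d, N, sigma = 80, 50, 5, 500, 0.05
A = rng.standard_normal((m, n)) / np.sqrt(m)

# Training set: each observation comes from a new signal living in a fixed
# d-dimensional subspace (the structural assumption exploited in [11]).
basis = np.linalg.qr(rng.standard_normal((n, d)))[0]
X = basis @ rng.standard_normal((d, N))
Y = A @ X + sigma * rng.standard_normal((m, N))

# Empirical covariance of the observations and its top-d eigenspace.
Sigma_hat = Y @ Y.T / N
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
P_hat = eigvecs[:, -d:] @ eigvecs[:, -d:].T   # projection onto top-d eigenspace

# New datum and its proxy (assumed form: pseudo-inverse of the projected
# observation).
x_new = basis @ rng.standard_normal(d)
y_new = A @ x_new + sigma * rng.standard_normal(m)
x_proxy = np.linalg.pinv(A) @ (P_hat @ y_new)

# Select lambda by minimizing the surrogate loss (4) over a grid.
lambdas = np.logspace(-4, 1, 50)
surrogate = [np.linalg.norm(
    np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y_new) - x_proxy)
    for lam in lambdas]
lam_hat = lambdas[int(np.argmin(surrogate))]
```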

In this paper, we extend the framework of [11] by providing the analysis and algorithms for the elastic net, which is a non-linear method, and for non-injective operators. Moreover, we improve some of the theoretical results and develop an efficient, fully automated algorithm. To do this, we analyse our problem in two settings:

  1. simplified case: for the case A = I (corresponding to image denoising) and when the signals are independent Bernoulli random vectors. We provide a high-probability bound on the parameter error and discuss the number of samples required for optimal learning, see Proposition 3.4. We also consider the case of bounded noise, which motivates our algorithm. Though the model might be oversimplified, it captures the essence of the problem and our experiments confirm the results in more general settings.

  2. general case: for a general matrix A we provide an efficient and accurate algorithm based on gradient descent for the computation of an approximate optimal parameter. We illustrate the performance of our algorithm by comparing it to state-of-the-art parameter choice methods on a number of synthetic and imaging problems. The obtained results show that our approach has superior accuracy and often outperforms other methods from a computational perspective.

1.1 Outline

In Section 2, we describe the main ingredients of our approach. Therein we define and prove bounds on our empirical estimators, discuss minimizers of the elastic net (3), and define the loss functions that will be used for approximately optimal parameter selection. Section 3 provides the main theoretical results of the paper, showing that in a simplified setting the empirical estimator of the optimal parameter is indeed close to the true one. In Section 4 we present an efficient and accurate algorithm for the computation of an approximate optimal parameter. We discuss the performance of the parameter learning method by means of several numerical experiments on synthetic and imaging data in Section 5. Therein, we compare our method with state-of-the-art parameter selection criteria in terms of accuracy of the solution recovery, closeness to the optimal parameter, and computational time. Our focus on imaging data is on wavelet denoising, where we work on synthetic data sets, as well as a real-world brain MRI data set with heteroscedastic noise. Finally, we conclude with a brief discussion about future directions in Section 6. The Appendix contains proofs of auxiliary and technical results.

1.2 Notation

The Euclidean and the -norms of are denoted by and , respectively. The modulus function , the sign function , and the positive part function are defined component-wise for by

where for any

We denote by the canonical basis of . For a matrix M, we denote its transpose by , its Moore-Penrose pseudo-inverse by , and its spectral norm by . Furthermore, and are the range and the null space of M, respectively. For a square matrix M, we use to denote its trace. The identity matrix is denoted by and we use to denote the indicator function of a set . For any , is the rank-one operator acting on as .

A random vector is called sub-Gaussian if

The value is called the sub-Gaussian norm of , with which the space of sub-Gaussian vectors becomes a normed vector space [35]. The (non-centred) covariance of a random vector is denoted as

Finally, we write if there exists an absolute constant such that

2 Problem setting

We consider the following stochastic linear inverse problem: given a deterministic matrix , we are interested in recovering a vector from observations

(5)

where

  1. the unknown datum is a sub-Gaussian vector, such that ;

  2. there exists a subfamily of indices, with , such that

    and ;

  3. the noise is an independent sub-Gaussian vector, such that , and is the noise level.

Conditions 1 and 3 are standard assumptions on the distributions of the exact datum and the noise, ensuring that the tails decay fast; note that the normalisation conditions can always be satisfied by rescaling. Furthermore, it follows from the definitions that is the smallest subspace such that almost surely. Thus, by Assumption 2, the exact datum is -sparse almost surely and, since , it is the unique vector with this property. Define now . The following simple result for an injective A was shown in [11]; here we extend it to the general case.

Lemma 2.1.

Under Assumption 2 we have and .

Proof.

A direct computation gives

where P denotes the orthogonal projection onto . Assumption 2 says that A is injective on , and thus and have the same rank . Furthermore maps onto , so that

where , since is symmetric. ∎

2.1 Empirical estimators

Lemma 2.1 suggests that could be directly recovered if A were invertible and were known. In most practical situations, though, neither of these assumptions is satisfied: we only have access to noisy observations and A could be not only non-invertible, but also non-injective. We will address this issue by recasting the entire problem into a statistical learning framework, similar to [11]. Namely, suppose we are given observation samples such that for , where and are unknown, and let

be the empirical covariance of the observations. Standard statistical learning theory suggests that is a good approximation to provided is large enough. As a consequence, we will show that the vector space , which is spanned by the first eigenvectors of , is a good estimator of .
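The following small sketch illustrates the empirical covariance and the subspace estimator spanned by its leading eigenvectors; the true subspace, dimensions, and noise level are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, sigma = 40, 4, 0.1

# True signal subspace of dimension d inside R^m (synthetic).
U = np.linalg.qr(rng.standard_normal((m, d)))[0]

def empirical_projection(N):
    """Top-d eigenprojection of the empirical covariance of N noisy samples."""
    Y = U @ rng.standard_normal((d, N)) + sigma * rng.standard_normal((m, N))
    Sigma_hat = Y @ Y.T / N
    _, vecs = np.linalg.eigh(Sigma_hat)
    V = vecs[:, -d:]
    return V @ V.T

P_true = U @ U.T
for N in (50, 500, 5000):
    err = np.linalg.norm(empirical_projection(N) - P_true, 2)
    print(N, err)   # the spectral-norm error shrinks as N grows
```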

To justify the above claims, observe first that, since holds by Assumption 3, we have

(6)

Therefore, and have the same eigenvectors and the spectrum of is just a shift of the spectrum of by . Let be the non-zero eigenvalues of , counted with multiplicity, and let and be the eigenvalues of and , respectively. From (6) it thus follows that

(7)

Let be the (orthogonal) projection onto , which has rank due to Lemma 2.1, and let be the (orthogonal) projection onto . Using matrix concentration inequalities we will now establish the fundamental tool of our study: that is an accurate and unbiased approximation of . We distinguish between the cases of bounded and unbounded observations and improve upon the results in [11].

Lemma 2.2.

Assume that . Given , with probability greater than

(8)

provided . Furthermore, if the observations are bounded, then with probability greater than

(9)

provided .

Proof.

We will first show (8). Using Theorem 9.2.4 and Exercise 9.2.5 in [35], we have with probability greater than

(10)

where is the stable rank of . Using and , we get

(11)

Let . By (7) it follows that is the projection onto the linear span of those eigenvectors of whose corresponding eigenvalue is greater than or equal to . Using , by (11) we have

(12)

provided that . Let now be the projection onto the linear span of those eigenvectors of whose corresponding eigenvalue is greater than or equal to . As a consequence of Theorem 7.3.1 in [5], there exists an eigenvalue of such that

(13)
(14)
(15)

By (13) it follows that so that and hence

(16)

Since , the claim follows by (10).

Assume now that holds almost surely and consider a family of independent matrices

Notice that . Applying the matrix Bernstein inequality for rectangular matrices (Theorem 6.1.1 in [34]) we have, for all ,

where is a matrix variance constant independent of , , and , such that

A direct computation gives . It follows that

holds with probability greater than for every . Moreover, by an analogous argument, (12) holds provided , see Appendix A.2 for details. Thus, (9) follows by applying (16). ∎

The previous result comes with a caveat. Namely, it is only of theoretical value, as the proof implicitly assumes that either or the spectral gap is known (which informs the choice of the approximate projector ). In practice, however, a proper eigenspace can only be detected if there is a spectral gap, and if it occurs around the eigenvalue corresponding to the eigenspace we want to recover, i.e., if , where

(17)

Indeed, under this assumption the empirical covariance matrix also has a spectral gap around the desired eigenvalue, as shown by the next result.

Proposition 2.3.

Assume (17) holds. Then the empirical covariance matrix has a spectral gap at the corresponding eigenvalue, with probability greater than , provided and .

Proof.

Assume holds for . Since we get

by adding and subtracting and inside the first term. Thus, if then , and if then . For on the other hand we have . In conclusion,

holds provided . Using (10), the claim follows. ∎

It is clear that if the empirical covariance matrix will have a spectral gap around

and thus will be wrongly estimated. In many practical situations, though, the situation is not as pessimistic as this result would suggest. In other words, estimating the intrinsic dimension can be relatively easy when the singular values are properly visualized and interpreted, even though applying a rule as direct as the spectral gap would lead to an unwanted conclusion. We devote more attention to this question in Section 5.1, and suggest alternative heuristic means of estimating the intrinsic dimension .
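As an example of such a heuristic, one can pick the index of the largest relative drop in the sorted empirical eigenvalues; this is only a sketch of the kind of rule discussed in Section 5.1, not the specific procedure used there.

```python
import numpy as np

def estimate_dimension(Sigma_hat, max_dim=None):
    """Heuristic estimate of the intrinsic dimension from the largest relative
    gap in the sorted eigenvalues of the empirical covariance matrix."""
    eigvals = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1]
    if max_dim is None:
        max_dim = len(eigvals) - 1
    # Ratio of consecutive eigenvalues; the largest ratio marks the gap.
    ratios = eigvals[:max_dim] / (eigvals[1:max_dim + 1] + 1e-12)
    return int(np.argmax(ratios)) + 1
```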

We are now ready to define our empirical estimator of . Define . Note that Q is the orthogonal projector onto , whereas is the orthogonal projector onto . The empirical estimator of is defined as

(18)

For the empirical estimator satisfies the (empirical) inverse problem

(19)

In the following, we use to learn the optimal regularization parameter for elastic net minimization, though this part of the analysis is independent of the choice of the optimization scheme.

Remark 2.4.

One might think of using the empirical estimator as the sought solution, thereby completely avoiding regularization and, thus, the issue of the parameter choice. This has been partially discussed in [11], where the authors show that in some cases, especially when the training set is small or the noise level is small, its error is larger than that of the regularized solution. Moreover, as we also show in our numerical experiments, the estimator is not a good proxy for the true solution when A is not injective. In addition, it does not preserve the structure of the original signal: for example, it is, in general, not sparse, whereas the true signal is sparse. Therefore, we emphasize that we are interested in using an estimator for which the parameter error is small with high probability, and we are not interested in directly controlling the estimation error itself, which is the goal in manifold learning [4].

To begin, we observe that minimizers of the empirical problem coincide with minimizers of the original problem.

Lemma 2.5.

Let Then .

Proof.

We compute Since Q is an orthogonal projector onto it follows . Using Pythagoras’ theorem we thus have Since the second term does not depend on we get

as desired. ∎

2.2 Elastic net minimisation

From now on we focus on the parameter choice for the elastic net minimization, where , so that

(20)

The first regularization term enforces sparsity of the solution, whereas the second term adds smoothness and ensures that, in the case of highly correlated features, we can correctly retrieve the whole relevant group (failing to do so is one of the limitations of Lasso minimization). We first recall some basic facts about existence, uniqueness and sensitivity of solutions with respect to regularization parameters [25].

Lemma 2.6.

The functional is strictly convex and coercive (i.e., , for an absolute constant ). Moreover, for each parameter value the minimizer of (1) exists and is unique, and the mapping is continuous with respect to the parameter.

In the remainder of this paper, we redefine the elastic net regularized solution by recasting the regularization parameter to a bounded interval as

(21)

where and is a fixed parameter. For the solutions of (21) correspond to (20) for On the other hand, for we get , and for we define where

(22)

This definition is driven by the following observations. First, the set is non-empty (since A has finite rank). Furthermore, it was shown in [10] and [25] that in the case of elastic net minimization

(23)

In other words, the vector plays the role of the Moore-Penrose solution in linear regularization schemes [18]. By Lemma 2.6 the minimizer of (21) always exists and is unique, and the map is continuous for . Equation (23) implies that it is continuous at , and later in (32) we show that the continuity also holds at .

2.2.1 Quadratic loss functionals

To drive the selection of regularization parameters we go back to first principles and consider the following quadratic loss functionals.

Definition 2.7.

Functions , defined by

(24)

are called the true and the empirical quadratic loss, respectively. Furthermore,

(25)

are the true and the empirical optimal regularization parameters.

In view of Lemma 2.6 and the subsequent discussion, the decision to recast the regularization parameter from the set of all positive reals to a bounded interval has a clear purpose: the true and the empirical loss functions are both continuous, defined on a bounded interval, and hence achieve a minimum. Since the exact solution is unknown, the true loss is only of theoretical value. Thus, our aim is to minimize the empirical loss while ensuring that the parameter error is small.

Before going into details, let us briefly discuss some difficulties associated with elastic net minimization which need to be addressed. The elastic net admits a closed-form solution only when and, in other cases, the solution can only be approximated. Furthermore, as we will see below, the loss functionals are not globally differentiable, but only piecewise differentiable. The unavailability of a closed form of the solution and the non-smoothness of the loss functionals mean that the parameter error in general cannot be analysed in full detail. Therefore, in the following we split the analysis into two parts: a simplified case, where we show that the parameter error is small, and the general case, where for any A we provide a simple and efficient algorithm for parameter learning. Furthermore, we need to amend the empirical loss function (24) in the case when the measurement matrix A is non-injective, since cannot be reliably estimated for nonlinear methods, see Figure 1 and [17]. We follow the idea of SURE-based methods [21], which provide an unbiased estimate of by projecting the regularized solution onto .

Figure 1: Empirical and true losses for zero-mean isotropic Gaussian signal and noise. In the left panel A is injective and the empirical loss is a good proxy for the true loss. In the middle panel A is non-injective and the empirical loss does not approximate the true loss well. The right panel shows that, in the case of a non-injective matrix, the modified and projected losses behave well.
Definition 2.8.

Function , defined by

(26)

where is the orthogonal projection onto , is called the projected empirical loss. Furthermore,

(27)

is the projected empirical optimal regularization parameter.

Moreover, using the fact that we define the modified projected risk.

Definition 2.9.

Function , defined by

(28)

is called the modified projected risk. Furthermore,

(29)

is the modified empirical optimal regularization parameter.

By considering instead of we avoid the computation of the Moore-Penrose inverse , which might be costly to compute or numerically unstable if A is poorly conditioned. As we show in Section 5, and as can be seen in the right-most panel of Figure 1, using and instead of when A is non-injective leads to drastically improved performance. Projecting onto does affect the fundamental shape of the functional by making it somewhat smoother (decreasing the gradients), which we need to take into account in the numerical part of our study.
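To illustrate the role of the projection, the sketch below compares a plain and a projected empirical loss over a parameter grid. The concrete forms used here, which project the difference between the regularized solution and the proxy onto the orthogonal complement of the null space of A, are assumptions consistent with the discussion above rather than verbatim transcriptions of (26) and (28); solver(y, t) stands for any routine returning the elastic net minimizer.

```python
import numpy as np

def projector_row_space(A, tol=1e-10):
    """Orthogonal projector onto R(A^T), i.e. the complement of the null space of A."""
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    V = Vt[s > tol].T
    return V @ V.T

def empirical_losses(A, y, x_hat, solver, t_grid):
    """Plain vs projected empirical loss on a parameter grid.

    solver(y, t) is assumed to return the elastic net minimizer for parameter t;
    x_hat is the empirical proxy for the unknown signal.
    """
    P = projector_row_space(A)
    plain, projected = [], []
    for t in t_grid:
        x_t = solver(y, t)
        plain.append(np.linalg.norm(x_t - x_hat) ** 2)
        projected.append(np.linalg.norm(P @ (x_t - x_hat)) ** 2)
    return np.array(plain), np.array(projected)
```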

2.2.2 Solution via soft-thresholding

As mentioned earlier, elastic net minimization does not admit a closed-form solution in the case of a general forward matrix A. In the original paper by Zou and Hastie [39], the elastic net problem is recast as a Lasso problem with augmented data, which can then be solved by many different algorithms (e.g., the LARS method [15]). Alternative algorithms compute the elastic net minimizer directly, and are generally either of the active-set [25] or the iterative soft-thresholding type [10]. Here we adhere to iterative soft-thresholding, and rework the arguments in [10] to show that the solution to (21) can be obtained through fixed-point iterations for all . To begin, define the soft-thresholding function by

(30)

and the corresponding soft-thresholding operator , acting component-wise on vectors . The following lemma shows that the elastic net solution is the fixed point of a contractive map. The proof is included in the Appendix for the sake of completeness.

Lemma 2.10.

The solution to (21), for and , satisfies , where the map is a contraction given by

(31)

with Lipschitz constant

where , where and are the smallest and the largest singular values of the matrix A, respectively.

For , the solution is , which is consistent with (31). Furthermore, by (30) and (31), we get

(32)

Our definition of the solution at , in (22), also satisfies since is the identity and, thus, , though the map is not a contraction. In summary, the solutions are consistent with the statement of Lemma 2.6, as expected.
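A minimal implementation of this fixed-point scheme is sketched below, written in the (lambda, alpha) parameterization F(x) = ||Ax - y||^2 + lambda*alpha*||x||_1 + lambda*(1 - alpha)*||x||_2^2; this is one common convention, and the constants in the paper's map (31) may differ from the ones used here.

```python
import numpy as np

def soft_threshold(x, tau):
    """Componentwise soft-thresholding: sign(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def elastic_net_ista(A, y, lam, alpha, n_iter=500, tol=1e-8):
    """Iterative soft-thresholding for the elastic net functional
        F(x) = ||A x - y||^2 + lam*alpha*||x||_1 + lam*(1-alpha)*||x||_2^2
    (an assumed parameterization; constants may differ from (31))."""
    # Lipschitz constant of the gradient of the smooth part.
    L = 2.0 * (np.linalg.norm(A, 2) ** 2 + lam * (1.0 - alpha))
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * (A.T @ (A @ x - y) + lam * (1.0 - alpha) * x)
        x_new = soft_threshold(x - grad / L, lam * alpha / L)
        if np.linalg.norm(x_new - x) <= tol * max(np.linalg.norm(x), 1.0):
            return x_new
        x = x_new
    return x

# Example usage (illustrative dimensions):
# A = np.random.default_rng(0).standard_normal((30, 60))
# x = elastic_net_ista(A, A @ np.ones(60), lam=0.1, alpha=0.9)
```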

2.2.3 Closed form solution

As shown in [39], in the case of orthogonal design, i.e. , the solution of the elastic net minimization (1) is given by

(33)

Plugging into (33) we have

(34)
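For reference, the closed-form solution can be coded directly; the parameterization is the same assumed convention as in the iterative sketch above, so the constants may differ from those in (33) and (34).

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def elastic_net_orthogonal(A, y, lam, alpha):
    """Closed-form elastic net solution for orthogonal design (A^T A = I),
    under the assumed parameterization
        F(x) = ||A x - y||^2 + lam*alpha*||x||_1 + lam*(1-alpha)*||x||_2^2.
    Componentwise: soft-threshold A^T y, then shrink by the Tikhonov factor."""
    z = A.T @ y
    return soft_threshold(z, lam * alpha / 2.0) / (1.0 + lam * (1.0 - alpha))
```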

3 Parameter Error

Since the elastic net solution is not given in closed form for a general A, a rigorous study of the parameter error is infeasible in full generality. Therefore, we restrict our attention to simplified cases, though we emphasize that in practice our approach performs well under significantly broader model assumptions, as we demonstrate in Section 5. Consider thus the true and the empirical loss functions (24) for the case of the orthogonal design (). Without loss of generality, we can assume (otherwise redefine to be ). Let now and assume . Plugging (34) into (24) we get

Define , for . The loss function is continuous on , and differentiable on the intervals

Focusing on one interval at a time, a direct computation yields that for the minimizer of is given by

(35)

An analogous statement holds for , whereas is constant on , as argued in (32). Therefore, the minimizer of is . The empirical loss function is also continuous on and is piecewise differentiable on the same set of intervals, since they depend only on . Consequently, minimizers of are also of the form (35), where we only need to replace by .

Notice that, unless further assumptions are made, the minimizers and are not given explicitly: we still need to evaluate and at the candidate locations. To address and showcase this issue we analyse two cases: that of Bernoulli noise and that of bounded sub-Gaussian noise. This analysis will also give a theoretical intuition that drives our algorithm.
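Purely as a numerical illustration of the quantities compared in this section (and not of the explicit formulas above), the sketch below evaluates the true and the empirical loss on a grid for A equal to the identity and reports the distance between their minimizers; the proxy is replaced by a noisy stand-in, and the closed-form solver is the one assumed in Section 2.2.3.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma, alpha = 200, 0.1, 0.9

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def elastic_net_identity(y, lam, alpha):
    # Closed form for A = I under the parameterization assumed in Section 2.2.3.
    return soft_threshold(y, lam * alpha / 2.0) / (1.0 + lam * (1.0 - alpha))

# Sparse true signal, noisy observation (A = I), and a stand-in for the proxy.
x_true = np.zeros(n)
x_true[:10] = rng.standard_normal(10)
y = x_true + sigma * rng.standard_normal(n)
x_proxy = x_true + 0.3 * sigma * rng.standard_normal(n)   # hypothetical proxy

lambdas = np.logspace(-3, 1, 200)
true_loss = [np.linalg.norm(elastic_net_identity(y, l, alpha) - x_true) for l in lambdas]
emp_loss = [np.linalg.norm(elastic_net_identity(y, l, alpha) - x_proxy) for l in lambdas]
lam_true = lambdas[int(np.argmin(true_loss))]
lam_emp = lambdas[int(np.argmin(emp_loss))]
print("parameter error:", abs(lam_true - lam_emp))
```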

3.1 Bernoulli Noise

We first consider a simplified model which, however simple, still encodes the main features of the problem. In particular, let

and assume and for . Without loss of generality we can assume . It then follows that for , and otherwise. Moreover, for all . In the following, for the sake of simplicity, we consider the case when . The details regarding the general case, , are in the Appendix.

Let us first consider the true loss function. We have

The function is differentiable for all , , in which case a direct computation yields