Methods for Sparse and Low-Rank Recovery under Simplex Constraints

Abstract

The de facto standard approach of promoting sparsity by means of $\ell_1$-regularization becomes ineffective in the presence of simplex constraints, i.e., when the target is known to have non-negative entries summing up to a given constant. The situation is analogous for the use of nuclear norm regularization for low-rank recovery of Hermitian positive semidefinite matrices with given trace. In the present paper, we discuss several strategies to deal with this situation, from simple to more complex. As a starting point, we consider empirical risk minimization (ERM). It follows from existing theory that ERM enjoys better theoretical properties w.r.t. prediction and estimation error than $\ell_1$-regularization. In light of this, we argue that ERM combined with a subsequent sparsification step like thresholding is superior to the heuristic of using $\ell_1$-regularization after dropping the sum constraint, followed by normalization.
At the next level, we show that any sparsity-promoting regularizer under simplex constraints cannot be convex. A novel sparsity-promoting regularization scheme based on the inverse or negative of the squared $\ell_2$-norm is proposed, which avoids the shortcomings of various alternative methods from the literature. Our approach naturally extends to Hermitian positive semidefinite matrices with given trace. Numerical studies concerning compressed sensing, sparse mixture density estimation, portfolio optimization and quantum state tomography are used to illustrate the key points of the paper.

1 Introduction

In this paper, we study the case in which the parameter of interest $\beta^*$ is sparse and non-negative with known sum, i.e., $\beta^* \in \Delta^p_s = \{\beta \in \Delta^p : \lVert\beta\rVert_0 \le s\}$, where $\Delta^p = \{\beta \in \mathbb{R}^p : \beta_j \ge 0,\ j = 1,\dots,p,\ \sum_{j=1}^p \beta_j = C\}$ for a constant $C > 0$ is the (scaled) canonical simplex in $\mathbb{R}^p$, and $\lVert\beta\rVert_0 = |\{j : \beta_j \neq 0\}|$ is referred to as the $\ell_0$-norm (the cardinality of the support of $\beta$). Unlike the constant $C$, the sparsity level $s$ is regarded as unknown. The specific value of the constant $C$ is not essential; in the sequel, we shall work with $C = 1$, as for all problem instances studied herein the data can be re-scaled accordingly. The elements of $\Delta^p$ can represent probability distributions over $p$ items, proportions or normalized weights, which are quantities frequently arising in instances of contemporary data analysis. A selection is listed below.

  • Estimation of mixture proportions.

    Specific examples are the determination of the proportions of chemical constituents in a given sample or the endmember composition of pixels in hyperspectral imaging [22].

  • Probability density estimation

    , cf. [4]. Let $(\mathcal{X}, \mathcal{A}, P)$ be a probability space with $P$ having a density $f$ w.r.t. some dominating measure $\mu$. Given a sample $X_1, \dots, X_n$ from $P$ and a dictionary of densities $\{\phi_j\}_{j=1}^p$ (w.r.t. $\mu$), the goal is to find a mixture density $f_\beta = \sum_{j=1}^p \beta_j \phi_j$ well approximating $f$, where $\beta \in \Delta^p$.

  • Convex aggregation/ensemble learning.

    The following problem has attracted much interest in non-parametric estimation, cf. [37]. Let $g$ be the target in a regression/classification problem and let $\{g_j\}_{j=1}^p$ be an ensemble of regressors/classifiers. The goal is to approximate $g$ by a convex combination of $g_1, \dots, g_p$.

  • Markowitz portfolios without short positions.

    Given $p$ assets whose returns have expectation $\mu$ and covariance $\Sigma$, the goal is to distribute the total investment according to proportions $w \in \Delta^p$ such that the portfolio variance $w^\top \Sigma w$ is minimized subject to a lower bound on the expected return $\mu^\top w$.

Sparsity is often prevalent or desired in these applications.

  • In hyperspectral imaging, a single pixel usually contains few endmembers.

  • In density estimation, the underlying density may be concentrated in certain regions of the sample space.

  • In aggregation, it is common to work with a large ensemble to improve approximation capacity, while specific functions may be well approximated by few members of the ensemble.

  • A portfolio involving a small number of assets incurs lower transaction costs and is much easier to manage.

At the same time, promoting sparsity in the presence of the constraint appears to be more difficult, as $\ell_1$-regularization no longer serves this purpose. As clarified in Section 2, the naive approach of employing $\ell_1$-regularization and dropping the sum constraint is statistically not sound. The situation is similar for nuclear norm regularization and low-rank matrices that are Hermitian positive semidefinite with fixed trace as arising, e.g., in quantum state tomography [16], where the constraint set is given by $\{\Theta \in \mathbb{C}^{d \times d} : \Theta = \Theta^{\mathsf{H}},\ \Theta \succeq 0,\ \operatorname{tr}(\Theta) = 1\}$, where $^{\mathsf{H}}$ denotes conjugate transposition. Both simplex constraints and their matrix counterpart thus demand different strategies to deal with sparsity and low-rankedness, respectively. The present paper proposes such strategies that are statistically sound, straightforward to implement, adaptive in the sense that the sparsity level (resp. the rank in the matrix case) is not required to be known, and that work with a minimum amount of hyperparameter tuning.
The problem outlined above is nicely discussed in [27]. In that paper, the authors consider the sparsity level $s$ to be known and suggest dealing with the constraint set by projected gradient descent based on a near-linear time algorithm for computing the projection. This approach can be seen as a natural extension of iterative hard thresholding (IHT). It has clear merits in the matrix case, where significant computational savings are achieved in the projection step. On the other hand, it is a serious restriction to assume that $s$ is known. Moreover, the method has theoretical guarantees only under the restricted isometry property (RIP, [6]) and a certain choice of the step size depending on RIP constants, which are not accessible in practice.

[43] suggest the regularizer $\beta \mapsto 1/\lVert\beta\rVert_\infty$ to promote sparsity on $\Delta^p$ and show that in the case of squared loss, the resulting non-convex optimization problem can be reduced to second-order cone programs. In practice, however, particularly when combined with tuning of the regularization parameter, the computational cost quickly becomes prohibitive.

Relevant prior work also includes [29] and [49], who discuss the problem of this paper in the context of latent variable representations (PLSA, [18]) for image and bag-of-words data. [29] proposes a so-called pseudo-Dirichlet prior corresponding to the log-penalty in [7]. [49] suggest using the Shannon entropy as regularizer. Both approaches are cumbersome regarding optimization due to the singularity of the logarithm at the origin, which is commonly handled by adding a constant to the argument, thereby introducing one more hyperparameter; we refer to Section 4.4 below for a more detailed discussion.

A conceptually different approach is pursued in [21]. Instead of the usual loss + $\ell_1$-norm formulation, with the $\ell_1$-norm arising as the convex envelope of the $\ell_0$-norm on the unit $\ell_\infty$-ball, the authors suggest working with the convex envelope of loss + $\ell_0$-norm. A major drawback is computational cost, since evaluating the convex envelope already entails solving a convex optimization problem.

Outline and contributions of the paper. As a preliminary step, we provide a brief analysis of high-dimensional estimation under simplex constraints in Section 2. Such analysis provides valuable insights when designing suitable sparsity-promoting schemes. An important observation is that empirical risk minimization (ERM) and elements of $\Delta^p$ contained in a “high confidence set” for $\beta^*$ (a construction inspired by the Dantzig selector of [6]) already enjoy nice statistical guarantees, among them adaptation to sparsity under a restricted strong convexity condition weaker than that in [36]. This situation does not seem to be properly acknowledged in the work cited in the previous paragraph. As a result, methods have been devised that may not even achieve the guarantees of ERM. By contrast, we here focus on schemes designed to improve over ERM, particularly with respect to sparsity of the solution and support recovery. As a basic strategy, we consider simple two-stage procedures, thresholding and re-weighted regularization on top of ERM, in Section 3.

Alternatively, we propose a novel regularization-based scheme in Section 4 in which the inverse (or negative) of the squared $\ell_2$-norm serves as a relaxation of the $\ell_0$-norm on $\Delta^p$. This approach naturally extends to the matrix case (positive semidefinite Hermitian matrices of trace one) as discussed in Section 5. On the optimization side, the approach can be implemented easily by DC (difference of convex) programming [42]. Unlike other popular forms of concave regularization such as the SCAD, capped $\ell_1$ or MCP penalties [58], no extra parameter besides the regularization parameter needs to be specified. For hyperparameter selection, we consider a generic BIC-type criterion ([48]; [23]) with the aim of achieving correct model selection, respectively rank selection in the matrix case. The effectiveness of both the two-stage procedures and the regularization-based approach is demonstrated by several sets of numerical experiments covering applications in compressed sensing, density estimation, portfolio optimization and quantum state tomography. All proofs can be found in the appendix.

For the convenience of readers, we here gather essential notation. We denote by $\lVert \cdot \rVert_q$, $q \ge 0$, the usual $\ell_q$-norm or the Schatten $q$-norm depending on the context, and by $\langle \cdot, \cdot \rangle$ the usual Euclidean inner product. We use $|\cdot|$ for the cardinality of a set. We write $I(\cdot)$ for the indicator function. We denote by $\{e_j\}_{j=1}^p$ the canonical basis of $\mathbb{R}^p$. For a closed convex set $C$, $\Pi_C$ denotes the Euclidean projection on $C$. For functions $f$ and $g$, we write $f \lesssim g$ and $f \gtrsim g$ to express that $f \le c\, g$ respectively $f \ge c\, g$ for some constant $c > 0$. We write $f \asymp g$ if both $f \lesssim g$ and $f \gtrsim g$. We also use the Landau symbols $O(\cdot)$ and $o(\cdot)$.

2 Simplex constraints in high-dimensional statistical inference: basic analysis

Before designing schemes promoting sparsity under the constraint $\beta \in \Delta^p$, it is worthwhile to derive basic performance bounds and to establish adaptivity to underlying sparsity when only simplex constraints are used for estimation, without explicitly enforcing sparse solutions. Note that the constraint $\beta \in \Delta^p$ is stronger than the $\ell_1$-ball constraint $\lVert\beta\rVert_1 \le 1$. As a consequence, it turns out that ERM enjoys properties known from the analysis of (unconstrained) $\ell_1$-regularized estimation, including adaptivity to sparsity under certain conditions. This already sets a substantial limit on what can be achieved in addition by sparsity-promoting schemes.

We first fix some notation. Let $Z_1,\dots,Z_n$ be i.i.d. copies of a random variable $Z$ following a distribution $P$ on some sample space $\mathcal{Z}$. Let further $\ell$ be a loss function such that, for each $z \in \mathcal{Z}$, $\beta \mapsto \ell(z,\beta)$ is convex and differentiable. For $\beta \in \Delta^p$, $R(\beta) = \mathbb{E}[\ell(Z,\beta)]$ denotes the expected risk and $R_n(\beta) = \frac{1}{n}\sum_{i=1}^n \ell(Z_i,\beta)$ its empirical counterpart. The goal is to recover the minimizer $\beta^*$ of the expected risk, which is assumed to satisfy $\beta^* \in \Delta^p_s$.

Besides ERM, which yields a minimizer of the empirical risk over $\Delta^p$, our analysis simultaneously covers all elements of the set

for a suitably chosen parameter as specified below. The construction of this set is inspired by the constraint set of the Dantzig selector [6], which is extended to general loss functions in [31]. Elements of this set will be shown to have performance comparable to that of the ERM solution. The set need not be convex in general; for squared loss it becomes a convex polyhedron. It is non-empty as long as the parameter is chosen sufficiently large. In many settings of interest (cf. [31]), the required condition can be shown to hold with high probability.

2.1 Excess risk

The first result bounds the excess risk of the ERM solution and of points in the high-confidence set; in the sequel, we consider an arbitrary element of that set.

The excess risk of ERM and of points in the high-confidence set can thus be bounded by controlling the supremum of the empirical process over all $\beta \in \Delta^p$ whose $\ell_1$-distance from $\beta^*$ is at most a given radius. This supremum is well-studied in the literature on $\ell_1$-regularization. For example, for linear regression with fixed or random sub-Gaussian design and sub-Gaussian errors, as well as for Lipschitz loss (e.g., logistic loss), suitable bounds on this supremum are available [52].

Combining these ingredients, Proposition ? yields a bound on the excess risk of ERM and of points in the high-confidence set. As a result, ERM and finding a point in that set constitute persistent procedures in the sense of [14].

2.2 Adaptation to sparsity

Proposition ? does not entail further assumptions on the loss or the design. In the present subsection, however, we suppose that $\beta^* \in \Delta^p_s$ and that the empirical risk obeys a restricted strong convexity (RSC) condition defined as follows: for the next result, we require the empirical risk to be strongly convex over a restricted set of directions around $\beta^*$.

Condition ? is an adaptation of a corresponding condition employed in [36] for the analysis of (unconstrained) $\ell_1$-regularized ERM. Our condition here is weaker, since the RSC condition in [36] is imposed over a larger set. We are now in a position to state a second set of bounds.

Invoking the above bound and choosing the parameter accordingly, we recover the familiar rates for the squared $\ell_2$-error and the $\ell_1$-error, respectively. Combining the bounds on the $\ell_1$-error with Proposition ?, we obtain a corresponding bound on the excess risk.

The above rates are known to be minimax optimal for the parameter set $\Delta^p_s$ and squared loss [55]. Under the RSC condition, there hence does not seem to be much room for improving over ERM and the high-confidence-set approach as far as the $\ell_1$-error, $\ell_2$-error and excess risk are concerned. An additional plus of ERM is that it does not depend on any hyperparameter.

3 Two-stage procedures

While the ERM solution has appealing adaptation properties with regard to underlying sparsity, the number of its non-zero entries may be significantly larger than the sparsity level $s$. Note that the bound of Proposition ? already yields approximate support recovery as long as the smallest non-zero entry of $\beta^*$ is sufficiently large. If the aim is to construct an estimator achieving exact support recovery, i.e., whose support coincides with that of $\beta^*$, the ERM solution needs to be further sparsified by a suitable form of post-processing. We here consider two common schemes, thresholding and weighted $\ell_1$-regularization. The latter is often referred to as the “adaptive lasso” [60]. Specifically, we define

where $I(\cdot)$ denotes the indicator function and $w = (w_j)_{j=1}^p$ is a vector of non-negative weights. We here restrict ourselves to the common choice $w_j = 1/\widehat{\beta}_j$ if $\widehat{\beta}_j \neq 0$ and $w_j = +\infty$ otherwise (in which case the corresponding entry is forced to zero), $j = 1,\dots,p$, with $\widehat{\beta}$ denoting the ERM solution.
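A minimal sketch of the two post-processing operations in code; the function names and the renormalization after thresholding are our own choices, only the weight convention follows the text above:

```python
import numpy as np

def hard_threshold(beta_hat, tau):
    """Stage two, variant (a): zero out entries of the ERM solution not exceeding tau and
    renormalize so that the output lies on the simplex again (renormalization is our assumption)."""
    beta = np.where(beta_hat > tau, beta_hat, 0.0)
    s = beta.sum()
    return beta / s if s > 0 else beta

def adaptive_weights(beta_hat, inf_value=1e12):
    """Stage two, variant (b): weights for weighted l1-regularization over the simplex,
    w_j = 1/beta_hat_j on the support of the ERM solution; entries outside the support are
    effectively excluded via a very large finite weight."""
    w = np.full_like(beta_hat, inf_value, dtype=float)
    nz = beta_hat > 0
    w[nz] = 1.0 / beta_hat[nz]
    return w
```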

A third approach is to first ignore the unit sum constraint, so that $\ell_1$-regularization has a sparsity-promoting effect, and to then divide the output by its sum as a simple way to satisfy the constraint. Altogether, the two stages are as follows.

An alternative to the second step is to compute the Euclidean projection of the output on $\Delta^p$. From the point of view of optimization, this third approach has some advantages. Non-negativity constraints alone tend to be easier to handle than simplex constraints. For projected gradient-type algorithms, the projection on the constraint set becomes trivial. Moreover, coordinate descent is applicable, as non-negativity constraints do not couple several variables (whereas simplex constraints do). Coordinate descent is one of the fastest algorithms for sparse estimation [13], in particular for large $p$. On the other hand, from a statistical perspective, it is advisable to prefer the constrained formulation, since it incorporates all given constraints into the optimization problem, which leads to a weaker condition and eliminates the need to specify the regularization parameter appropriately. Indeed, taking a large value of the regularization parameter in order to obtain a highly sparse solution increases the bias and may lead to false negatives. In addition, the unconstrained approach may also lead to false positives if the “irrepresentable condition” [59] is violated. Our experimental results (cf. Section 6) confirm that it has considerably larger estimation error than ERM.
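For reference, a short sketch contrasting simple renormalization with Euclidean projection onto the simplex; the projection uses the standard sort-and-threshold routine, which is one possible implementation and not taken from the paper:

```python
import numpy as np

def normalize(beta):
    """Rescale a non-negative vector so that it sums to one."""
    return beta / beta.sum()

def project_simplex(v):
    """Euclidean projection of an arbitrary vector v onto the unit simplex,
    computed with the classical sort-and-threshold algorithm."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

# The two operations generally give different points of the simplex:
beta = np.array([0.6, 0.3, 0.05, 0.0])   # non-negative, sums to 0.95
print(normalize(beta))                    # simple rescaling keeps zeros at zero
print(project_simplex(beta))              # projection shifts all entries equally
```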

Model selection. In this paragraph, we briefly discuss a data-driven approach for selecting the threshold and the regularization parameter in the two-stage procedures when the aim is support recovery. For the threshold it suffices to consider a finite set of candidate values, and for the regularization parameter we likewise consider a finite grid. We first compute the corresponding solutions and then perform model selection on the candidate models induced by their support sets. Model selection can either be done by means of a hold-out dataset or an appropriate model selection criterion like the RIC in the case of squared loss [12]. To be specific, consider the case of squared loss and suppose that the noise variance $\sigma^2$ is known.

Then, for a candidate support set, the RIC is defined as

Note that minimizing the RIC with respect to the candidate support is equivalent to solving an $\ell_0$-norm-regularized least squares problem with a specific regularization parameter. This approach is known to be model selection consistent in a high-dimensional setting [23]. This implies that as long as $\ell_0$-norm-regularized least squares with the above choice of the regularization parameter achieves support recovery, the same is achieved by thresholding, provided the indices of the $s$ largest coefficients of $\widehat{\beta}$ equal the support of $\beta^*$. The same is true for weighted $\ell_1$-regularization under an analogous condition. In fact, the approach is not specific to thresholding/weighted $\ell_1$-regularization and applies to any other method yielding candidate support sets indexed by a tuning parameter.
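A sketch of the resulting selection step, assuming the RIC takes its textbook form $\mathrm{RSS}(S)/\sigma^2 + 2|S|\log p$ for squared loss (the paper's exact normalization may differ) and that candidate supports have already been collected:

```python
import numpy as np

def ric(y, X, support, sigma2):
    """RIC for a candidate support S, here RSS(S)/sigma^2 + 2*|S|*log(p), with RSS computed
    from an ordinary least squares refit on S."""
    p = X.shape[1]
    S = np.flatnonzero(support)
    if S.size == 0:
        rss = float(y @ y)
    else:
        coef, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        rss = float(np.sum((y - X[:, S] @ coef) ** 2))
    return rss / sigma2 + 2.0 * S.size * np.log(p)

def select_by_ric(y, X, candidate_supports, sigma2):
    """Return the candidate support (e.g., induced by a grid of thresholds or regularization
    parameters) with the smallest RIC value."""
    scores = [ric(y, X, s, sigma2) for s in candidate_supports]
    return candidate_supports[int(np.argmin(scores))]
```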

If $\sigma^2$ is unknown, it has to be replaced by an estimator satisfying suitable conditions [23]. We refer to [50] for suggestions of specific estimators.

4 Regularization with the negative $\ell_2$-norm

A natural concern about ERM (optionally followed by a sparsification step) is that possible prior knowledge about sparsity is not incorporated into estimation. The hope is that if such knowledge is taken into account, the guarantees of Section 2 can be improved upon even further. In particular, one may be in a position to weaken the RSC condition significantly.

It turns out that no sparsity-promoting regularizer on $\Delta^p$ can be convex. To see this, note that if a regularizer is sparsity-promoting, it should assign strictly smaller values to the vertices of $\Delta^p$ (which are maximally sparse) than to its barycentre (which is maximally dense), i.e.,

where $e_1, \dots, e_p$ is the standard basis of $\mathbb{R}^p$. However, this contradicts convexity of the regularizer, since by Jensen’s inequality the value at the barycentre is at most the average of the values at the vertices.

4.1 Approach

For $\beta \in \Delta^p$, consider the quantity $1/\lVert\beta\rVert_2^2$. It can be seen as a “robust” measure of sparsity. One has $1/\lVert\beta\rVert_2^2 \le \lVert\beta\rVert_0$, with equality holding iff $\beta$ is constant on its support. By “robustness” we here mean that $1/\lVert\beta\rVert_2^2$ is small for vectors that have few entries of large magnitude while the number of non-zero elements may be as large as $p$. Using that $\lVert\beta\rVert_2^2 \le \lVert\beta\rVert_\infty \lVert\beta\rVert_1 = \lVert\beta\rVert_\infty$ yields the alternative lower bound $1/\lVert\beta\rVert_\infty$. Altogether, we have $1/\lVert\beta\rVert_\infty \le 1/\lVert\beta\rVert_2^2 \le \lVert\beta\rVert_0$.
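Both lower bounds follow from a single application of the Cauchy–Schwarz respectively Hölder inequality to $1 = \sum_{j=1}^p \beta_j$; for completeness,

$$
1 = \Big(\sum_{j=1}^{p} \beta_j\Big)^{2} \le \lVert\beta\rVert_0 \, \lVert\beta\rVert_2^{2}
\;\;\Longrightarrow\;\; \frac{1}{\lVert\beta\rVert_2^{2}} \le \lVert\beta\rVert_0,
\qquad
\lVert\beta\rVert_2^{2} \le \lVert\beta\rVert_\infty \lVert\beta\rVert_1 = \lVert\beta\rVert_\infty
\;\;\Longrightarrow\;\; \frac{1}{\lVert\beta\rVert_\infty} \le \frac{1}{\lVert\beta\rVert_2^{2}}.
$$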

The map $\beta \mapsto \lVert\beta\rVert_\infty$ is proposed as a sparsity-promoting regularizer on $\Delta^p$ in [43]. The associated lower bound $1/\lVert\beta\rVert_\infty$ on $\lVert\beta\rVert_0$ is looser than the bound $1/\lVert\beta\rVert_2^2$ underlying the map $\beta \mapsto \lVert\beta\rVert_2^2$ advocated in the present work. Both maps are sparsity-promoting on $\Delta^p$, as indicated by Figure ?.

Figure: Contours of $\beta \mapsto \lVert\beta\rVert_2^2$ (left) and $\beta \mapsto \lVert\beta\rVert_{\infty}$ (right) on $\Delta^3$.

From the above discussion, we conclude that finding points of large $\ell_2$-norm is a reasonable surrogate for finding sparse solutions. This leads us to propose the following modifications of ERM and of the high-confidence-set formulation, respectively.

Note the correspondence between these two formulations on the one hand, and the lasso respectively the Dantzig selector on the other hand.

Regarding the regularizer, it may appear more consistent to use the inverse $1/\lVert\beta\rVert_2^2$ in place of the negative $-\lVert\beta\rVert_2^2$ in light of the lower bound above. Ultimately, this is a matter of parameterization. While $\lVert\beta\rVert_0$ is the canonical measure of sparsity, $1/\lVert\beta\rVert_2^2$ is another one, and the former is lower bounded by the latter. We prefer the negative over the inverse for computational reasons: the resulting optimization problem is a “difference of convex” (DC) program [42] and hence more amenable to optimization; see the next subsection for details.

The constrained formulation is also a DC program if the constraint set is convex. Note that for that formulation, minimizing the negative $\ell_2$-norm is equivalent to minimizing the inverse $\ell_2$-norm.

4.2 Least squares denoising

In order to show that the negative $\ell_2$-norm, when combined with simplex constraints, promotes exactly sparse solutions, we elaborate on the regularized formulation in the simple setup of denoising. Let $y_i = \beta_i^* + \varepsilon_i$, $i = 1, \dots, p$, where $\beta^* \in \Delta^p_s$ and the $\varepsilon_i$ represent random noise. We consider squared loss, which yields the optimization problem

It turns out that this problem can be recast as the Euclidean projection of a re-scaled version of $y$ on $\Delta^p$, where the scaling factor is a function of the regularization parameter $\lambda$. Based on this re-formulation, one can derive conditions on $\lambda$ and the noise under which the solution achieves support recovery.
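To illustrate the re-formulation, assume the empirical risk is taken as $\frac{1}{2}\lVert y - \beta \rVert_2^2$ (the paper's normalization may differ by a constant factor, which only rescales $\lambda$). Completing the square gives, for $\lambda < 1/2$,

$$
\tfrac{1}{2}\lVert y - \beta \rVert_2^2 - \lambda \lVert \beta \rVert_2^2
= \Big(\tfrac{1}{2} - \lambda\Big) \Big\lVert \beta - \tfrac{y}{1 - 2\lambda} \Big\rVert_2^2 + \text{const},
$$

so the minimizer over $\Delta^p$ is the Euclidean projection of the inflated vector $y/(1-2\lambda)$ onto the simplex, which sets small entries exactly to zero.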

In particular, for a suitable choice of $\lambda$, the required lower bound on the smallest non-zero entry of $\beta^*$ scales with the noise level. For the sake of reference, we note the corresponding scaling in the Gaussian sequence model (cf. [20]).

The denoising problem can be seen as a least squares regression problem in which the design matrix equals the identity. For general design matrices, the analysis becomes harder, in particular because the optimization problem may be neither convex nor concave. In the latter case, the minimum is attained at one of the vertices of $\Delta^p$.

4.3 Optimization

Both formulations are non-convex in general. Even more, maximizing the Euclidean norm over a convex set is NP-hard in general [39]. To get a handle on these two problems, we exploit the fact that both objectives are in DC form, i.e., can be represented as a difference $g - h$ with $g$ and $h$ both convex. Linearizing $h$ at a given point yields a convex majorant of the objective that is tight at that point. Minimizing the majorant and repeating yields an iterative procedure known as the concave-convex procedure (CCCP, [56]) that falls into the more general framework of majorization-minimization (MM) algorithms [28]. When applied to the two formulations, this approach yields Algorithm ?.
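A minimal sketch of the resulting MM/CCCP iteration for the regularized ERM formulation with squared loss; the solver for the convex subproblem (SciPy's SLSQP), the tolerances and the function names are our choices, not prescribed by the paper:

```python
import numpy as np
from scipy.optimize import minimize

def cccp_neg_l2(X, y, lam, outer_iter=50, tol=1e-8):
    """DC/CCCP iteration for  min_{beta in simplex} (1/(2n))*||y - X beta||^2 - lam*||beta||_2^2.
    The concave part -lam*||beta||^2 is linearized at the current iterate; the resulting
    convex subproblem over the simplex is solved with an off-the-shelf solver."""
    n, p = X.shape
    beta = np.full(p, 1.0 / p)                    # feasible starting point (barycentre)
    bounds = [(0.0, None)] * p
    cons = ({"type": "eq", "fun": lambda b: np.sum(b) - 1.0},)
    for _ in range(outer_iter):
        g_lin = -2.0 * lam * beta                 # gradient of the concave part at beta
        obj = lambda b, g=g_lin: 0.5 * np.sum((y - X @ b) ** 2) / n + g @ b
        jac = lambda b, g=g_lin: X.T @ (X @ b - y) / n + g
        res = minimize(obj, beta, jac=jac, bounds=bounds,
                       constraints=cons, method="SLSQP")
        new_beta = res.x
        if np.sum(np.abs(new_beta - beta)) < tol: # no further progress: stop
            beta = new_beta
            break
        beta = new_beta
    return beta
```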

For the second part of Algorithm ? to be practical, it is assumed that the underlying constraint set is convex. The above algorithms can be shown to yield strict descent until convergence to a limit point satisfying the first-order optimality condition of the respective problem. This is the content of the next proposition.

An appealing feature of Algorithm ? is that solving each sub-problem in the repeat step only involves minor modifications of the computational approaches used for ERM, respectively for finding a feasible point of the high-confidence set.

When selecting the regularization parameter by means of a grid search, we suggest solving the associated problem instances from the smallest to the largest value of the parameter, using the solution of the current instance as initial iterate for the next one. For the smallest value, we recommend using the ERM solution, respectively any point of the high-confidence set, as initial iterate. Running Algorithm ? for the constrained formulation has the advantage that all iterates are contained in the constraint set and thus enjoy at least the statistical guarantees derived in Section 2. According to our numerical results, however, it is the regularized ERM formulation that achieves better performance (cf. Section 6).

4.4 Comparison to other regularization schemes

There exist a variety of other regularization schemes one may want to consider for the given problem, in particular those already mentioned in the introduction. The present subsection provides a more detailed overview on such alternatives, and we justify why we think that our proposal offers advantages.
Iterative Hard Thresholding [27]. This is an iterative algorithm whose iterates are given by the relation

where the projection operator onto the sparse simplex can be evaluated in near-linear time [27]. As pointed out above, $s$ is typically not known. Numerical results suggest that the approach is sensitive to over-specification of $s$ (cf. Section 6). A second issue is the choice of the step sizes. [27] consider a constant step size based on the Lipschitz constant of the gradient, as normally used for projected gradient descent with convex constraint sets [1], as well as a strategy developed in [26]. Neither of these has been supported by theoretical analysis. For squared loss, convergence guarantees can be established under the restricted isometry property (RIP) and a constant step size depending on RIP constants, which are not accessible in practice.
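For concreteness, a sketch of such an iteration; the projection onto the sparse simplex is computed here by keeping the $s$ largest entries and projecting them onto the $s$-dimensional simplex, which is one standard construction for this projection (step size and function names are our choices):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the unit simplex (same sort-based routine as in Section 3)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def project_sparse_simplex(v, s):
    """Projection onto {b in simplex : ||b||_0 <= s}: keep the s largest entries of v and
    project them onto the s-dimensional simplex; all other entries are set to zero."""
    idx = np.argsort(v)[::-1][:s]
    out = np.zeros_like(v, dtype=float)
    out[idx] = project_simplex(v[idx].astype(float))
    return out

def iht_simplex(X, y, s, step, n_iter=500):
    """Projected gradient iteration for squared loss under the sparse simplex constraint."""
    n, p = X.shape
    beta = project_sparse_simplex(np.full(p, 1.0 / p), s)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = project_sparse_simplex(beta - step * grad, s)
    return beta
```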
Inverse $\ell_\infty$-regularization [43]. [43] consider regularized risk estimation with the inverse of the $\ell_\infty$-norm as regularizer, that is

As discussed above and illustrated by Figure ?, this function is sparsity-promoting on $\Delta^p$. While it is non-convex, the problem can be reduced to convex optimization problems (one for each coordinate, [43]). This is an appealing property, but the computational effort required to solve these convex optimization problems is out of proportion compared to most other approaches considered herein. In noiseless settings, computation is more manageable since the regularization parameter does not need to be tuned. In our numerical studies, the performance is inferior to that of the two formulations proposed in the present work.
Entropy regularization [49]. For $\beta \in \Delta^p$, consider the Shannon entropy $H(\beta) = -\sum_{j=1}^p \beta_j \log \beta_j$ with the convention $0 \log 0 = 0$. While $H$ is a natural measure of sparsity on $\Delta^p$, there is a computational issue that makes it less attractive. It is not hard to show that the subdifferential [45] of the (convex) negative entropy $-H$ is empty at any point on the boundary of $\Delta^p$, i.e., precisely at sparse vectors. Because of this, approaches that make use of (sub-)gradients of the regularizer, such as projected (sub-)gradient descent or DC programming, cannot be employed. As already mentioned, coordinate descent is not an option given the simplex constraints. A workaround would be the use of $-\sum_{j=1}^p \beta_j \log(\beta_j + \epsilon)$ for some $\epsilon > 0$. This leads to an extra parameter in addition to the regularization parameter, which we consider a serious drawback. In fact, $\epsilon$ needs to be tuned, as values very close to zero may negatively affect convergence of the iterative algorithms used for optimization.
$\ell_q$-regularization for $q \in (0,1)$. The regularizer $\beta \mapsto \sum_{j=1}^p \beta_j^q$ for $q \in (0,1)$ suffers from the same issue as the entropy regularizer: the subdifferential is empty at all points on the boundary of $\Delta^p$. Apart from that, it is not clear how close $q$ needs to be to zero so that appropriately sparse solutions are obtained.
Log-sum regularization [7]. The regularizer $\beta \mapsto \sum_{j=1}^p \log(\beta_j + \epsilon)$ for $\beta \in \Delta^p$ and $\epsilon > 0$ is suggested in [7]. When restricted to $\Delta^p$, this regularizer can be motivated from a Bayesian viewpoint in terms of a specific prior on $\Delta^p$ called pseudo-Dirichlet in [29]. The regularizer is concave and continuously differentiable and can hence be tackled numerically via DC programming in the same manner as in Algorithm ?. Again, the choice of $\epsilon$ may have a significant effect on the result and on convergence and thus needs to be made with care. Small values of $\epsilon$ lead to a close approximation of the $\ell_0$-norm. While this is desired in principle, optimization (e.g., via Algorithm ?) becomes more challenging as the regularizer approaches a discontinuity at zero.
Minimax concave penalty [57] and related regularizers. The minimax concave penalty (MCP) is coordinate-wise separable and can be written as

where $\lambda > 0$ and $\gamma > 0$ are parameters. Since $\sum_{j=1}^p \beta_j = 1$ for $\beta \in \Delta^p$, the linear part of the MCP is constant under simplex constraints, so that only the concave quadratic part and the flat part matter. Using a re-parameterization of $(\lambda, \gamma)$, we obtain the following reduced form.

Under simplex constraints, the remaining parameter can be restricted to a bounded range. Note that whenever the flat part is never reached by the entries of $\beta \in \Delta^p$, the MCP boils down to the negative $\ell_2$-regularizer suggested in the present work. On the other hand, as the flat part sets in ever closer to zero, the MCP behaves like a multiple of the $\ell_0$-norm. The MCP is concave and continuously differentiable, so that the approach of Algorithm ? can be applied. Clearly, by optimizing the choice of the additional parameter, the MCP has the potential to improve over the negative $\ell_2$-regularizer, which results as a special case. Substantial improvement is not guaranteed though and entails additional effort for tuning the extra parameter.
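For reference, the MCP in its textbook parameterization and its reduction on the simplex (where $\sum_j \beta_j = 1$ and $0 \le \beta_j \le 1$) read as follows; the parameterization used in [57] and in the re-parameterized form above may differ by constants:

$$
P_{\lambda,\gamma}(t) =
\begin{cases}
\lambda t - \dfrac{t^2}{2\gamma}, & 0 \le t \le \gamma\lambda,\\[4pt]
\dfrac{\gamma\lambda^2}{2}, & t > \gamma\lambda,
\end{cases}
\qquad
\sum_{j=1}^{p} P_{\lambda,\gamma}(\beta_j) = \lambda - \frac{1}{2\gamma}\,\lVert\beta\rVert_2^2
\quad \text{whenever } \gamma\lambda \ge \max_j \beta_j .
$$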

The SCAD [10] and capped $\ell_1$-regularization [58] follow a design principle similar to that of the MCP. All three are coordinate-wise separable and resemble the $\ell_1$-norm for values close to zero and flatten out for larger values, so as to serve as a better proxy of the $\ell_0$-norm than the $\ell_1$-norm. A parameter (like the second parameter of the MCP) controls what exactly is meant by “close to zero” respectively “large”. Note the conceptual difference between this class of regularizers and negative $\ell_2$-regularization: members of the former have been designed as concave approximations to the $\ell_0$-norm at the level of a single coordinate, whereas the latter has been motivated as a proxy for the $\ell_0$-norm on the simplex.
Convex envelope [21]. An approach completely different from those discussed so far can be found in [21]. A common way of motivating $\ell_1$-regularization is that the $\ell_1$-norm is the convex envelope (also known as the bi-conjugate, [45]), i.e., the tightest convex minorant, of the $\ell_0$-norm on the unit $\ell_\infty$-ball. Instead of using the convex envelope of only the regularizer, [21] suggest that one may also use the convex envelope of the entire regularized risk. The authors argue that this approach is suitable for promoting sparsity under simplex constraints. Moreover, the resulting optimization problem is convex, so that computing a global optimum appears to be feasible. However, it is in general not possible to derive a closed-form expression for the convex envelope of the entire regularized risk. In this case, already evaluating the convex envelope at a single point requires solving a convex optimization problem via an iterative algorithm. It is thus not clear whether there exist efficient algorithms for minimizing the resulting convex envelope.

5 Extension to the matrix case

As pointed out in the introduction, there is a matrix counterpart to the problem discussed in the previous sections, in which the object of interest is a low-rank Hermitian positive semidefinite matrix of unit trace. This set of matrices covers the density matrices of quantum systems [38]. The task of reconstructing such density matrices from so-called observables (e.g., noisy linear measurements) is termed quantum state tomography [40]. In the past few years, quantum state tomography based on Pauli measurements has attracted considerable interest in mathematical signal processing and statistics [16].
Specifically, the setup that we have in mind throughout this section is as follows. Let $\mathbb{H}^d$ be the Hilbert space of complex Hermitian $d \times d$ matrices with inner product $\langle A, B \rangle = \operatorname{tr}(AB)$, and let henceforth $\lVert \cdot \rVert_q$, $q \ge 0$, denote the Schatten $q$-‘norm’ of a Hermitian matrix, defined as the $\ell_q$-norm of its eigenvalues. In particular, $\lVert \cdot \rVert_0$ denotes the number of non-zero eigenvalues, or equivalently, the rank. We suppose that the target $\Theta^*$ is contained in the set of Hermitian matrices that are additionally positive semidefinite, of unit trace and of rank at most $r$. In low-rank matrix recovery, the Schatten 1-norm (typically referred to as the nuclear norm) is commonly used as a convex surrogate for the rank of a matrix [44]. Since the nuclear norm is constant over this set, one needs a different strategy to promote low-rankedness under that constraint. In the sequel, we carry over our treatment of the vector case to the matrix case. The analogies are rather direct; at certain points, however, the extension to the matrix case may yield additional complications, as detailed below. To keep matters simple, we restrict ourselves to the setup in which the measurements are such that

with additive noise terms. Equivalently,

where $\mathcal{X}$ is the linear operator collecting the measurements, i.e., $(\mathcal{X}(\Theta))_i = \langle A_i, \Theta \rangle$, $i = 1,\dots,n$, for measurement matrices $A_i \in \mathbb{H}^d$. We consider squared loss, for which the empirical risk reads

Basic estimators. As basic estimators, we consider the empirical risk minimizer over the constraint set, as well as any point in the set

where $\mathcal{X}^*$ is the adjoint of $\mathcal{X}$. Both estimators show adaptation to the rank of $\Theta^*$ under the following restricted strong convexity condition. For a Hermitian matrix $\Theta$, let $\mathbb{T}(\Theta)$ denote the tangent space at $\Theta$ of the manifold of matrices of rank $\operatorname{rank}(\Theta)$, and let $P_{\mathbb{T}}$ denote the projection on a subspace $\mathbb{T}$ of $\mathbb{H}^d$.

This RSC condition is weaker than the corresponding condition employed in [35], which in turn is weaker than the matrix RIP condition [44]. The next statement parallels Proposition ?, asserting that the constraint set alone is strong enough to take advantage of low-rankedness.

Obtaining solutions of low rank. While the ERM solution $\widehat{\Theta}$ may have low estimation error, its rank can by far exceed the rank of $\Theta^*$, even though the extra non-zero eigenvalues of $\widehat{\Theta}$ may be small. The simplest approach to obtaining solutions of low rank is to apply thresholding to the spectrum of $\widehat{\Theta} = \sum_{j=1}^{d} \widehat{\lambda}_j \widehat{u}_j \widehat{u}_j^{\mathsf{H}}$ (the r.h.s. representing the usual spectral decomposition), that is, to set to zero all eigenvalues $\widehat{\lambda}_j$ below a threshold. Likewise, one may consider the following analog of weighted $\ell_1$-regularization:

for non-negative weights as in the vector case. Note that the matrix of eigenvectors is kept fixed at the second stage; the optimization is only over the eigenvalues. Alternatively, one may think of optimization over the full matrix with a regularizer defined in terms of its eigenvalues in decreasing order. However, from the point of view of optimization this poses difficulties, possible non-convexity (depending on the weights) in particular.
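A minimal sketch of the spectral thresholding step (whether the trace is renormalized afterwards is left to the caller, as it is not prescribed above):

```python
import numpy as np

def spectral_threshold(Theta_hat, tau):
    """Hard-threshold the spectrum of a Hermitian estimate: eigenvalues not exceeding tau are
    set to zero while the eigenvectors are kept fixed, as described above."""
    evals, evecs = np.linalg.eigh(Theta_hat)
    evals = np.where(evals > tau, evals, 0.0)
    return (evecs * evals) @ evecs.conj().T
```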

Regularization with the negative $\ell_2$-norm. One more positive aspect of the regularization scheme proposed in Section 4 is that it allows a straightforward extension to the matrix case, including the algorithm used for optimization (Algorithm ?). By contrast, for regularization with the inverse $\ell_\infty$-norm, which can be reduced to convex optimization problems in the vector case, no such reduction seems to be possible in the matrix case. The matrix analogs of the two formulations are given by

Algorithm ? can be employed for optimization mutatis mutandis. In the vector case and for squared loss, the two formulations are comparable in terms of computational effort: each minimization problem inside the ‘repeat’ loop becomes a quadratic respectively a linear program with a comparable number of variables/constraints. In the matrix case, however, the regularized ERM formulation appears preferable, as the sub-problems are directly amenable to a proximal gradient method. By contrast, the constraint set of the Dantzig-type formulation requires a more sophisticated approach.
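The main primitive needed for such a proximal/projected gradient treatment is the projection onto the set of trace-one positive semidefinite matrices, which reduces to projecting the spectrum onto the simplex; a sketch (implementation choices are ours):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the unit simplex (as in the vector-case sketches)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def project_density_matrices(A):
    """Projection of a Hermitian matrix A onto {Theta : Theta psd, trace(Theta) = 1}:
    diagonalize and project the eigenvalues onto the simplex."""
    evals, evecs = np.linalg.eigh(A)
    return (evecs * project_simplex(evals)) @ evecs.conj().T
```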

Denoising. Negative $\ell_2$-regularization in combination with the constraint set enforces solutions of low rank, as exemplified here in the special case of denoising of a real-valued symmetric matrix contaminated by Gaussian noise. Specifically, the sampling operator here equals the symmetric vectorization operator, that is

The following proposition makes use of a result in random matrix theory due to [41].

In particular, for a suitable choice of the regularization parameter, the required lower bound on the smallest non-zero eigenvalue of $\Theta^*$ is proportional to the noise level of the problem, as follows from the proof of the proposition.

Alternative regularization schemes. The approaches of Section 4.4 can in principle all be extended to the matrix case by defining corresponding regularizers in terms of the spectrum of a positive semidefinite matrix. For example, the Shannon entropy becomes the von Neumann entropy [38]. It is important to note that in [25], the negative of the von Neumann entropy is employed without constraining the target to be contained in the set of trace-one positive semidefinite matrices; the negative von Neumann entropy still ensures positive semidefiniteness and boundedness of the solution. Schatten $q$-regularization with $q < 1$ for low-rank matrix recovery is discussed in [34]. In a recent paper, [17] analyze statistical properties of the counterparts of MCP and SCAD in the matrix case. For details, we refer to the cited references and the references therein; our reasoning for preferring negative $\ell_2$-regularization persists.

6 Empirical results

We have conducted a series of simulations to compare the different methods considered herein and to provide additional support for several key aspects of the present work. Specifically, we study compressed sensing, least squares regression, mixture density estimation, and quantum state tomography based on Pauli measurements in the matrix case. The first two of these only differ by the presence respectively absence of noise. We also present a real data analysis example concerning portfolio optimization for NASDAQ stocks based on weekly price data from 03/2003 to 04/2008.

6.1 Compressed sensing

We consider the problem of recovering $\beta^* \in \Delta^p_s$ from a small number of random linear measurements $y_i = \langle x_i, \beta^* \rangle$, $i = 1, \dots, n$, where the $x_i$ are standard Gaussian and $n < p$. In short, $y = X\beta^*$ with $X$ having the $x_i^\top$ as its rows. Identifying $\beta^*$ with a probability distribution on $p$ items, we may think of the problem as recovering such a distribution from expectations of random functions of the items. We here show the results for a fixed dimension $p$ and a range of values of $n$ and $s$ (cf. Figure 1). The target is generated by selecting its support uniformly at random, drawing the non-zero entries at random and normalizing subsequently. This is replicated a fixed number of times for each configuration of $(n, s)$.

The following approaches are compared for the given task, assuming squared loss.
’Feasible set’: Note that ERM here amounts to finding a point $\beta \in \Delta^p$ with $X\beta = y$. The output is used as initial iterate for ’L2’, ’weighted L1’, and ’IHT’ below.
’L2’: $\ell_2$-norm maximization over the feasible set just described, i.e., the constrained formulation of Section 4.

’Pilanci’: The method of [43] that maximizes the $\ell_\infty$-norm over the feasible set.
’weighted L1’: Weighted $\ell_1$-norm minimization (cf. Section 3) over the feasible set.
’IHT’: Iterative hard thresholding under simplex constraints [27]. Regarding the step size used for gradient projection, we use the method in [26], which empirically turned out to be superior to a constant step size. ’IHT’ is run with the correct value of $s$ and is hence given an advantage.
Results. Figure 1 visualizes the fraction of exact recoveries over the replications. A general observation is that the simplex constraint is powerful enough to reduce the required number of measurements considerably compared to standard $\ell_1$-minimization without constraints. At this point, we refer to [8], who gave a precise asymptotic characterization of this phenomenon in a “proportional growth” regime, i.e., with $s$, $n$ and $p$ growing proportionally. When solving the feasibility problem, one does not explicitly exploit sparsity of the solution (even though the constraint implicitly does). Enforcing sparsity via ’Pilanci’, ’IHT’ or ’L2’ further improves performance. The improvements achieved by ’L2’ are the most substantial and persist throughout all sparsity levels. ’weighted L1’ does not consistently improve over the solution of the feasibility problem.

Figure 1: Contour plots of the empirical relative frequencies of exact recovery in dependence of the number of measurements $n$ (horizontal axis) and the sparsity level $s$ (vertical axis). The left and right plots show the contour levels 0.75 and 0.99, respectively. Note that the smaller the area to the left of and above the curve, the better the performance.

6.2 Least squares regression

We next consider the Gaussian linear regression model $y = X\beta^* + \varepsilon$ with the rows of $X$ as in the previous subsection. Put differently, the previous data-generating model is changed by an additive noise component. The target is generated as before, with the change that the subvector corresponding to the support is projected onto a set whose entries are bounded away from zero, thereby ensuring a sufficiently strong signal, with a parameter controlling the signal strength relative to the noise level $\sigma$. The following approaches are compared.
’ERM’: Empirical risk minimization.
’Thres’: ’ERM’ followed by hard thresholding (cf. Section 3).
’L2-ERM’: Regularized ERM with negative $\ell_2$-regularization. For the regularization parameter $\lambda$, we consider a grid of 100 logarithmically spaced points whose upper end is the value at which the optimization problem becomes concave (determined by the maximum eigenvalue of the Gram matrix of the design). Note that beyond this value the minimizer must be a vertex of $\Delta^p$, i.e., the solution is maximally sparse, and it hence does not make sense to consider even larger values of $\lambda$. When computing the solutions, we use a homotopy-type scheme in which, for each $\lambda$, Algorithm ? is initialized with the solution for the previous $\lambda$, using the output of ’ERM’ as initialization for the smallest value of $\lambda$.
’L2-D’: $\ell_2$-norm maximization over the high-confidence set, with its parameter set according to the noise level defined above. Algorithm ? is initialized with the output of ’ERM’ provided it is feasible. Otherwise, a feasible point is computed by linear programming.
’weighted L1’: The approach of Section 3. Regarding the regularization parameter, we follow [53] for its scaling with the noise level and the dimension, up to a multiplicative constant, and try 100 logarithmically spaced values between 0.1 and 10 for that constant.
’IHT’: As above, again with the correct value of $s$. We perform a second set of experiments, though, in which $s$ is over-specified by different factors (1.2, 1.5, 2) in order to investigate the sensitivity of the method w.r.t. the choice of the sparsity level.
’L1’: The normalization approach, i.e., dropping the unit sum constraint and normalizing the output of the non-negative $\ell_1$-regularized estimator. We use the regularization parameter recommended in the literature, cf. e.g. [36].
’oracle’: ERM given knowledge of the support of $\beta^*$.
For ’Thres’, ’L2-ERM’ and other methods for which multiple values of a hyperparameter are considered, hyperparameter selection is done by minimizing the RIC as defined in Section 3, after evaluating each support set returned for a specific value of the hyperparameter.

Figure: Upper panel: average estimation errors $\lVert\widehat{\theta} - \beta^* \rVert_2$ ($\log_{10}$ scale) in dependence of $n$ over 50 trials for selected values of $s$; here, $\widehat{\theta}$ is a placeholder for any of the estimators under consideration. Middle and lower panels: contour plots of the average Matthews correlation in dependence of $n$ (horizontal axis) and $s$ (vertical axis) for the contour levels 0.7, 0.8, 0.9, 0.95. Note that the smaller the area between the lower left corner of the plot and a contour line of a given level, the better the performance.

The results are summarized in Figures ? and ?. Turning to the upper panel of Figure ?, the first observation is that ’L1’ yields noticeably higher estimation errors than ’ERM’; the latter yields a reduction in error roughly by a constant factor. A further reduction in error of about the same order is achieved by several of the above methods. Remarkably, the basic two-stage methods, thresholding and weighted $\ell_1$-regularization, for the most part outperform the more sophisticated methods. Among the two methods based on negative $\ell_2$-regularization, ’L2-ERM’ achieves better performance than ’L2-D’. We also investigate success in support recovery by comparing the estimated and the true support for each of the considered estimators by means of the Matthews correlation coefficient (MCC), defined by

$\mathrm{MCC} = \big(\mathrm{TP}\cdot\mathrm{TN} - \mathrm{FP}\cdot\mathrm{FN}\big) \big/ \sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}$, with TP, FN, etc. denoting true positives, false negatives, etc. The larger the criterion, which takes values in $[-1,1]$, the better the performance. The two lower panels of Figure ? depict the MCCs in the form of contour plots, split by method. The results are consistent with those for the $\ell_2$-errors. The performance of ’weighted L1’ improves on, and that of ’Thres’ is on par with, the performance of ’IHT’, which is provided the true sparsity level. Figure ? reveals that this is a key advantage, since the performance of ’IHT’ drops sharply as the sparsity level is over-specified by an increasing extent.

Figure: Sensitivity of ’IHT’ w.r.t. the choice of $s$. The plots display error curves of IHT run with the correct value of $s$ as appearing in Figure ?, as well as with over-specification of $s$ by the factors 1.2, 1.5, 2. The drop in performance is substantial: for $2s$, the improvement over ERM (here used as a reference) is only minor.

6.3 Density estimation

Let us recall the setup from the corresponding bullet in Section 1. For simplicity, we here suppose that the observations are i.i.d. random variables with density $f_{\beta^*} = \sum_{j=1}^p \beta_j^* \phi_j$, where $\beta^* \in \Delta^p_s$ and $\{\phi_j\}_{j=1}^p$ is a given collection of densities. Specifically, we consider univariate Gaussian densities $\phi_j = N(\mu_j, \sigma_j^2)$, where $(\mu_j, \sigma_j)$ contains mean and standard deviation, $j = 1, \dots, p$. As an example, one might consider a grid of locations combined with several standard deviations per location, so that the dictionary consists of all such combinations. This construction provides more flexibility compared to usual kernel density estimation, where the locations equal the data points, a single bandwidth is used, and the coefficients are all equal. For large $p$, sparsity in terms of the coefficients is common, as a specific target density can typically be well approximated by an appropriate subset of the dictionary of small cardinality.
As in [4], we work with the empirical risk

and minimize it over $\Delta^p$, where for densities $f$ and $g$ we write $\langle f, g \rangle = \int f g \, d\mu$, such that $\lVert f_\beta \rVert^2 = \beta^\top G \beta$ with $G = (\langle \phi_j, \phi_k \rangle)_{j,k=1}^p$.
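A sketch of how this quadratic empirical risk can be assembled for a Gaussian dictionary, assuming the $L_2$-type risk $\beta^\top G \beta - 2\beta^\top \widehat{h}$ with $G_{jk} = \langle \phi_j, \phi_k \rangle$ and $\widehat{h}_j = \frac{1}{n}\sum_i \phi_j(X_i)$ (our reading of the construction above; function names are hypothetical):

```python
import numpy as np
from scipy.stats import norm

def gaussian_dictionary_risk(X, locs, scales):
    """Assemble the quadratic empirical risk beta -> beta' G beta - 2 beta' h for a dictionary
    of univariate Gaussian densities phi_j = N(locs[j], scales[j]^2); G has the closed form
    G[j, k] = N(locs[j] - locs[k]; 0, scales[j]^2 + scales[k]^2)."""
    mu, sd, X = np.asarray(locs), np.asarray(scales), np.asarray(X)
    G = norm.pdf(mu[:, None] - mu[None, :],
                 scale=np.sqrt(sd[:, None] ** 2 + sd[None, :] ** 2))
    h = norm.pdf(X[None, :], loc=mu[:, None], scale=sd[:, None]).mean(axis=1)
    risk = lambda beta: beta @ G @ beta - 2.0 * beta @ h
    return risk, G, h
```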
In our simulations, we let the sample size, the dimension and the sparsity level vary. The locations are generated sequentially, each being selected at random subject to an upper bound on the ‘correlations’ $\langle \phi_j, \phi_k \rangle / (\lVert\phi_j\rVert\,\lVert\phi_k\rVert)$ for all pairs corresponding to different locations. An upper bound away from one is needed to ensure identifiability of $\beta^*$ from finite samples. Data generation, the methods compared, and the way they are run are almost identical to the previous subsections. Slight changes are made for the support generation (still uniformly at random, but it is ruled out that any location is selected twice), for the signal strength, and for hyperparameter selection. For the latter, a separate validation data set (also of size $n$) is generated, and hyperparameters are selected so as to minimize the empirical risk on the validation data.
Results. Figure ? confirms once again that making use of the simplex constraints yields markedly lower error than $\ell_1$-regularization followed by normalization [4]. ’L2-ERM’ and ’weighted L1’ perform best, improving over ’IHT’ (which is run with knowledge of $s$).

6.4 Portfolio Optimization

We use a data set available from http://host.uniroma3.it/docenti/cesarone/datasetsw3_tardella.html containing the weekly returns of stocks in the NASDAQ index collected between 03/2003 and 04/2008 (264 weeks altogether). For each stock, the expected return is estimated as the mean return over the first four years (208 weeks). Likewise, the covariance of the returns is estimated as the sample covariance of the returns of the first four years. Given these estimates $\widehat{\mu}$ and $\widehat{\Sigma}$, portfolio selection (without short positions) is based on the optimization problem

where $\tau \ge 0$ is a parameter controlling the trade-off between return and variance of the portfolio. Assuming that $\widehat{\mu}$ has a unique maximum entry, $\bar{\tau}$ is defined as the smallest value of $\tau$ such that the corresponding solution has exactly one non-zero entry, equal to one at the position of the maximum of $\widehat{\mu}$. As observed in [3], the solution tends to be sparse already because of the simplex constraint. Sparsity can be further enhanced with the help of the strategies discussed in this paper, treating the above objective as the empirical risk. We here consider ’L2-ERM’, ’weighted L1’, ’Thres’ and ’IHT’ for a grid of values of the regularization parameter (’L2-ERM’ and ’weighted L1’) respectively the sparsity level (’Thres’ and ’IHT’). The solutions are evaluated by computing the Sharpe ratios (mean return divided by the standard deviation) of the selected portfolios on the return data of the last 56 weeks, which were left out when computing $\widehat{\mu}$ and $\widehat{\Sigma}$.
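A sketch of the portfolio selection and evaluation steps, assuming the mean-variance objective $w^\top \widehat{\Sigma} w - \tau\, \widehat{\mu}^\top w$ over the simplex (our reading of the trade-off formulation above; for the sparsity-promoting variants one would run the respective procedures of Sections 3 and 4 on top of this objective):

```python
import numpy as np
from scipy.optimize import minimize

def markowitz_simplex(mu, Sigma, tau):
    """Long-only mean-variance portfolio over the simplex:
    minimize w' Sigma w - tau * mu' w subject to w >= 0, sum(w) = 1."""
    mu, Sigma = np.asarray(mu), np.asarray(Sigma)
    p = mu.size
    cons = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)
    obj = lambda w: w @ Sigma @ w - tau * (mu @ w)
    jac = lambda w: 2.0 * Sigma @ w - tau * mu
    res = minimize(obj, np.full(p, 1.0 / p), jac=jac,
                   bounds=[(0.0, None)] * p, constraints=cons, method="SLSQP")
    return res.x

def sharpe_ratio(weights, returns):
    """Hold-out Sharpe ratio: mean portfolio return divided by its standard deviation
    (returns has shape (n_weeks, n_assets))."""
    r = returns @ weights
    return r.mean() / r.std()
```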

Figure: Sharpe ratios of the portfolios selected by ’L2-ERM’, ’weighted L1’, ’Thres’ and ’IHT’ on the hold-out portion of the NASDAQ data set in dependence of different choices of the regularization parameter/sparsity level (to allow for joint display, we use the $\ell_2$-norm as measure of sparsity on the horizontal axis). Left panel: $\tau = 10^{-4}$, right panel: $\tau = 5 \cdot 10^{-3}$. The results of ’Thres’ and ’IHT’ are essentially indistinguishable and are hence not plotted separately for better readability. Note that points that are too far apart with respect to the horizontal axis are not connected by lines.

Results. Figure ? displays the Sharpe ratios of the portfolios returned by these approaches in dependence of the $\ell_2$-norms of the solutions corresponding to different choices of the regularization parameter respectively the sparsity level, and for the two values of $\tau$ given above. One observes that promoting sparsity is beneficial in general. The regularization-based methods ’L2-ERM’ and ’weighted L1’ differ from ’IHT’ and ’Thres’ (whose results are essentially indistinguishable) in that the former two yield comparatively smooth curves. ’L2-ERM’ achieves the best Sharpe ratios over a wide range of $\ell_2$-norms for both values of $\tau$.

6.5 Quantum State Tomography

We now turn to the matrix case of Section 5. The setup of this subsection is based on the measurement model above, where the measurement matrices are chosen uniformly at random from the (orthogonal) Pauli basis of $\mathbb{H}^d$ (here, $d = 2^b$ for some integer $b$). For $b = 1$, the Pauli basis of the $2 \times 2$ Hermitian matrices is given by the following four matrices:

For general $b$, the Pauli basis of $2^b \times 2^b$ Hermitian matrices is constructed as the $b$-fold tensor product of the above matrices.