
# New Computational and Statistical Aspects of Regularized Regression with Application to Rare Feature Selection and Aggregation

Amin Jalali Technicolor AI Lab. Email: amin.jalali@technicolor.com.    Adel Javanmard Marshall School of Business, University of Southern California. Email: ajavanma@usc.edu.    Maryam Fazel Department of Electrical Engineering, University of Washington. Email: mfazel@uw.edu.
April 2019
###### Abstract

Prior knowledge on properties of a target model often comes as discrete or combinatorial descriptions. This work provides a unified computational framework for defining norms that promote such structures. More specifically, we develop the associated tools for optimization involving such norms, given only the orthogonal projection oracle onto the non-convex set of desired models. As an example, we study a norm, which we term the doubly-sparse norm, for promoting vectors with few nonzero entries taking only a few distinct values. We further discuss how the K-means algorithm can serve as the underlying projection oracle in this case and how it can be efficiently represented as a quadratically constrained quadratic program. Our motivation for the study of this norm is regularized regression in the presence of rare features, which poses a challenge to various methods within high-dimensional statistics and in machine learning in general. The proposed estimation procedure is designed to perform automatic feature selection and aggregation, for which we develop statistical bounds. The bounds are general and offer a statistical framework for norm-based regularization. They rely on novel geometric quantities on which we elaborate as well.


Keywords: Convex geometry, Hausdorff distance, structured models, combinatorial representations, K-means, regularized linear regression, statistical error bounds, rare features.

## 1 Introduction

A large portion of estimation procedures in high-dimensional statistics and machine learning have been designed based on principles and methods in continuous optimization. In this pursuit, incorporating prior knowledge on the target model, often presented as discrete and combinatorial descriptions, has been of interest in the past decade. Aside from many individual cases that have been studied in the literature, a number of general frameworks have been proposed. For example, (Bach et al., 2013; Obozinski and Bach, 2016) define sparsity-related norms and their associated optimization tools from support-based monotone set functions. On the other hand, several unifications have been proposed for the purpose of providing estimation and recovery guarantees. A well-known example is the work of (Chandrasekaran et al., 2012) which connects the success of norm minimization in model recovery given random linear measurements to the notion of Gaussian width (Gordon, 1988). However, many of the final results of these frameworks (excluding discrete approaches such as (Bach et al., 2013)) are quantities that are hard to compute; even evaluating the norm itself can be intractable. Therefore, computational aspects of these norms and their associated quantities are often treated on a case-by-case basis. In fact, a unified framework for turning discrete descriptions into continuous tools for estimation, one that 1) provides a computational suite of optimization tools, and 2) is amenable to statistical analysis, is largely underdeveloped.

Consider a measurement model y = Xβ* + w, where X ∈ ℝ^{n×p} is the design matrix and w ∈ ℝⁿ is the noise vector. Given combinatorial descriptions of the underlying model, say β* ∈ S, in addition to y and X, much effort and attention has been dedicated to understanding constrained estimators for recovery. For example, only assuming access to the (non-convex) projection onto the set S of desired models enables devising a certain class of recovery algorithms constrained to S; Iterative Hard Thresholding (IHT) algorithms belong to this class: (Blumensath and Davies, 2008, Section 3) and (Blumensath, 2011) project onto the set of k-sparse vectors, (Jain et al., 2010, Section 2) projects onto the set of rank-r matrices, and (Roulet et al., 2017) performs 1-dimensional K-means, which is projection onto the set of models with d distinct values. However, a major subset of estimation procedures focus on norms, designed based on the non-convex structure sets, for estimation. Working with convex functions, such as norms, for promoting a structure is a prominent approach due to its flexibility and robustness. Namely, the proposed norms can be used along with different loss functions and constraints (this is in contrast to the specific constrained loss minimization setups required in IHT). In addition, the continuity of these functions allows the optimization problems to take into account points that are near (but not necessarily inside) the structure set; a soft approach to specifying the model class. The seminal work of (Chandrasekaran et al., 2012) provides guarantees for norm minimization estimation, constrained with y = Xβ or with ∥y − Xβ∥₂ ≤ δ, using the notion of Gaussian width. The Dantzig selector is another popular category of constrained estimators studied in the literature (e.g., (Chatterjee et al., 2014)) but other variations also exist (Section 7 provides a list). In analyzing all of these constrained estimators, the tangent cone at the target model with respect to the norm ball is the determining factor for recoverability.
Then, the notion of Gaussian width of such a cone (Chandrasekaran et al., 2012; Gordon, 1988) allows for establishing high-probability bounds for recovery from many random design ensembles. In a way, the Gaussian width, or a related quantity known as the statistical dimension (Amelunxen et al., 2014), is a local quantity that can be understood as an operational method for gauging the model complexity with respect to the norm and determining the minimal acquisition requirements for recovery from random linear measurements.

However, regularized estimators pose further challenges for analysis. More specifically, consider

 β̂ ≡ argmin_β  (1/2n)∥y−Xβ∥₂² + λ∥β∥ (1.1)

where λ > 0 is the regularization parameter. From an optimization theory perspective, for a fixed design and noise, (1.1) and a norm minimization problem constrained by the measurements (see (7.2) and (7.3)) are equivalent if a certain value of λ, corresponding to the constraint level, is used; meaning that the optimal solutions of these estimators will be equal. However, the mapping between these problem parameters is in general complicated (e.g., see (Aravkin et al., 2016)), which renders the aforementioned equivalence useless when studying error bounds that are expressed in terms of these problem parameters (e.g., see the bounds in Theorem 5.1 and their dependence on λ). Furthermore, in the study of expected error bounds for a family of noise vectors (or design matrices), such equivalence is in general irrelevant (e.g., fixing λ, each realization of noise will imply a different constraint level corresponding to the given value of λ). Nonetheless, a good understanding of regularized estimators with decomposable norms has been developed; see (Negahban et al., 2012; Candes and Recht, 2013; Foygel and Mackey, 2014; Wainwright, 2014; Vaiter et al., 2015) for slightly different definitions. These are norms with a special geometric structure and only a handful of examples are known (including the ℓ₁ norm and the nuclear norm). In regularization with general norms, it is possible to provide a high-level analysis, inspired by the analysis for decomposable norms, and provide error bounds; e.g., see (Banerjee et al., 2014) and follow-up works. However, the proposed bounds are in a way conceptual and no general computational guidelines for evaluating these bounds exist. In this work, we introduce a geometric quantity for gauging model complexity with respect to a norm in regularized estimation.
This quantity, accompanied by a few computational guidelines and connections to the rich literature on convex geometry, then allows for a principled approach towards evaluating the previous conceptual error bounds, leading to our final statistical characterizations for (1.1) that are sensitive to 1) norm-induced properties of the design, and 2) non-local properties of the model with respect to the norm.

A motivation behind our pursuit of a computational and statistical framework for regularization is to handle the presence of many rare features in real datasets, which has been a challenging proving ground for various methods within high-dimensional statistics, and in machine learning in general; see Section 2 for further motivation. In this work, we study an efficient estimator, namely a regularized least-squares problem, for automatic feature selection and aggregation and develop statistical bounds. The regularization, an atomic norm proposed by (Jalali and Fazel, 2013), poses new challenges for computation (even norm evaluation) and statistical analysis (e.g., non-decomposability). We extend the computational framework provided in (Jalali and Fazel, 2013) for this norm, in Section 3.4, and provide statistical error bounds in Section 8. We also establish advantages over Lasso (Section 8.2). Moreover, our estimation and prediction error bounds rely on simple geometric notions to gauge condition numbers and model complexity with respect to the norm. These bounds are quite general and go beyond regularization for feature selection and aggregation.

### 1.1 Summary of Technical Contributions

In this work, we consider regularized regression in the presence of rare features (presented in Section 2) as our main case study. In our attempt to address this problem, we develop several general results for defining norms from given combinatorial descriptions and for statistical analysis of norm-regularized least-squares, as summarized in the following:

1. We adopt an approach to defining norms from given descriptions of desired models, and provide a unified machinery to derive useful quantities associated with a norm for optimization (e.g., the dual norm, the subdifferential of the norm and its dual, the proximal mapping, projection onto the dual norm ball, etc.); see Section 3. Our approach relies on the non-convex orthogonal projection onto the set of desired models. In Section 3.4, we discuss how a discrete algorithm such as K-means clustering can be used to define a norm, namely the doubly-sparse norm, for promoting vectors with few nonzero entries taking only a few distinct values. Our results extend those of (Jalali and Fazel, 2013) to any structure.

Complementing the existing statistical analysis approaches, for least-squares regularized with any norm, we take a variational approach, through quadratic functions, to understanding norms and provide alternative error bounds that can be easier to interpret, compute, and generalize:

1. We provide a prediction error bound in terms of norm-induced aggregation measures of the design matrix for when the noise satisfies the convex concentration property or is a subgaussian random vector. We do this by making a novel use of the Hanson-Wright inequality for when the dual to the norm has a concise variational representation; Section 4. The new bounds are deterministic with respect to the design matrix, are interpretable, and allow for taking detailed information on the design matrix into account, going beyond results on well-known random ensembles which might be unrealistic in real applications.

Most of the existing estimation bounds for norm-regularized regression can be unified under the notion of decomposability; see (Negahban et al., 2012; Candes and Recht, 2013; Foygel and Mackey, 2014; Vaiter et al., 2015) for slightly different definitions. Our results, in contrast, do not rely on such an assumption:

1. In gauging model complexity with respect to the regularizer, we introduce a novel geometric measure, termed the relative diameter, which then allows for simplified derivations for restricted eigenvalue constants and prediction error bounds. More specifically, we go beyond decomposability and we provide techniques to compute such a complexity measure (Section 6). We provide calculations for a variety of norms (e.g., ordered weighted ℓ₁ norms) used in the high-dimensional statistics literature; Section 6 and Appendix F. In Section 7, we provide further insight into the notion of relative diameter and compare with existing quantities in the literature. Through illustrative examples, we showcase the sensitivity of the relative diameter to the properties of the model and the norm.

Finally, we use the aforementioned developments to design and analyze a regularized least-squares estimator for regression in the presence of rare features:

1. We propose to use doubly-sparse regularization for regression in the presence of rare features (Section 2.1). We discuss how such a choice allows for automatic feature aggregation. We use the insights and tools we develop in the paper for regression with rare features and establish the advantage of regularizing the least-squares regression with the doubly-sparse norm, given in (2.2), over Lasso, in Section 8.

Last but not least, we provide various characterizations related to a number of norms common in the high-dimensional statistics literature, such as the ordered weighted ℓ₁ norms (commonly used for simultaneous feature selection and aggregation; e.g., see (Figueiredo and Nowak, 2016)), which could be of independent interest. See Section 6.2 and Appendix F. Proofs of technical lemmas are deferred to the Appendices.

##### Notations.

Denote by ∥A∥ and ∥A∥_F the operator norm and the Frobenius norm of a matrix A. We also represent its smallest and largest singular values by σ_min(A) and σ_max(A). For a positive integer n, we denote by [n] the set {1, …, n}. For a compact set C, the polar set is denoted by C°. For a positive integer p, we denote by 𝕊^{p−1} the unit sphere in ℝᵖ, {β ∈ ℝᵖ : ∥β∥₂ = 1}. Given a set A, we denote by conv(A) the convex hull of A, i.e., the set of all finite convex combinations of members of A. Moreover, define cone(A) ≡ {γa : γ ≥ 0, a ∈ A}. In addition, given a compact set C, a point β ∈ C is an extreme point of C if β = (β₁ + β₂)/2 for β₁, β₂ ∈ C implies β₁ = β₂ = β. Denote by 1_p and 0_p the vectors of all ones and all zeros in ℝᵖ, respectively. We may drop the subscripts when clear from the context. For two vectors u, v ∈ ℝᵖ, their Hadamard (entry-wise) product is denoted by u∘v where (u∘v)_i = u_i v_i for i ∈ [p]. The unit simplex in ℝᵖ is denoted by {u ∈ ℝᵖ : u ≥ 0, ⟨1_p, u⟩ = 1}. The full unit simplex is denoted by {u ∈ ℝᵖ : u ≥ 0, ⟨1_p, u⟩ ≤ 1}. In all of this work, we assume the model β* is nonzero.

## 2 Motivation: Regularized Regression for Rare Features

Data sparsity has been a challenging proving ground for various methods. Sparse sensing matrices in the established field of compressive sensing (Berinde et al., 2008), the inherent sparsity of document-term matrices in text data analysis (Wang et al., 2010), the ubiquitous sparsity of biological data, from gut microbiota to gene sequencing data, and the sparse interaction matrices in recommendation systems, have been challenging the established methods that otherwise have provable guarantees when certain well-conditioning properties (e.g., the restricted isometry property in compressive sensing) hold. See (Yan and Bien, 2018) for further motivations.

A common approach when many rare features are present is to remove the very rare features in a pre-processing step (e.g., Treelets by (Lee et al., 2008)). This is not efficient, as it may discard a large amount of information, and better approaches are needed to make use of the rare features to boost estimation and predictive power. On the other hand, there have been efforts for establishing the success of ℓ₁ minimization in the case of certain sparse sensing matrices (e.g., see (Berinde et al., 2008; Berinde and Indyk, 2008; Gilbert and Indyk, 2010)) where gaps between their statistical requirements and information-theoretic limits exist. Combinatorial approaches for subset selection, through integer programming, have also been restricted to certain sparse design matrices to achieve polynomial-time recovery (Del Pia et al., 2018). Instead, a variety of ad-hoc methods, based on solving different optimization programs, have been proposed for going beyond sparse models and making use of rare features (Bondell and Reich, 2008; Zeng and Figueiredo, 2014), and there has been a recent interest in this problem within the high-dimensional statistics community. While some of these estimators come with a statistical theory, they may require extensive prior knowledge (Li et al., 2018; Yan and Bien, 2018) which could be expensive or difficult to gather in real applications.

### 2.1 Our Approach: Doubly-Sparse Regularization

We approach this problem through feature aggregation, but unlike previous works, we do so in an automatic fashion, at the same time as estimation. More specifically, in learning a linear model from noisy measurements, we use the model proposed by (Jalali and Fazel, 2013): we are interested in vectors that are not only sparse (to be able to ignore unnecessary features) but also have only a few distinct values, which induces a grouping among features and allows for automatic aggregation. We refer to this prior as double-sparsity and elaborate on it in the sequel. Considering the structure norm (see (Jalali and Fazel, 2013), Section 3.1, or Section 3.4) corresponding to this prior, we study a regularized least-squares optimization program in (2.2). Since the existing machinery of atomic norms (Chandrasekaran et al., 2012) does not come with tools for optimization, we develop new tools in Section 3 that can be used to efficiently compute and analyze the proposed estimator. Superior performance over the use of ℓ₁ regularization (Lasso) in the presence of rare features is showcased in Section 8.

#### 2.1.1 The Prior and the Regularization

A k-sparse vector β ∈ ℝᵖ can be expressed as a linear combination of k indicator functions for singletons in [p]; i.e., β = ∑_{i=1}^{k} c_i 1_{{j_i}} where j₁, …, j_k ∈ [p] are distinct. In contrast, we are interested in vectors that can be expressed as a linear combination of a few indicator functions using a coarse partitioning of [p]; i.e., β = ∑_{i=1}^{d} c_i 1_{G_i} where G₁, …, G_d partition [p] and d is small. Here, the c_i's can be zero; i.e., we are allowing zero to be one of the d distinct values. To combine the two priors, for two fixed values k and d, one can consider vectors β = ∑_{i=1}^{d} c_i 1_{G_i} where G₁, …, G_d ⊆ [p] are non-empty and disjoint and |G₁ ∪ ⋯ ∪ G_d| ≤ k. These are the vectors with at most k nonzero values where the top entries have at most d distinct values. Finally, to make the prior more suitable for our regression setting, we allow for arbitrary sign patterns within each part.

Given a vector β ∈ ℝᵖ, denote by β̄ the sorted version of |β| in descending order; i.e., β̄₁ ≥ β̄₂ ≥ ⋯ ≥ β̄_p. Then, we consider

 S_{k,d} ≡ {β : card(β) ≤ k, |{β̄₁, …, β̄_k}| ≤ d}; (2.1)

the vectors with at most k nonzero values whose top k absolute values take at most d distinct values. Figure 1 illustrates an example. See (Jalali and Fazel, 2013) for further detail and existing works around this idea. With the aid of the machinery presented in Section 3, we can define a norm, referred to as the doubly-sparse norm, that can help in the recovery of models from S_{k,d} in a sense characterized by our statistical error bounds. For two fixed values k and d, we refer to this norm as the k□d-norm, denoted by ∥·∥_{k□d}.
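As a concrete illustration (our own minimal sketch, not from the paper), membership in S_{k,d} can be checked directly from the definition (2.1):

```python
import numpy as np

def in_S_kd(beta, k, d, tol=1e-12):
    """Check whether beta lies in S_{k,d}: at most k nonzero entries,
    whose top-k absolute values take at most d distinct values."""
    beta = np.asarray(beta, dtype=float)
    if np.count_nonzero(np.abs(beta) > tol) > k:
        return False
    top_k = np.sort(np.abs(beta))[::-1][:k]   # the k largest absolute values
    return len(np.unique(top_k)) <= d

# (3, 2)-doubly-sparse: three nonzeros, two distinct magnitudes
print(in_S_kd([5, -5, 2, 0, 0, 0], k=3, d=2))  # True
print(in_S_kd([5, -4, 2, 0, 0, 0], k=3, d=2))  # False: three distinct magnitudes
```

Note that, per (2.1), zero counts as one of the distinct values whenever the vector has fewer than k nonzeros; the variant that does not count zero is discussed in Section 3.4.1.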

#### 2.1.2 A Statistical Analysis

Consider a measurement model y = Xβ* + w, where X ∈ ℝ^{n×p} is the design matrix and w ∈ ℝⁿ is the noise vector. We then consider the following estimator,

 β̂ ≡ argmin_β  (1/2n)∥y−Xβ∥₂² + λ∥β∥_{k□d}, (2.2)

where λ > 0 is the regularization parameter. In Section 8, we analyze (2.2) and provide prediction error bounds, namely bounds for the in-sample prediction error (1/n)∥X(β̂−β*)∥₂².

More generally, we consider regularization with any norm. In providing a prediction error bound, we show how norm-specific aggregation measures can be used to bound the regularization parameter (Section 4). For estimation error, we provide a general tight analysis through the introduction of the relative diameter (Section 5.2, Section 6, and Section 7). We make partial progress in computing the relative diameter, namely we do so for the doubly-sparse norm and its dual, but we also provide computations for some important classes of polyhedral norms to showcase possible strategies; for the ordered weighted ℓ₁ norms studied in (Figueiredo and Nowak, 2016), and for weighted ℓ₁ and ℓ∞ norms. See Section 6.2 and Appendix F for details of these computations.

#### 2.1.3 Optimization Procedures

In computing β̂ from (2.2), or more generally (1.1), one can use different optimization algorithms. While ∥·∥_{k□d} might seem complicated to even evaluate, we show in Section 3.4 that there exists an efficient procedure for computing its proximal mapping (for a definition, see Equation 3.11, and for a characterization in the case of ∥·∥_{k□d}, see Section 3.4.3). Therefore, here, we only discuss two proximal-based optimization strategies to illustrate the computational efficiency of the estimator in (2.2). The optimization program in (2.2) is unconstrained and its objective is convex and the sum of a smooth and a non-smooth term. Therefore, as we have access to the proximal mapping associated with the non-smooth part, the proximal gradient algorithm seems like a natural choice for optimization. For t = 0, 1, 2, …, we compute

 g_t = (1/n)(XᵀXβ_t − Xᵀy),  β_{t+1} = prox(β_t − η_t g_t; λη_t∥·∥) (2.3)

where η_t is the step size. The algorithm, with an appropriate choice of step size, reaches an ε-accurate solution (in prediction loss) in O(1/ε) steps. See (Parikh et al., 2014) for further details on proximal algorithms.
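The iteration (2.3) can be sketched as follows. This is our own generic illustration: soft-thresholding (the prox of the ℓ₁ norm) stands in for the prox oracle, since the doubly-sparse prox requires the machinery developed in Section 3.4; the regularization level is folded into the prox as λη.

```python
import numpy as np

def soft_threshold(v, t):
    # prox of t*||.||_1, used here as a stand-in prox oracle
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(X, y, lam, prox, n_iter=500):
    """Minimize (1/2n)||y - X b||_2^2 + lam*||b|| given a prox oracle."""
    n, p = X.shape
    eta = n / np.linalg.norm(X, 2) ** 2   # constant step 1/L, L = sigma_max(X)^2 / n
    b = np.zeros(p)
    for _ in range(n_iter):
        g = (X.T @ (X @ b - y)) / n       # gradient of the smooth part
        b = prox(b - eta * g, lam * eta)  # proximal step at level lam*eta
    return b

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_star = np.zeros(20); beta_star[:3] = 2.0
y = X @ beta_star + 0.1 * rng.standard_normal(100)
beta_hat = proximal_gradient(X, y, lam=0.05, prox=soft_threshold)
print(float(np.linalg.norm(beta_hat - beta_star)))   # small estimation error
```

Swapping `soft_threshold` for the doubly-sparse prox of Section 3.4.3 gives the estimator (2.2) with no other changes to the loop.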

As we will see later, the proximal mapping is the solution to a convex optimization program and may not admit a closed-form representation, unlike simple norms such as the ℓ₁ norm (whose proximal mapping is soft-thresholding). Therefore, it might be inevitable to work with approximate solutions. In such a case, inexact proximal methods (Schmidt et al., 2011) may be employed, which allow for a controlled inexactness in the computation of the proximal mapping (more specifically, inexactness in the objective) but provide convergence rates similar to the exact case.

Alternating Direction Method of Multipliers (ADMM) may also be used to solve (2.2), similar to the discussions in Section 6.4 of (Boyd et al., 2011) for the ℓ₁ norm. The non-trivial ingredient of such a strategy is the proximal mapping for the regularizer, which is available here.

While this paper is concerned with the regularized estimation, it is worth mentioning that the ability to compute the proximal mapping also enables solving the generalized Dantzig selector (defined in (7.1)) as discussed in (Chatterjee et al., 2014).

## 3 Projection-based Norms

Given a compact set A ⊂ ℝᵖ of desired model parameters, which is symmetric, spans ℝᵖ, and none of whose members belongs to the convex hull of the others, the atomic norm framework (Bonsall, 1991; Chandrasekaran et al., 2012) defines a norm through

 ∥β∥_A = inf{∑_{ω∈A} c_ω : β = ∑_{ω∈A} c_ω ω, c_ω ≥ 0}. (3.1)

This optimization problem is hard to solve in general and one might end up with linear programs that are difficult to solve or might have to resort to discretization (e.g., (Shah et al., 2012)) or to case-dependent reformulations (e.g., (Tang et al., 2013)).

Alternatively, one might consider the dual norm as the building block for further computations: the support function to the norm ball or to the atomic set, namely

 ∥θ∥*_A ≡ sup_{∥β∥_A≤1} ⟨β,θ⟩ = sup_{a∈A} ⟨a,θ⟩. (3.2)

Assuming ∥a∥₂ = 1 for every a ∈ A, using the above variational characterization, and dist²(θ,A) = min_{a∈A} ∥θ−a∥₂², we get

 dist²(θ,A) = 1 + ∥θ∥₂² − 2∥θ∥*_A. (3.3)

While the dual norm is 1-homogeneous, the other terms above are not, which limits the uses of this expression. As evident from the result of Proposition 3.1, homogenizing the atomic set into a scale-invariant set S provides a better object to work with. Next, we elaborate on this direction and provide a framework for defining norms that comes with a computational suite for computing various quantities associated with these norms.

Some of the material in Section 3.1 and Section 3.4 has been previously mentioned in (Jalali and Fazel, 2013) without proof and restricted to the so-called d-valued models. We generalize this framework and use it for addressing the problem of interest in Section 2.

### 3.1 Definition and Characterizations

Given a closed set S ⊆ ℝᵖ that is scale-invariant (closed with respect to scaling by any γ ∈ ℝ, which makes it symmetric with respect to the origin as well) and spans ℝᵖ, consider an associated convex set defined as

 B_S = conv{β : β ∈ S, ∥β∥₂ = 1}. (3.4)

Since B_S is a symmetric compact convex body with the origin in its interior, the corresponding symmetric gauge function, defined as

 ∥β∥_S ≡ inf{γ > 0 : β ∈ γB_S}, (3.5)

is a norm with B_S as the unit norm ball. One can view ∥·∥_S as an atomic norm whose atoms are given by the extreme points of the unit norm ball, A = ext(B_S) ⊆ S ∩ 𝕊^{p−1}. Using these atoms, we can express ∥β∥_S as in (3.1). As we will see later, ∥β∥_S ≤ 1 if and only if β ∈ B_S.

As an alternative to (3.2), Proposition 3.1 provides a way to compute the dual to this norm. Denote by

 Π(θ;S) = Π_S(θ) = argmin_β {∥θ−β∥₂ : β ∈ S}

the (non-convex) orthogonal projection onto S. Note that the projection mapping onto a non-convex set is set-valued in general. We refer to Appendix A for further details and proofs of the following statements.

###### Proposition 3.1 (The Dual Norm).

Given any closed scale-invariant set S ⊆ ℝᵖ which spans ℝᵖ, the dual norm to ∥·∥_S is given by

 ∥θ∥*_S ≡ sup_{∥β∥_S≤1} ⟨β,θ⟩ = ∥Π(θ;S)∥₂ (3.6)

where ∥Π(θ;S)∥₂ refers to the ℓ₂ norm of any member of the set Π(θ;S) and is well-defined. Moreover,

 ⟨θ, Π(θ;S)⟩ = ∥Π(θ;S)∥₂² = ∥Π(θ;S)∥_S ∥θ∥*_S (3.7)

which exhibits the pair of achieving vectors in the definition of the dual norm and yields

 (∥θ∥*_S)² = ∥θ∥₂² − dist²(θ,S). (3.8)

Figure 2 illustrates Proposition 3.1. Equation 3.7 is also known as the alignment property in the literature. In contrast with (3.3), the expression in (3.8) is 2-homogeneous in θ. With the above characterization for the dual norm, we get

 ∥β∥_S = sup{⟨β,θ⟩ : ∥Π(θ;S)∥₂ ≤ 1}. (3.9)

Since the optimal β in the definition of the dual norm in (3.6) is known to be a scaled member of Π(θ;S), we can easily characterize the subdifferential as in the following.

###### Lemma 3.2 (Subdifferential of dual norm).

The subdifferential of the dual norm at β is given by

 ∂∥β∥*_S = (1/∥Π_S(β)∥₂) conv(Π_S(β))

which in turn implies that ∥·∥*_S is differentiable at β whenever Π_S(β) is a singleton.

Proof of Lemma 3.2 is given in Appendix A.
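As a sanity check on Proposition 3.1 (our own illustration, not from the paper), take S to be the scale-invariant set of 1-sparse vectors. The projection keeps the largest-magnitude entry, the dual norm is the ℓ∞ norm, and the identities (3.6)-(3.8) can be verified numerically:

```python
import numpy as np

def project_1sparse(theta):
    # Orthogonal projection onto S = {1-sparse vectors}:
    # keep the largest-magnitude entry, zero out the rest.
    out = np.zeros_like(theta)
    i = np.argmax(np.abs(theta))
    out[i] = theta[i]
    return out

rng = np.random.default_rng(1)
theta = rng.standard_normal(6)
pi = project_1sparse(theta)

dual = np.linalg.norm(pi)                       # ||Pi(theta;S)||_2, eq. (3.6)
assert np.isclose(dual, np.max(np.abs(theta)))  # equals the l_inf norm here
# eq. (3.7), alignment: <theta, Pi> = ||Pi||_2^2
assert np.isclose(theta @ pi, dual ** 2)
# eq. (3.8): (||theta||*)^2 = ||theta||_2^2 - dist^2(theta, S)
dist2 = np.linalg.norm(theta - pi) ** 2
assert np.isclose(dual ** 2, np.linalg.norm(theta) ** 2 - dist2)
print("all identities hold")
```

The same three assertions go through verbatim for any projection oracle onto a scale-invariant set, which is what Proposition 3.1 asserts.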

While an oracle that computes the projection enables us to carry out many computations for quantities related to the structure norm (e.g., the value of ∥β∥_S, the proximal operators for the norms and squared norms, as well as projection onto the dual norm ball, as discussed in the rest of this section), some properties of the structure set can greatly simplify these computations. In the following, we consider the invariance properties of the structure (under permutations and sign changes), and in Lemma 3.6, we discuss monotonicity properties of the structure. Lemma 3.3 is not entirely new and has been discussed in the literature in one form or another.

###### Lemma 3.3 (Invariance in Projection).

Consider a closed set S ⊆ ℝᵖ, convex or non-convex, and the orthogonal projection mapping Π_S. Then,

• Provided that S is closed under a change of signs of entries (i.e., β ∈ S implies s∘β ∈ S for any sign vector s ∈ {±1}ᵖ), then Π_S(θ) = sign(θ) ∘ Π_S(|θ|) for any θ.

• Provided that S is closed under permutation of entries (i.e., β ∈ S implies Pβ ∈ S for any permutation operator P), then θ and any β ∈ Π_S(θ) have the same ordering: θ_i > θ_j implies β_i ≥ β_j for all i, j.

Proof of Lemma 3.3 is given in Appendix A.

### 3.2 Examples

In the following, we provide a few examples of structure norms, both existing and new:

• The projection of θ onto S = {γe_i : γ ∈ ℝ, i ∈ [p]}, where e_i is the i-th standard basis vector, is the set of all θ_i e_i with |θ_i| = ∥θ∥∞. The length of such projections is indeed the ℓ∞ norm, which is dual to the ℓ₁ norm.

• When S is the set of all rank-1 matrices, the projection onto S is the leading principal component, and its length is the largest singular value of the matrix, i.e., the operator norm.

• For structure norms defined based on S_{k,d}, given in (2.1), see Section 3.4. Figure 3 provides a schematic of this family of norms, for different values of k and d, as well as their dual norms.

• Consider S = {γPw : γ ∈ ℝ, P ∈ 𝒫±} for a weight vector w satisfying w₁ ≥ w₂ ≥ ⋯ ≥ w_p ≥ 0 and w ≠ 0, where 𝒫± is the set of signed permutation matrices. As established in Lemma F.10, we have

 ∥β∥_S ≡ ∥w∥₂ ⋅ ∥β∥*_w

where ∥·∥_w is the ordered weighted ℓ₁ norm associated with w. Projection onto S requires sorting the absolute values of the input vector.

• As another example, consider S = {γP : γ ∈ ℝ, P ∈ 𝒫±} where 𝒫± is the set of signed permutation matrices. Given a matrix Θ, its projection onto S can be derived from its projection onto the set of signed permutation matrices itself. Moreover, we already know efficient algorithms for finding the nearest permutation matrix (without a scaling factor γ): algorithms for solving the assignment problem, such as the Hungarian method. Lemma 3.4 establishes that these two solutions are related.

###### Lemma 3.4.

We have Π_S(Θ) = {(⟨Θ, P⟩/∥P∥_F²) P : P ∈ Π_{𝒫±}(Θ)}. In other words, one can project onto the set of signed permutation matrices and later find the correct scaling of the projected point to get Π_S(Θ).

Proof of Lemma 3.4 is given in Appendix A. The above is also helpful in using one of these projections in place of the other in greedy algorithms such as the one studied in (Tewari et al., 2011).
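A small brute-force sketch of this scaling rule (our own; the scaling ⟨Θ,P⟩/∥P∥_F² follows from the alignment property, and for larger problems the inner assignment step would be solved by the Hungarian method, e.g. `scipy.optimize.linear_sum_assignment`, rather than enumeration). For simplicity we restrict to unsigned permutations and an input for which the maximizing inner product is positive:

```python
import itertools
import numpy as np

def nearest_permutation(theta):
    # Brute-force assignment: the permutation matrix nearest in Frobenius
    # norm, equivalently the one maximizing <theta, P>.
    p = theta.shape[0]
    best, best_val = None, -np.inf
    for perm in itertools.permutations(range(p)):
        P = np.eye(p)[list(perm)]
        val = np.sum(theta * P)
        if val > best_val:
            best, best_val = P, val
    return best

def project_scaled_permutations(theta):
    # Projection onto S = {gamma * P}: project onto the permutation
    # matrices first, then rescale along that direction.
    P = nearest_permutation(theta)
    gamma = np.sum(theta * P) / np.sum(P * P)   # <theta,P> / ||P||_F^2
    return gamma * P

theta = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.0, 1.1],
                  [0.0, 0.8, 0.3]])
print(project_scaled_permutations(theta))
```

For this input the selected assignment picks the entries 0.9, 1.1, 0.8, and the projection is (2.8/3) times the corresponding permutation matrix.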

### 3.3 Quantities based on a Representation

Note that while the dual norm (or its subdifferential, characterized in Lemma 3.2) can be directly computed from the projection, computation of quantities such as the norm value via (3.9), or the objects we discuss next, namely the projection onto the dual norm ball, the proximal mapping for the norm, or the subdifferential for the norm, could greatly benefit from a representation of the projection onto the structure, which can then be plugged into the aforementioned optimization programs. For the structure considered in Section 3.4, we have access to an efficient representation for the dual norm in terms of a quadratically constrained quadratic program (QCQP).

The subdifferential of a norm is useful in devising subgradient-based algorithms and can be computed via

 ∂∥β∥ = argmax_θ {⟨β,θ⟩ : ∥θ∥* ≤ 1}. (3.10)

Alternatively, consider the proximal mapping associated with ∥·∥, which is defined as the unique solution to the following optimization program,

 prox(β; ∥·∥) ≡ argmin_θ ½∥β−θ∥₂² + ∥θ∥. (3.11)

The proximal mapping enables a wide range of optimization strategies that are commonly more efficient than subgradient-based methods; e.g., (Parikh et al., 2014). For example, in Section 2.1.3, we briefly mentioned proximal gradient descent as well as ADMM for solving the regularized least-squares problem (2.2) or (1.1), assuming an efficient routine for evaluating the proximal mapping.

The proximal mapping admits a closed-form solution for simple cases such as the ℓ₁ norm or the nuclear norm; namely, soft-thresholding. However, more generally it can be computed through projection onto the dual norm ball, namely as

 prox(β; ∥·∥) = β − argmin_θ {∥β−θ∥₂² : ∥θ∥* ≤ 1}. (3.12)

For computing (3.10) or (3.12), one may express the dual norm ball as {θ : ⟨β̃, θ⟩ ≤ 1 for all β̃ ∈ B}, where B is a set of atoms whose convex hull is the primal unit ball. Therefore, the proximal mapping may be computed through

 prox(β; ∥·∥) = β − argmin_θ {∥β−θ∥₂² : ⟨β̃,θ⟩ ≤ 1  ∀β̃ ∈ B}.

Since B may have an infinite number of elements, or exponentially many, it is not straightforward to solve such a quadratic optimization problem, especially in each iteration of another algorithm such as proximal gradient descent or ADMM described in Section 2.1.3. Therefore, a more efficient representation of the dual norm ball could enable an efficient computation of the proximal mapping, subgradients, etc.
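A quick numerical check of the Moreau-type identity (3.12) (our own, for the ℓ₁ norm, whose dual ball is the ℓ∞ ball and whose dual-ball projection is entrywise clipping) shows that it reproduces soft-thresholding:

```python
import numpy as np

def prox_l1_via_dual_ball(beta):
    # prox(beta; ||.||_1) = beta - projection onto the dual (l_inf) ball,
    # per eq. (3.12); the l_inf-ball projection is entrywise clipping.
    proj_dual = np.clip(beta, -1.0, 1.0)
    return beta - proj_dual

def soft_threshold(beta, t=1.0):
    # the known closed form of prox(.; t*||.||_1)
    return np.sign(beta) * np.maximum(np.abs(beta) - t, 0.0)

beta = np.array([2.5, -0.4, 1.0, -3.0, 0.0])
print(prox_l1_via_dual_ball(beta))   # matches soft-thresholding at level 1
assert np.allclose(prox_l1_via_dual_ball(beta), soft_threshold(beta))
```

For norms without a closed-form dual-ball projection, the same route goes through the representation (3.13) discussed next.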

##### Black-box versus Representable.

In the case of structure norms, namely ∥·∥_S, we have (by assumption) an efficient routine to evaluate the projection onto S, which allows us to check membership (feasibility) in the dual norm ball. Optimization (for (3.10) or (3.12)) given only a feasibility oracle is still not easy. However, in cases such as S_{k,d}, it is possible to derive an efficient representation for the projection onto S and the dual norm, which can then replace the dual norm ball membership constraints and yield the objects of interest (subgradients or the proximal mapping) as solutions to manageable convex optimization programs. More concretely, assume we can establish

 ∥θ∥* = min_u {f(θ,u) : (θ,u) ∈ T} (3.13)

where T is a finite-dimensional convex set and f is a convex function. Then, the proximal mapping can be expressed as

 prox(β; ∥·∥) = β − argmin_θ {∥β−θ∥₂² : f(θ,u) ≤ 1, (θ,u) ∈ T}.

Deriving a representation as in (3.13) is the main focus of Section 3.4 for S_{k,d}; such a representation is given in Lemma 3.12.

### 3.4 Doubly-sparse Norms (k□d-norm)

Here, we discuss a structure motivated by the statistical estimation problem at hand, namely regression in the presence of rare features. As we show, a fast discrete algorithm, namely the 1-dimensional K-means algorithm, can be used to define a norm for feature aggregation as well as for computing its optimization-related quantities.
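The discrete ingredient mentioned above can be made concrete with a minimal exact 1-D K-means via dynamic programming (our own sketch). It computes the projection of a vector onto the set of vectors with at most d distinct values; it clusters raw values, whereas the doubly-sparse projection of this section additionally handles signs and sparsity through the reduction principle of Section 3.4.2:

```python
import numpy as np

def kmeans_1d(values, d):
    """Exact 1-D K-means by dynamic programming over sorted, contiguous
    clusters; returns the projection onto vectors with <= d distinct values."""
    x = np.sort(values)
    n = len(x)
    pre = np.concatenate([[0.0], np.cumsum(x)])        # prefix sums
    pre2 = np.concatenate([[0.0], np.cumsum(x ** 2)])  # prefix sums of squares

    def sse(i, j):  # within-cluster squared error of x[i:j]
        s, s2, m = pre[j] - pre[i], pre2[j] - pre2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(d + 1)]   # dp[c][j]: best SSE of x[:j] in c clusters
    choice = [[0] * (n + 1) for _ in range(d + 1)]
    dp[0][0] = 0.0
    for c in range(1, d + 1):
        for j in range(1, n + 1):
            for i in range(c - 1, j):
                v = dp[c - 1][i] + sse(i, j)
                if v < dp[c][j]:
                    dp[c][j], choice[c][j] = v, i
    out = np.empty(n)
    j = n
    for c in range(min(d, n), 0, -1):  # backtrack and fill cluster means
        i = choice[c][j]
        out[i:j] = (pre[j] - pre[i]) / (j - i)
        j = i
    result = np.empty(n)
    result[np.argsort(values)] = out   # undo the sorting
    return result

print(kmeans_1d(np.array([1.0, 1.1, 5.0, 5.2]), 2))  # ≈ [1.05 1.05 5.1 5.1]
```

Optimal 1-D clusters are contiguous once the values are sorted, which is what licenses the dynamic program.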

For two fixed values k and d, the structure set S_{k,d} in (2.1) is scale-invariant and spans the ambient space. Therefore, we consider the structure norm associated to S_{k,d}, to which we refer as the k□d-norm and which we denote by ∥·∥_{k□d} (dropping the subscript when clear from the context). Specifically,

 ∥β∥_{k□d} ≡ inf{ γ > 0 : β ∈ γ B_{S_{k,d}} }, (3.14)

with . According to Proposition 3.1, we have , and in turn, . Next, we address the computational aspects.

#### 3.4.1 Examples; for Different Values of k and d

It is clear from (2.1) that the norm is monotone in k: since d is fixed, if β ∈ S_{k,d} then β ∈ S_{k+1,d}. Therefore, ∥β∥_{(k+1)□d} ≤ ∥β∥_{k□d} for any β.

###### Remark 3.5.

Note that a similar monotonicity does not hold with respect to . Consider . If then . However, if , the addition of elements to the set may increase the number of distinct values by . Therefore, for any .

However, with and , the addition of the extra zero elements does not change , and we get for any . The new definition differs from (2.1) in not counting zero as a separate value among the top entries. For example, the dual norm corresponding to is .

Nonetheless, we have . It is worth noting that for any , coincides with the k-support norm (Argyriou et al., 2012). Furthermore, Lemma F.13 (Item 1) establishes that

 ∥β∥_{k□1} = max{ (1/√k) ∥β∥₁, √k ∥β∥_∞ }. (3.15)

As a corollary, we get . See Figure 3 for a full picture for and .
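The closed form (3.15) is straightforward to evaluate; a small Python sketch (the function name is ours) illustrates it. Note that for k = 1 the expression reduces to max{∥β∥₁, ∥β∥∞} = ∥β∥₁.

```python
# Evaluating the k-box-1 norm from the closed form (3.15):
# the maximum of l1-norm / sqrt(k) and sqrt(k) times the l-infinity norm.
from math import sqrt

def norm_k_box_1(beta, k):
    l1 = sum(abs(b) for b in beta)
    linf = max(abs(b) for b in beta)
    return max(l1 / sqrt(k), sqrt(k) * linf)

# For k = 1 this is max(l1, linf) = l1, the usual l1 norm.
assert norm_k_box_1([3.0, -1.0], 1) == 4.0
```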

#### 3.4.2 The Projection and its Combinatorial Representation

Before discussing, in Lemma 3.9, the projection onto the structure set, we state a lemma establishing a reduction principle that simplifies such projections. The reduction makes use of invariance and monotonicity properties of the projection. We established the former in Lemma 3.3. For the latter, Lemma 3.6 can be thought of as an implication of Occam's razor. In simple terms, if the characteristic property that defines a structure ignores zero values, the projected vector will have its support included in the support of the original vector; there is no need to introduce new values in those places when computing the projection. Similarly, if the characteristic property treats equal values as one value, there is no need to map them to distinct values in the projection. These observations suggest that we can always work in a reduced space, considering only the non-zero entries and distinct values in our structure of interest.

###### Lemma 3.6 (Monotonicity).

Consider a closed scale-invariant set that spans . Moreover, consider any orthogonal projection . We have:

• If implies for all , then for any ; i.e., implies for any and any .

More generally, consider an orthogonal projection matrix . If (i) implies , and, (ii) , then, implies .

• If implies for all , then implies for any .

More generally, consider a pair of oblique projection matrices, i.e., and , satisfying . Assume , and that implies . Then, for any , we have .

###### Lemma 3.7.

If is sign and permutation invariant and , then for all we have whenever .

###### Lemma 3.8.

For a given , consider (where is arbitrary from ) and a permutation for which is sorted in descending order. Then

 Π(β; S_{k,d}) = { π⁻¹(θ) ∘ sign(β) : θ ∈ Π(β̄; S_{k,d}) },  Π(β̄; S_{k,d}) = { π(|θ|) : θ ∈ Π(β; S_{k,d}) }.
###### Lemma 3.9 ((Jalali and Fazel, 2013)).

The following procedure returns all of the projections of onto defined in (2.1):

• (i) project onto the set of k-sparse vectors (zero out all entries except the k entries with largest absolute values) and consider the shortened output,

• (ii) project onto the set of vectors with at most d distinct values (perform the 1-dimensional K-means algorithm, with d clusters, on the magnitudes of the shortened output and stack the corresponding centers with signs matching the original entries),

• (iii) put the new entries back in a p-dimensional vector, by padding with zeros.

Repeat this procedure when there are multiple choices in steps (i) or (ii).
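The procedure above can be sketched in Python as follows (helper names are ours; we assume, for illustration, that S_{k,d} collects k-sparse vectors whose nonzero magnitudes take at most d distinct values). Since the optimal 1-dimensional K-means clusters of sorted data are consecutive, the clustering step is carried out by brute force over consecutive partitions, which is exact for small k:

```python
# Sketch of the procedure in Lemma 3.9: (i) keep the k largest-magnitude
# entries, (ii) cluster their magnitudes into d groups by exact 1-d K-means
# (optimal clusters of sorted 1-d data are consecutive, so brute force over
# consecutive partitions is exact), (iii) pad back with zeros and signs.
from itertools import combinations

def consecutive_partitions(n, d):
    """All partitions of indices 0..n-1 into d non-empty consecutive groups."""
    for cuts in combinations(range(1, n), d - 1):
        b = (0,) + cuts + (n,)
        yield [list(range(b[i], b[i + 1])) for i in range(d)]

def project_S_kd(beta, k, d):
    p = len(beta)
    # Step (i): indices of the k entries with largest absolute value.
    top = sorted(range(p), key=lambda i: -abs(beta[i]))[:k]
    vals = [abs(beta[i]) for i in sorted(top, key=lambda i: -abs(beta[i]))]
    # Step (ii): exact 1-d K-means with d clusters on the sorted magnitudes.
    best_cost, best_centers = None, None
    for part in consecutive_partitions(k, d):
        centers, cost = {}, 0.0
        for g in part:
            c = sum(vals[i] for i in g) / len(g)
            for i in g:
                centers[i] = c
                cost += (vals[i] - c) ** 2
        if best_cost is None or cost < best_cost:
            best_cost, best_centers = cost, centers
    # Step (iii): pad back into a p-dimensional vector with the original signs.
    out = [0.0] * p
    for r, idx in enumerate(sorted(top, key=lambda i: -abs(beta[i]))):
        out[idx] = (1.0 if beta[idx] >= 0 else -1.0) * best_centers[r]
    return out

out = project_S_kd([5.0, -4.9, 1.0, 0.2, 0.0], k=3, d=2)
expected = [4.95, -4.95, 1.0, 0.0, 0.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(out, expected))
```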

We will use Equation 3.6 to compute the dual norm and further derive a combinatorial representation for it. Note that while the projection itself can be computed through K-means, we are interested in a representation for this projection which can then be used in computing other quantities, as discussed in Section 3.3.

###### Lemma 3.10.

For a given vector θ, denote by θ̄ the sorted version of |θ| in descending order, i.e., θ̄₁ ≥ θ̄₂ ≥ ⋯ ≥ θ̄_p ≥ 0. Then,

 ∥Π(θ; S_{k,d})∥₂² = max{ ∑_{i=1}^{d} (1/|I_i|) (1ᵀ θ̄_{I_i})² : (I₁, ⋯, I_d) ∈ P̄(k,d) }

where P̄(k,d) is the set of all partitions of {1, …, k} into d groups of consecutive elements. Then,

 [ (1ᵀθ̄_{I₁}/|I₁|) 1_{I₁}, ⋯, (1ᵀθ̄_{I_d}/|I_d|) 1_{I_d}, 0, ⋯, 0 ]ᵀ ∈ Π(θ̄; S_{k,d}).
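The maximization in Lemma 3.10 can be checked numerically on a small example (code and names ours): the best partition value equals the energy of the top k magnitudes minus the optimal 1-dimensional K-means cost, reflecting the identity between the squared norm of a projection and the squared norm minus the squared distance:

```python
# Numerical check of the partition maximization in Lemma 3.10: the maximal
# partition value equals the top-k energy minus the optimal 1-d K-means cost,
# i.e. the squared norm of the projection.
from itertools import combinations
from math import isclose

def consecutive_partitions(n, d):
    for cuts in combinations(range(1, n), d - 1):
        b = (0,) + cuts + (n,)
        yield [list(range(b[i], b[i + 1])) for i in range(d)]

def proj_norm_sq(theta, k, d):
    """max over partitions of sum_i (1^T tbar_{I_i})^2 / |I_i|, as in Lemma 3.10."""
    tbar = sorted((abs(t) for t in theta), reverse=True)[:k]
    return max(
        sum(sum(tbar[i] for i in g) ** 2 / len(g) for g in part)
        for part in consecutive_partitions(k, d)
    )

theta, k, d = [3.0, -2.0, 1.0, 0.5], 3, 2
tbar = sorted((abs(t) for t in theta), reverse=True)[:k]
# Optimal 1-d K-means cost over the same consecutive partitions.
kmeans_cost = min(
    sum((tbar[i] - sum(tbar[j] for j in g) / len(g)) ** 2 for g in part for i in g)
    for part in consecutive_partitions(k, d)
)
assert isclose(proj_norm_sq(theta, k, d), sum(t ** 2 for t in tbar) - kmeans_cost)
```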

Using Equation 3.6, the statement of Lemma 3.10 can be alternatively represented as

 (∥β∥⋆_{k□d})² = (∥β̄∥⋆_{k□d})² = sup_{A ∈ B̄D(k,d)} β̄ᵀ A β̄ (3.16)

where β̄ is nonnegative and non-increasing, and B̄D(k,d) is the set of block-diagonal matrices with d blocks exactly covering the first k rows and columns and zero elsewhere, where on each block of size b, all of the entries are equal to 1/b. Note that if the input is not a sorted nonnegative vector, then we need to consider BD(k,d) = { PᵀAP : A ∈ B̄D(k,d), P a signed permutation matrix }. This brings us to

 (∥β∥⋆_{k□d})² = sup_{A ∈ BD(k,d)} βᵀ A β. (3.17)

The aforementioned representations, in Lemma 3.10, Equation 3.16, and Equation 3.17, all depend on an efficient characterization of combinatorial sets such as P̄(k,d) or BD(k,d). Lemma 3.11 below shows that such a set is of exponential size, which renders direct optimization inefficient.

###### Lemma 3.11.

.

Lemma 3.11 is proved in Appendix B.

Next, we review a dynamic programming approach to reformulate the above in terms of a quadratic program.

#### 3.4.3 A Dynamic Program and a QCQP Representation

Consider a non-negative sorted vector with . A dynamic program can be used to perform the 1-dimensional K-means clustering required in the second step of the projection onto S_{k,d} (detailed in Lemma 3.9) as well as in Lemma 3.10. For example, see (Wang and Song, 2011) for how a 1-dimensional K-means clustering problem can be cast as a dynamic program. Furthermore, this dynamic program can be represented as a quadratically-constrained quadratic program (QCQP) (Jalali and Fazel, 2013), as discussed next. More specifically, the following two lemmas describe how the projection and the dual norm unit ball can be computed by solving a QCQP. See Figure 4 for an illustration related to the dynamic program.
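To illustrate, a Python sketch of such a dynamic program (variable names ours), following the recursion implicit in Lemma 3.12: ν_{s,e} records the best value of ∑ᵢ (1ᵀ segmentᵢ)²/|segmentᵢ| over partitions of the first s entries into e consecutive groups, and the result is checked against brute-force enumeration:

```python
# Sketch of the 1-d K-means dynamic program: nu[s][e] is the best value of
# sum over groups of (1^T segment)^2 / |segment| over partitions of the
# first s sorted entries into e consecutive groups; nu[k][d] is the value
# appearing in Lemma 3.12.
from itertools import combinations
from math import isclose

def dp_value(bbar, k, d):
    NEG = float("-inf")
    nu = [[NEG] * (d + 1) for _ in range(k + 1)]
    nu[0][0] = 0.0  # empty prefix, zero groups
    for e in range(1, d + 1):
        for s in range(e, k + 1):
            # Last group covers entries m..s (1-indexed); recurse on the rest.
            nu[s][e] = max(
                nu[m - 1][e - 1] + sum(bbar[m - 1:s]) ** 2 / (s - m + 1)
                for m in range(e, s + 1)
            )
    return nu[k][d]

def brute_force(bbar, k, d):
    best = float("-inf")
    for cuts in combinations(range(1, k), d - 1):
        b = (0,) + cuts + (k,)
        val = sum(sum(bbar[b[i]:b[i + 1]]) ** 2 / (b[i + 1] - b[i]) for i in range(d))
        best = max(best, val)
    return best

bbar = [4.0, 3.0, 2.5, 1.0, 0.5]
assert isclose(dp_value(bbar, 5, 3), brute_force(bbar, 5, 3))
```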

###### Lemma 3.12.

We have

 ∥Π(β̄; S_{k,d})∥₂² = min_{ν_{m,e}} { ν_{k,d} : (1/(s−m+1)) (1ᵀ β̄_{[m,s]})² ≤ ν_{s,e} − ν_{m−1,e−1}  ∀(e,m,s) ∈ T(k,d) },

where , and .

Proof for Lemma 3.12 is given in Appendix B.

###### Lemma 3.13.

For , we have

 Π(θ̄; B⋆) = argmin_u min_{ν_{m,e}} { ∥θ̄ − u∥₂² : ν_{k,d} ≤ 1, u₁ ≥ ⋯ ≥ u_p ≥ 0, (1/(s−m+1)) (1ᵀ u_{[m,s]})² ≤ ν_{s,e} − ν_{m−1,e−1}  ∀(e,m,s) ∈ T(k,d) }

which is a QCQP.

Proof for Lemma 3.13 is given in Appendix B.

The above provides us with the proximal mapping through (3.12). A QCQP such as the one above can be solved via interior point methods, among many others.

###### Remark 3.14.

The representation of the dual norm in (3.3) is through a maximization. Therefore, in replacing a dual norm constraint with this representation, we will have as many constraints as the size of the underlying combinatorial set, which leads to a semi-infinite optimization program in many cases of interest. The representation in (3.8) is also a maximization problem (norm squared minus distance squared) with possibly many constraints. However, in the case of the doubly-sparse norm, the use of (3.8) allows for a reformulation in terms of a dynamic program, which reduces the number of constraints from exponentially many to polynomially many.

## 4 Prediction Error Bound for Regularized Least-Squares

Consider a measurement model y = Xβ⋆ + w, where X is the design matrix and w is a noise vector. For any given norm ∥·∥, and not only those studied in Section 3.1, we then consider the regularized estimator in (1.1) with λ > 0 as the regularization parameter. A rather standard analysis of (1.1) yields prediction error bounds, namely bounds on (1/n)∥X(β⋆−β̂)∥₂², as well as estimation error bounds, namely bounds on ∥β̂−β⋆∥ and ∥β̂−β⋆∥₂. In this section, we review a standard prediction error bound (Lemma 4.1) and then present a novel analysis for establishing bounds on the regularization parameter, which is needed in such a prediction error bound. Estimation error bounds will be studied in Section 5, building upon the results presented here.

###### Lemma 4.1 (Prediction Error).

If , then β̂ obtained from (2.2) satisfies

 (1/n) ∥X(β⋆ − β̂)∥₂² ≤ 3λ ∥β⋆∥. (4.1)

Lemma 4.1 follows from a standard oracle inequality and is proved in Appendix C.

The prediction error bound in Lemma 4.1, and the estimation error bounds in Theorem 5.1, are conditioned on . In this section we make a novel use of the Hanson-Wright inequality to compute this bound for a broad family of noise vectors while our bounds are deterministic with respect to the design matrix. Our proof assumes a concise variational representation for the dual norm (as in (4.2)) and provides a bound in terms of novel aggregate measures of the design matrix induced by the norm (given in (4.8)). In the following, we elaborate on the variational formulation. In Section 4.1, we examine this property for structure norms (defined in Section 3.1). In Section 4.2, we provide examples of norms admitting a concise representation, and finally, in Section 4.3, we state the bounds.

##### A Concise Variational Formulation.

Any squared vector norm can be expressed in a variational form (Bach et al., 2012, Prop. 1.8 and Prop. 5.1; Jalali et al., 2017): consider any norm ∥·∥ and its dual ∥·∥⋆. Then,