
High-Dimensional Learning under Approximate Sparsity: A Unifying Framework for Nonsmooth Learning and Regularized Neural Networks

Hongcheng Liu

Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611, liu.h@ufl.edu

Yinyu Ye

Department of Management Science and Engineering, Stanford University, Stanford, CA 94305, yyye@stanford.edu

March 1, 2019

High-dimensional statistical learning (HDSL) has been widely applied in data analysis, operations research, and stochastic optimization. Despite the availability of multiple theoretical frameworks, most HDSL theories stipulate the following two conditions, which are sometimes overly critical: (a) the sparsity, and (b) the restricted strong convexity (RSC). This paper generalizes both conditions via the use of the folded concave penalty (FCP); we show that, for an M-estimation problem where (i) the (conventional) sparsity is relaxed into the approximate sparsity and (ii) the RSC is completely absent, the FCP-based regularization leads to poly-logarithmic sample complexity: the size of the training data is only required to be poly-logarithmic in the problem dimensionality. This finding allows us to further understand two important paradigms much less discussed formerly: the high-dimensional nonsmooth learning and the (deep) neural networks (NN). For both problems, we show that the poly-logarithmic sample complexity can be maintained. Furthermore, via integrating the NN with the FCP, the excess risk of a stationary point to the training formulation for the NN is strictly monotonic with respect to the solution’s suboptimality gap, providing the first theoretical evidence for the empirically observed consistency between the generalization performance and the optimization quality in training an NN.

Key words: Neural network, folded concave penalty, high-dimensional learning, support vector machine, nonsmooth learning, restricted strong convexity


This paper considers high-dimensional statistical learning (HDSL), which is to estimate a large number of parameters using few samples, under the following setups and assumptions to be imposed hereafter. Let , for all and , be a sequence of i.i.d. random samples, where for some integer is the corresponding support. For convenience, we let . Consider a measurable, deterministic function , for some integer , and define as the statistical loss with respect to the th sample given a vector of parameters . Then is referred to as the empirical risk function given samples . Let , with expectation over the (unknown) distribution of , be the population version of . Assume that the underlying model is parameterized in the vector of true parameters, , which satisfies The HDSL problem of consideration is then how to recover from the knowledge only of the samples , for , and the loss function , under high-dimensional settings (namely, when ). Particularly, we follow Bartlett et al (2006), Koltchinskii (2010), and Clémençon et al (2008) in considering the excess risk, , as the measure of the recovery quality, a.k.a., the generalization error, of an estimator . We further assume that (a) for some , it holds that , where is the -norm; (b) the empirical function is continuously differentiable and its partial derivatives obey Lipschitz continuity with constant ; that is,

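$$\left| \left[\frac{\partial L_n(\beta, z)}{\partial \beta_j}\right]_{\beta = \tilde{\beta} + \delta \cdot e_j} - \left[\frac{\partial L_n(\beta, z)}{\partial \beta_j}\right]_{\beta = \tilde{\beta}} \right| \le U_L \cdot |\delta|, \tag{1}$$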

for all , , , almost every , and for some constant .

While the enterprise of HDSL has been shown possible via multiple modern statistical schemes, the focus of this paper is one of the most successful HDSL techniques, initially proposed by Fan and Li (2001) and Zhang (2010), in the formulation below:

$$\min_{\beta \in \mathbb{R}^p} \left\{ L_{n,\lambda}(\beta, Z_1^n) := L_n(\beta, Z_1^n) + \sum_{j=1}^{p} P_\lambda(|\beta_j|) \right\}, \tag{2}$$

where is a sparsity-inducing penalty in the form of a folded concave penalty (FCP). Particularly, we consider one of the mainstream special cases of the FCP, the minimax concave penalty (MCP) due to Zhang (2010). The specific formulation for the MCP is given as

with . It has been shown that the local and/or global solutions to (2) entail desirable statistical performance (Loh and Wainwright 2015, Wang et al 2013a, 2014, Zhang and Zhang 2012, Loh 2017). Alternatively, other sparsity-inducing penalties, such as the smoothly clipped absolute deviation (SCAD) proposed by Fan and Li (2001), the least absolute shrinkage and selection operator (Lasso), as first proposed by Tibshirani (1994), and the bridge penalty (a.k.a., the penalty with ) as discussed by Frank and Friedman (1993), have all been shown very effective in HDSL by results due to Fan and Li (2001), Bickel et al (2009), Fan and Lv (2011), Fan et al (2014), Loh and Wainwright (2015), Raskutti et al (2011), Negahban et al (2012), Wang et al (2013a, 2014), Zhang and Zhang (2012), Zou (2006), Zou and Li (2008), Liu et al (2017a, 2018) and Loh (2017), etc. Furthermore, Ndiaye et al (2017), El Ghaoui (2010), Fan and Li (2001), Chen et al (2010), and Liu et al (2017a) have presented thresholding rules and bounds on the number of nonzero dimensions for a high-dimensional linear regression problem with different penalty functions. Bühlmann and van de Geer (2011) and Fan et al (2014) have provided excellent reviews on the HDSL theories. Nonetheless, most existing results rely on both of the two plausible assumptions: the (conventional) sparsity, written as with being the number of nonzero entries of a vector, and the restricted strong convexity (RSC), interpretable as the stipulation of the inequality of strong convexity to be satisfied in a restricted subset of . Little is known about the generalization performance of (2), as well as HDSL in general, when either of the two conditions is violated.
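As an illustration of the penalty discussed above, the following is a minimal sketch of the MCP in its standard closed form, assuming the usual parameterization from Zhang (2010) in which the penalty is the integral of max(a·λ − s, 0)/a over [0, t]. The function names and the callable `empirical_risk` are hypothetical and are used only for this example; they are not notation from the paper.

```python
def mcp_penalty(t, lam, a):
    """MCP value P_lam(t) for t >= 0, using the closed form of
    integral_0^t max(a*lam - s, 0) / a ds (standard form, assumed here)."""
    if t < 0:
        raise ValueError("t must be nonnegative; pass abs(beta_j)")
    if t <= a * lam:
        # Concave quadratic part: slope starts at lam and decays to 0.
        return lam * t - t * t / (2.0 * a)
    # Beyond a*lam the penalty is flat, so large entries are not shrunk.
    return a * lam * lam / 2.0


def mcp_derivative(t, lam, a):
    """First derivative P'_lam(t) = max(a*lam - t, 0) / a for t >= 0."""
    return max(a * lam - t, 0.0) / a


def penalized_objective(empirical_risk, beta, lam, a):
    """Objective of the form in Eq. (2): empirical risk plus the sum of
    MCP terms over the coordinates of beta (empirical_risk is a callable)."""
    return empirical_risk(beta) + sum(mcp_penalty(abs(b), lam, a) for b in beta)
```

The flat region beyond a·λ is what distinguishes the MCP from the Lasso penalty, whose slope stays at λ for every magnitude.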

The proposed framework in this paper retains the same formulation as in Eq. (2). In contrast to the literature, we are concerned with employing (2) to address the HDSL problems where the RSC is missing and/or the traditional sparsity is relaxed into the approximate sparsity (A-sparsity) as below.

###### Assumption 1 (A-sparsity)

For some known to be sparse, that is, , it holds that , for some unknown and .

Notice that, if , then Assumption 1 is reduced to the (traditional) sparsity. Intuitively, A-sparsity means that, although can be dense, most of the entries of are small in magnitude so that rounding them to zero does not impact the excess risk much or even make it better, such as in the case where dropout-based regularization tends to improve the excess risk of a neural network. We believe that those weaker conditions will allow the HDSL theories to cover a wider class of applications. Indeed, as we later will articulate, A-sparsity and the non-contingency of the RSC lead to the comprehension of two important problems that are historically scarcely discussed by the literature: (i) An HDSL problem which has a non-differentiable empirical risk function, and (ii) a (deep) neural network model. Both cases will be explained in more detail subsequently.

To our knowledge, the only existing discussions on A-sparsity beyond the RSC are due to Liu et al (2018, 2019), when the underlying distribution is sub-gaussian and/or is twice-differentiable for a.e. . Their results imply that, if is convex for almost every and may violate the RSC, the excess risk of an estimator generated as a certain stationary point to the formulation (2) can be bounded by . (Here represents the rate of the sample complexity ignoring quantities independent of , , , and logarithmic terms that do not depend on .) This bound is reduced to when . In contrast, our findings in the current paper strengthen those results. Specifically, we relax the sub-gaussian assumption stipulated by Liu et al (2018) and impose the weaker, sub-exponential, condition instead. In addition, the assumption on twice-differentiability made by Liu et al (2018, 2019) is also weakened into continuous differentiability. Under the more general settings, we further show that comparable, if not better, error bounds can be achieved at a stationary point that (a) satisfies a set of significant subspace second-order necessary conditions (SONC), and (b) has an objective function value no worse than that of the solution to the Lasso problem:

$$\min_{\beta \in \mathbb{R}^p} \left\{ L_n(\beta, Z_1^n) + \sum_{j=1}^{p} \lambda \cdot |\beta_j| \right\}. \tag{4}$$

We will discuss some SONC-guaranteeing algorithms to satisfy the first requirement soon afterwards; as for the second requirement, we may always initialize the SONC-guaranteeing algorithm with the solution to the Lasso problem in Eq. (4), which is often polynomial-time solvable if is convex given . Our new bounds on those SONC solutions are summarized as below. First, in the case where , we can bound the excess risk by , which is better than the aforementioned result by Liu et al (2018) in the dependence on . Second, when is nonzero, the excess risk is then bounded by

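$$\tilde{O}\left( \frac{\ln p}{n^{2/3}} + \frac{\sqrt{\ln p}}{n^{1/3}} + \sqrt{\frac{\hat{\varepsilon}}{n^{1/3}}} + \hat{\varepsilon} \right). \tag{5}$$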

Third, if we further relax the requirement (b) above and consider an arbitrary SONC solution, then the excess risk becomes

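$$\tilde{O}\left( \frac{\ln p}{n^{2/3}} + \sqrt{\frac{\ln p}{n}} + \frac{1}{n^{1/3}} + \sqrt{\frac{\Gamma + \hat{\varepsilon}}{n^{1/3}}} + \Gamma + \hat{\varepsilon} \right), \tag{6}$$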

where is the suboptimality gap which this SONC solution incurs in minimizing . Note that (5) is a special case of (6). This is because, as per our analysis, the solution to the Lasso problem (4), for some choices of parameters, will incur a suboptimality gap of the order and thus the SONC solution that satisfies the requirement (b) will have . Then (6) can be simplified into (5) by observing that (since by assumption).
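To make the Lasso initialization in requirement (b) concrete, here is a minimal sketch of a proximal gradient (ISTA) solver for the Lasso problem (4). It assumes, purely for illustration, that the empirical risk is the least-squares loss (1/(2n))·‖y − Xβ‖²; the paper's loss is general, and the names `lasso_ista`, `step`, and `iters` are hypothetical. The returned vector could then serve as the starting point of a SONC-guaranteeing method.

```python
def soft_threshold(v, tau):
    """Proximal map of tau*|.|: shrink v toward zero by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def lasso_ista(X, y, lam, step, iters=500):
    """Proximal gradient (ISTA) for (1/(2n))*||y - X beta||^2 + lam*||beta||_1.

    X is a list of n rows and y a list of n responses. `step` should be at
    most 1/L, where L is the largest eigenvalue of X^T X / n (assumed known).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        # Residuals X beta - y and the gradient X^T (X beta - y) / n.
        resid = [sum(X[i][j] * beta[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        # Gradient step on the smooth part, then soft-thresholding.
        beta = [soft_threshold(beta[j] - step * grad[j], step * lam) for j in range(p)]
    return beta
```

For an identity design with two samples, the smooth part has L = 1/2, so `step = 2.0` is admissible and the iteration reaches the closed-form soft-thresholded solution after one pass.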

The SONC, as in the requirement (a), is a necessary condition for local minimality and is weaker than the standard second-order KKT conditions. Admittedly, the SONC entails less important structures than the second-order KKT conditions, yet the former is sufficient to ensure all the claimed results herein for the HDSL. The SONC can be ensured by pseudo-polynomial-time algorithms, such as the variants of Newton’s method proposed by Haeser et al (2017), Bian et al (2015), Ye (1992, 1998) and Nesterov and Polyak (2006). All those variants provably ensure a -approximation (with ) to the second-order KKT conditions, which imply the SONC, at the best-known computational complexity , where ignores all quantities independent of . In contrast, this paper proposes a gradient-based method that theoretically ensures the SONC at pseudo-polynomial-time complexity for some proper choices of the penalty parameter as in Eq. (3). The resulting theoretical complexity is of the order in generating a -approximation, significantly more efficient than the existing algorithms. Furthermore, due to its gradient-based nature, each iteration does not require the computation of the Hessian matrix and thus may outperform the Newton-type approaches, which often require the repetitive computation of the full Hessian matrix as well as its inverse. Therefore, we think that this gradient-based algorithm may be of some independent interest.

Admittedly, the rates of our bounds on the excess risk are less appealing than those made available in some important former work by Loh (2017), Raskutti et al (2011), and Negahban et al (2012), etc., under the RSC. However, we argue that our results are established for a general M-estimator that may complement the existing results by removing the stipulation of the RSC or alike. It is also worth noting that the bounds on some more important statistical metrics, such as the $\ell_1$- and $\ell_2$-loss, which are discussed formerly by Loh (2017), Raskutti et al (2011), Negahban et al (2012), and Loh and Wainwright (2015), become unavailable under the settings of this paper as we seek to address problems with much less regularity. Nonetheless, as an alternative, the excess risk is also a common metric useful to understand the generalization performance. For example, Bartlett et al (2006), Koltchinskii (2010), and Clémençon et al (2008) all consider the excess risk as an important, if not the primary, performance measure.

Apart from generalizing the conventional sparsity into a wider spectrum of applications, the notion of A-sparsity has important implications to high-dimensional nonsmooth learning and (deep) neural networks, as discussed in more detail below.

Most existing HDSL theories, such as those by Fan et al (2014), Raskutti et al (2011), Loh and Wainwright (2015) and Wang et al (2014), assume differentiable statistical loss functions. Hence, it is unknown how these results can be applied to learning problems without differentiability, such as those with absolute deviation, hinge loss, quantile loss, and “$\epsilon$-insensitive” functions (Painsky and Rosset 2016). Although special cases of HDSL with nonsmoothness, such as high-dimensional least absolute regression, high-dimensional quantile regression, and high-dimensional support vector machine (SVM), have been discussed by Wang (2013), Belloni and Chernozhukov (2011), Zhang et al (2016b, c) and/or Peng et al (2016), there exists no theory that applies generally to nonsmooth learning. In contrast, we consider a flexible set of high-dimensional nonsmooth learning problems given as below:

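$$\min_{\beta} \frac{1}{n} \sum_{i=1}^{n} \left[ L_{ns}(\beta, Z_i) := f_1(\beta, Z_i) + \max_{u \in U} \left\{ u^\top A(Z_i) \beta - \hat{\phi}(u, Z_i) \right\} \right], \tag{7}$$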

where and are measurable, deterministic functions, is a known linear operator, and is a convex and compact set with a diameter . We assume further that is continuously differentiable with its partial derivatives being Lipschitz continuous with constant for almost every , that is, , for all , , and , and that is convex and continuous for almost every . To recover the true parameters, with some abuse of notation, under the assumptions that , the corresponding high-dimensional estimation problem would be naturally formulated as

 (8)

Problem (7) has a non-differentiable empirical risk function in general due to the presence of a maximum operator. It is easily verifiable that the least quantile linear regression, the least absolute deviation regression, and the SVM are all special cases of (7). Therefore, their high-dimensional counterparts with sparsity-inducing regularization are formulated as in (8). For this type of nonsmooth learning problem, we propose a modification to (8) formulated as below:

 (9)

for a user-specific and (which is chosen to be later in our theory). This modification adds regularities to the original problem while introducing controllable approximation errors; that is, is a continuously differentiable approximation to and the approximation error increases as decreases. A similar approach has been discussed in the context of nonsmooth optimization by Nesterov (2005).
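The smoothing idea can be illustrated on the simplest nonsmooth loss. The sketch below is not the paper's modification (whose exact smoothing term is not reproduced here); it assumes the classical construction in the spirit of Nesterov (2005), where the absolute value |x| = max over |u| ≤ 1 of u·x is smoothed by subtracting a quadratic term (μ/2)·u² inside the maximum. The parameter name `mu` is hypothetical, and the result is the well-known Huber function.

```python
def smoothed_abs(x, mu):
    """Nesterov-type smoothing of |x| = max_{|u|<=1} u*x, namely
    max_{|u|<=1} { u*x - (mu/2)*u**2 }, which equals the Huber function."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    if abs(x) <= mu:
        # The maximizer u = x/mu lies inside [-1, 1].
        return x * x / (2.0 * mu)
    # The maximizer is clipped to u = sign(x).
    return abs(x) - mu / 2.0


def smoothed_abs_grad(x, mu):
    """Derivative of smoothed_abs; it is Lipschitz with constant 1/mu."""
    return max(-1.0, min(1.0, x / mu))
```

In this parameterization the approximation error satisfies 0 ≤ |x| − smoothed_abs(x, mu) ≤ mu/2, so a smaller `mu` gives a tighter approximation at the price of a larger Lipschitz constant 1/mu for the gradient.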

For a pseudo-polynomial-time computable SONC solution to (9), we show that the excess risk of is bounded by

$$\tilde{O}\left( \frac{\sqrt{\ln p}}{n^{1/4}} \right), \tag{10}$$

with overwhelming probability. To our knowledge, this is perhaps the first generic theory for high-dimensional M-estimation in which the empirical risk function is allowed to be non-differentiable.

A neural network (NN) refers to a machine learning model defined in a nested fashion as below. Denote by an activation function, such as the ReLU, the softplus, and the sigmoid. The NN model is then a network that consists of input units, output units, and hidden computation units. Each of those units is referred to as a neuron. The NN model often consists of three or more layers (groups) of neurons. The first layer contains one or more input units, the last layer contains one or more output units, and every other layer contains one or more hidden computation units. The layers that only have hidden computation units are referred to as the hidden layers. Hierarchies are formed in the sense that a neuron will only be fed information from the ones in the preceding layers, namely, the layers that are closer to the input layer. If information can be passed from a neuron A to a neuron B, then we say that the link exists from the neuron A to the neuron B.

Each hidden computation unit performs an operation of the form with fitting parameters (a.k.a., bias terms and weights) denoted as , where stands for the number of neurons that are linked to this hidden computation unit in the previous layer(s) and is the output from one of those neurons. The output units are also computational units which yield , for some fitting parameters and the number of linked neurons . With some abuse of notation, we collect all the fitting parameters of an NN and denote them by the vector for a proper dimension . In our discussion, we follow Yarotsky (2017) to assume that there is only one output node. Yet, just like the results by Yarotsky (2017), our findings can be easily generalized to the cases where more than one output node is present.
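The nested definition above can be sketched in a few lines. The example below makes several simplifying assumptions that are not taken from the paper: the network is fully connected layer by layer (the text also allows links from any preceding layer), the single output unit is affine without an activation, and the helper names `forward` and `flatten_parameters` are hypothetical.

```python
def relu(x):
    """ReLU activation, max(0, x)."""
    return max(0.0, x)


def forward(x, hidden_layers, output_unit, activation=relu):
    """Evaluate a small fully connected network with one output node.

    hidden_layers: list of layers; each layer is a list of (bias, weights)
    pairs, one per hidden unit, with weights matching the previous layer.
    output_unit: a (bias, weights) pair applied without activation (assumed).
    """
    signal = list(x)
    for layer in hidden_layers:
        # Each hidden unit applies the activation to bias + weighted inputs.
        signal = [activation(b + sum(w * s for w, s in zip(ws, signal)))
                  for b, ws in layer]
    b_out, w_out = output_unit
    return b_out + sum(w * s for w, s in zip(w_out, signal))


def flatten_parameters(hidden_layers, output_unit):
    """Collect every bias and weight into one vector (the role of beta)."""
    beta = []
    for layer in hidden_layers:
        for b, ws in layer:
            beta.append(b)
            beta.extend(ws)
    beta.append(output_unit[0])
    beta.extend(output_unit[1])
    return beta
```

The length of the flattened vector is the dimensionality that enters the bounds, which is why adding layers or widening them increases the number of fitting parameters quickly.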

The NNs have been frequently discussed and widely applied in recent literature (Schmidhuber 2015, LeCun et al 2015, Yarotsky 2017). Despite the promising and frequent advancement in NN-related algorithms, models, and applications, the development of their theoretical underpinnings is seemingly lagging behind. To analyze the efficacy of an NN, several exciting results, including DeVore et al (1989), Yarotsky (2017), Mhaskar and Poggio (2016), and Mhaskar (1996), have explicated the expressive power of the NNs in the universal approximation to different types of functions. Apart from the expressive power, however, other theoretical and statistical aspects of the NNs have been scarcely explored. Particularly, one open question looms large from the recent literature: how to theoretically ensure the generalization performance of an NN when it is trained with finitely many samples. To address this question, some significant advances have been reported by Bartlett et al (2017), Golowich et al (2017), Neyshabur et al (2015, 2017, 2018), and Arora et al (2018). For most of the existing results, the generalization error depends polynomially on the dimensionality (number of weights). Based on those bounds, it seems necessary to require the sample size to be larger than the number of fitting parameters in order to ensure that the model generalizes properly. This is, however, inconsistent with the successful performance of NNs in many practical applications, since “over-parameterization” is common in many successful applications of the NNs and the number of fitting parameters grows rapidly as one makes the NNs deeper for more sophisticated tasks. To our knowledge, the only existing theory that explains the generalization performance of an NN under “over-parameterization” is by Barron and Klusowski (2018), who show the possibility that the generalization error deteriorates only in the logarithm of the dimensionality. Nonetheless, the discussions by Barron and Klusowski (2018) focus on the existence of such an NN under the assumption of a specific form of activation function called the ramp function. It is then unknown how to train the NN to obtain the desired generalization performance; in other words, it is unclear if any local/global optimal solution is sufficient or any specialized training schemes should be employed to generate the desired NN that is shown existent.

Our results provide an HDSL-based analysis on the NNs that may entail a similar advantage as those by Barron and Klusowski (2018) in terms of being insensitive to the increase of the number of the fitting parameters and thus to the growth of the hidden layers; our bound on the generalization error is also logarithmic in dimensionality. Furthermore, in contrast to Barron and Klusowski (2018), our analysis may present better flexibility and more insights towards the computability of a desired solution: We provide a generalization bound to all stationary points in a regularized model fitting problem for an NN and our results apply to a flexible class of activation functions, including ReLU functions. We think that the results herein can help understand better the NNs’ powerful performance in practice.

Specifically, we will consider the following estimation problem: Let where , for and some , be random inputs and white noises. We are then interested in approximating through a neural network , parameterized in with , through the training with only the observations of identically distributed random variables , for , and is assumed as a sequence of independent random variables. Also assume that for any , which is verifiably satisfied by most existing versions of NNs. The training model of interest is then formulated as a special case of (2):

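$$\min_{\beta} \; T_\lambda(\beta) := \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - F_{NN}(x_i, \beta) \right)^2 + \sum_{j=1}^{p} P_\lambda(|\beta_j|), \tag{11}$$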

where is one of the most commonly used formulations of the empirical risk in training a neural network. Though we choose this particular formulation, we would like to remark that our machinery can be easily generalized to formulations with alternative empirical risk functions. Our results indicate that, for any NN architecture and any choice of , the generalization error incurred by an SONC solution to (11) with suboptimality gap is bounded by

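$$\underbrace{\tilde{O}(1) \cdot \left( \frac{\hat{s}}{n^{2/3}} + \sqrt{\frac{\hat{s}+1}{n}} + \frac{1}{n^{1/3}} \right) \cdot \ln p}_{\text{Minimal generalization error}} + \underbrace{\Gamma}_{\text{Suboptimality gap}} + \underbrace{\Omega(\hat{s}) + \Omega(p)}_{\text{Representability gap}} + \underbrace{\tilde{O}(1) \cdot \sqrt{\frac{\Gamma + \Omega(\hat{s})}{n^{1/3}}}}_{\text{Interaction term}},$$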

for any fixed , with overwhelming probability, where , for any , is the (minimal) representability error (a.k.a., model misspecification error) of a neural network with -many nonzero parameters given a fixed network architecture, calculated as

$$\Omega(\hat{p}) = \min_{\beta:\ \|\beta\|_0 \le \hat{p}} \ \max_{x \in [0,1]^d} \left| F_{NN}(x, \beta) - g(x) \right|, \tag{12}$$

for any . Evidently, if . As is seen from the above bound, the generalization error of an NN consists of four terms: (i) the minimal generalization error term that is present regardless of how well the network is trained in terms of the optimization quality of solving (11); (ii) the suboptimality gap term that measures the optimization quality; (iii) a term that measures the misspecification error; and (iv) a term that is intertwined with suboptimality gap, sample size, and representability.

Furthermore, in the special case of a ReLU neural network, when has a well-defined -th order weak derivative, the generalization error is then reduced to

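$$\tilde{O}\left( \left( \frac{r+d}{d \cdot n^{1/3}} + \frac{1}{n^{\frac{1}{6} + \frac{r}{3d}}} + \frac{1}{n^{\frac{r}{3d}}} + \frac{\sqrt{\Gamma}}{n^{1/6}} \right) \cdot \ln p + \Gamma \right), \tag{13}$$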

for an SONC solution (a stationary point) in the -sublevel set with overwhelming probability, for any . For example, if is a polynomial function, (we may as well let in this case, since is verifiably -times weakly differentiable;) then the generalization error becomes: . To our knowledge, this is the first computable theory that allows the NN to entail a desirable generalization performance insensitive to the dimensionality; if the dimensionality increases exponentially, one only needs to polynomially increase the sample size to maintain the same level of generalization error. This result is particularly suitable for deep NNs in view of “over-parameterization”. Furthermore, it is worth noting that we do not artificially impose any condition on sparsity or alike in order to establish the above result. In fact, the A-sparsity is an intrinsic property of an NN.
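The claim that an exponential increase in dimensionality only calls for a polynomial increase in sample size can be checked with simple arithmetic. The sketch below assumes, for illustration only, that the leading term of the bound behaves like ln(p)/n^(1/3) with all constants, the suboptimality gap, and the representability terms ignored; the function name and the `target` level are hypothetical.

```python
import math


def samples_for_target(p, target, exponent=1.0 / 3.0):
    """Smallest n with ln(p) / n**exponent <= target (constants ignored)."""
    if p <= 1 or target <= 0:
        raise ValueError("need p > 1 and target > 0")
    return math.ceil((math.log(p) / target) ** (1.0 / exponent))


# Squaring the dimensionality doubles ln(p), so n grows by about 2**3 = 8.
n_small = samples_for_target(10 ** 3, 0.5)
n_large = samples_for_target(10 ** 6, 0.5)
```

Under this simplified rate, going from a thousand to a million fitting parameters multiplies the required sample size by roughly eight rather than by a thousand.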

We would like to make an additional remark on our generalization bound (13). By this bound, each stationary point entails a bounded excess risk that is strictly monotonic in the suboptimality gap; the better the optimization quality (that is, the smaller ), the more desirable the generalization error. This is the desired consistency between optimization performance and generalization performance in training an NN, a phenomenon empirically observed by multiple former works (e.g., by Zhang et al 2016a, Wan et al 2013b) when the NNs are trained with various regularization schemes. To our knowledge, this paper presents the first theoretical evidence for such consistency.

Our results for NNs may justify and/or can be combined with the use of the alternative sparsity-inducing regularization schemes, such as Dropout (Srivastava et al 2014), sparsity-inducing penalization (Han et al 2015, Scardapane et al 2017, Louizos et al 2017, Wen et al 2016), DropConnect (Wan et al 2013b), and pruning (Alford 2018), etc. Nonetheless, most of those former results focus on the numerical aspects of the different regularization and little is known if the empirical successes therein could have a theoretical guarantee. Although Wan et al (2013b) presented some generalization error analyses for DropConnect, the correlation among the dimensionality, the generalization error, and the sample size is not explicated therein.

Table 1 summarizes the sample complexities from this paper. In contrast to the literature, we claim that our results will lead to the following contributions:

• We provide the first HDSL theory for problems where the three conditions, twice-differentiability, RSC, and sparsity, may be simultaneously relaxed. Our results apply to a flexible class of problems where the empirical risk function is only continuously differentiable, the RSC is completely absent, and the true parameters satisfy the A-sparsity (that is, they are not necessarily sparse but can be approximated by a sparse vector). HDSL is possible in such scenarios even if the sample size is only poly-logarithmic in the dimensionality. Particularly, the resulting sample complexity is presented in Table 1 in the rows for “SONC to (2), initialized with Lasso” and “SONC to (2) with suboptimality gap ”, where is the error introduced by the sparse approximation as introduced in Assumption 1.

• Based on the above, we put forward a unifying framework to analyze the HDSL problems where the empirical risk function is non-differentiable and the non-differentiability is introduced by a convex piecewise function. Poly-logarithmic sample complexity is also attained when the (conventional) sparsity assumption is imposed. The detailed bound is presented in Table 1 in the row for “SONC to (9) initialized with Lasso”.

• We present perhaps the first theory on the integration of NN with HDSL and show that any SONC solution to the NN formulation regularized with the FCP entails a bounded generalization error, which is poly-logarithmic in the number of fitting parameters and, thus, only poly-logarithmic in the number of hidden layers, as presented in Table 1 in the rows for “SONC to (11) with suboptimality gap ” and “SONC to (11) with suboptimality gap when is polynomial”.

• We derive a novel first-order method that can achieve the SONC at pseudo-polynomial-time complexity and is provably more advantageous than the existing second-order algorithms in generating statistically desirable solutions to the HDSL; even though the SONC is a second-order necessary condition, the proposed algorithm does not need to access the Hessian matrix, resulting in a much lower per-iteration cost. Furthermore, the iteration complexity is one order of magnitude better than that of the conventional second-order algorithms.

Note that, in Table 1, the parameter is an intrinsic factor in characterizing the generalization performance of all the stationary points that satisfy the SONC with different optimization quality; the generalization errors of those solutions are dependent on . For problems with a convex empirical risk function, the impact of can be well contained; initialized with the Lasso solution, which usually is tractable under the convexity assumption, the resulting SONC solution will yield a suboptimality gap vanishing in . For the training of an NN, on the other hand, we are not able to tractably guarantee to be small, due to the innate nonconvexity. Nonetheless, in practice many empirical results indicate that can be small for many NN-based variants, such as those by Wan et al (2013b) and Alford (2018). Furthermore, it is currently another important research direction to show that some stationary points of the NN model are close to the global minimizer so that is indeed small (see, e.g., Du 2018, Haeffele and Vidal 2017).

The rest of the paper is organized as below: Section id1 introduces the SONC. Section id1 states the main results, whose proofs are provided in Section id1. Section id1 discusses the theoretical applications of our main results to high-dimensional nonsmooth learning and the regularized (deep) neural networks. A pseudo-polynomial-time solution scheme that guarantees the SONC is introduced in Section id1. Some numerical evidences on our theoretical findings are presented in Section 15. Finally, Section id1 concludes our paper.

We will denote by () the -norm, except that - and -norms are denoted by and , respectively. When there is no ambiguity, we also denote by the cardinality of a set, if the argument is then a finite set. Let of a matrix be its Frobenius norm. Also denote by the number of non-zero dimensions of a vector. We use and to represent the numbers of dimensions and samples. We denote that . With some abuse of terminology, we will refer to a solution that satisfies the SONC as an SONC solution. For a function , denote by its gradient. For a vector and a set , let be a sub-vector of . For a random variable , the sub-gaussian and sub-exponential norms of are denoted and , respectively. Finally, denotes the vector with a 1 in the th coordinate and 0’s elsewhere.

In this paper, we generalize the significant subspace second-order necessary condition (SONC) by Chen et al (2010) and Liu et al (2017a, 2018). Specifically, Chen et al (2010) provide a second-order necessary condition that is equivalent to the SONC for linear regression with bridge regularization. Then, Liu et al (2017a, 2018) consider the SONC in a more general setting under the assumption that the empirical risk function is twice differentiable. Such an assumption is further relaxed in this paper.

###### Definition 1

For given , a vector is said to satisfy the SONC (denoted by SONC) of the problem (2) if both of the following sets of conditions are satisfied:

1. The first-order KKT conditions are met at ; that is, there exists :

$$\nabla L_{n,\lambda}(\hat{\beta}, Z_1^n) = 0, \tag{14}$$

where is the sub-differential of w.r.t.  .

2. The following inequality holds at : for all , if , then

$$U_L + P''_\lambda(|\hat{\beta}_j|) \ge 0. \tag{15}$$

Here , defined as in (1), is the component-wise Lipschitz constant. One may easily verify that the SONC is implied by the second-order KKT conditions. With some abuse of notation, the SONC() in the special case of problem (9) is then the same set of conditions with . Meanwhile, the SONC() in (11) is referred to as the SONC(), since the random samples are in the format of and thus the notation with and .
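The second condition of Definition 1 is easy to check numerically when the penalty is the MCP. The sketch below assumes the standard MCP, whose second derivative is −1/a on (0, a·λ) and 0 beyond a·λ, and assumes that inequality (15) is required only for coordinates whose magnitude lies in (0, a·λ); that range is an assumption of this example, since it is not legible in the text above. The function names are hypothetical.

```python
def mcp_second_derivative(t, lam, a):
    """P''_lam(t) for the MCP on the open intervals where it exists:
    -1/a on (0, a*lam) and 0 on (a*lam, infinity)."""
    if t <= 0 or t == a * lam:
        raise ValueError("second derivative is not defined at this point")
    return -1.0 / a if t < a * lam else 0.0


def violates_second_order_condition(beta, lam, a, U_L):
    """Indices j with |beta_j| in (0, a*lam) where U_L + P''(|beta_j|) < 0."""
    bad = []
    for j, b in enumerate(beta):
        t = abs(b)
        if 0.0 < t < a * lam and U_L + mcp_second_derivative(t, lam, a) < 0.0:
            bad.append(j)
    return bad
```

Under these assumptions, whenever U_L < 1/a the inequality fails at every coordinate strictly between 0 and a·λ in magnitude, so a point satisfying the condition can only have entries that are exactly zero or at least a·λ in magnitude, which acts as a thresholding rule.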

It has been shown by Haeser et al (2017), Bian et al (2015), and Ye (1998) that the second-order KKT conditions, which imply the SONC, are pseudo-polynomial-time computable. Our new algorithm in Section id1 also ensures the SONC at pseudo-polynomial-time cost.

Our assumptions concern the tail of the underlying distribution (Assumption 2) and continuity (Assumption 3).

###### Assumption 2

For all , it holds that , , are independent random variables following sub-exponential distributions; that is, for some .

###### Remark 1

As an implication of the above assumption, for all , it holds that

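$$\mathbb{P}\left( \left| \sum_{i=1}^{n} a_i \left\{ L(\beta, Z_i) - \mathbb{E}[L(\beta, Z_i)] \right\} \right| > \sigma \left( \|a\| \sqrt{t} + \|a\|_{\infty} t \right) \right) \le 2 \exp(-ct), \quad \forall\, t \ge 0,\ a = (a_i) \in \mathbb{R}^n, \tag{16}$$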

for some absolute constant . Interested readers are referred to Vershynin (2012) for more discussions on the sub-exponential distribution. For notational simplicity in our discussion later, we have let .

###### Assumption 3

For some measurable and deterministic function , the random variable satisfies that for all for some , where we let for all for some . Furthermore, for all and almost every .

###### Remark 2

The stipulations of and can be easily relaxed and are needed only for notational simplicity in our results to be presented.

###### Remark 3

Assumptions 2 and 3 are general enough to cover a wide spectrum of M-estimation problems. More specifically, Assumption 2 requires that the underlying distribution is sub-exponential, and Assumption 3 essentially imposes Lipschitz continuity on . These two assumptions are either standard or easily verifiable. The combination of our Assumptions 1 through 3 is non-trivially weaker than the settings in Liu et al (2017a, 2018).

Introduce a few short-hand notations: and . We are now ready to present our claimed results. We will first present the most general result of this paper in Proposition 1. In this proposition, the parameter is left to be optimally determined in different special cases. Then Theorem 1 presents one of those cases.

###### Proposition 1

Suppose that Assumptions 1 through 3 hold. For any , let for the same in (16) and . The following statements hold at any solution that satisfies the SONC to (\the@equationgroup@IDa) almost surely:

• For any and some universal constants , if

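$$n > C_1 \cdot \left[\left(\frac{\Gamma + \hat{\varepsilon}}{\sigma}\right)^{\frac{1}{1-2\varrho}} + s \cdot \left(\ln(n^{\varrho} p) + \tilde{\zeta}\right)\right], \tag{17}$$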

and almost surely, then the excess risk is bounded by

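$$L(\hat{\beta}) - L(\beta^*) \le C_1 \cdot \left(\frac{s \cdot (\ln(n^{\varrho} p) + \tilde{\zeta})}{n^{2\varrho}} + \sqrt{\frac{s \cdot (\ln(n^{\varrho} p) + \tilde{\zeta})}{n}} + \frac{1}{n^{\varrho}} + \frac{1}{n^{1-2\varrho}} + \frac{1}{n^{\frac{1-\varrho}{2}}}\right) \cdot \sigma + C_1 \cdot \sqrt{\frac{\sigma(\Gamma + \hat{\varepsilon})}{n^{1-2\varrho}}} + \Gamma + \hat{\varepsilon}, \tag{18}$$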

with probability at least .

• For some universal constants , if

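$$n > C_2 \cdot \left(\frac{\hat{\varepsilon}}{\sigma}\right)^{\frac{1}{1-2\varrho}} + C_2 \cdot a^{-1} \cdot \left[\ln(n^{\varrho} p) + \tilde{\zeta}\right] \cdot s^{\max\left\{1,\ \frac{1}{2-4\varrho},\ \frac{1}{2\varrho}\right\}} R^{\max\left\{\frac{1}{2-4\varrho},\ \frac{1}{2\varrho}\right\}}, \tag{19}$$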

and almost surely, then the excess risk is bounded by

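$$L(\hat{\beta}) - L(\beta^*) \le C_2 \cdot \left[\frac{s(\ln(n^{\varrho} p) + \tilde{\zeta})}{n^{2\varrho}} + \frac{1}{n^{\varrho}} + \frac{1}{n^{1-2\varrho}}\right] \cdot \sigma + C_2 \cdot \frac{s R \sigma^{3/4}}{\min\left\{a^{1/2} n^{\varrho},\ a^{1/4} n^{\frac{1-\varrho}{2}}\right\}} \left[\ln(n^{\varrho} p) + \tilde{\zeta}\right]^{1/2} + C_2 \cdot \sqrt{\frac{\sigma \hat{\varepsilon}}{n^{1-2\varrho}}} + \hat{\varepsilon}, \tag{20}$$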

with probability at least .

###### Proof.

See Section id1. Q.E.D.

###### Remark 4

Proposition 1 does not rely on convexity or RSC. The first part of Proposition 1 presents the most general result that we have. Specifically, we show that, for all the SONC solutions, the excess risk can be bounded by a function of the suboptimality gap . This explicates the consistency between the statistical performance of a stationary point to an HDSL problem and the optimization quality of that stationary point in minimizing the corresponding penalized objective function. The second part of Proposition 1 concerns an arbitrary SONC solution that has an objective function value, measured by , smaller than that of . For such a solution, the excess risk can be well contained even if the sample size is only poly-logarithmic in the number of dimensions. This desired solution can be generated by a two-step approach: we first solve for , which is often polynomial-time computable if is convex given . Then, we compute an SONC solution through a local optimization approach that uses as the initial point.
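The two-step approach described in Remark 4 can be sketched as follows. This is a minimal illustration and not the algorithm of this paper: it assumes a least-squares loss, takes the initializer to be an $\ell_1$-penalized (Lasso-type) solution, uses the minimax concave penalty (MCP) with parameter `a` as the FCP, and runs plain proximal gradient for both steps. All function names are hypothetical. For simplicity the step size `eta` is kept below `a`, so that the MCP proximal step has a closed form; the output is not certified to satisfy condition (15).

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal map of tau * |.| (used for the l1 initializer)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def mcp_penalty(beta, lam, a):
    """Sum of MCP values: lam*|t| - t^2/(2a) if |t| <= a*lam, else a*lam^2/2."""
    t = np.abs(beta)
    return np.sum(np.where(t <= a * lam, lam * t - t**2 / (2 * a), a * lam**2 / 2))

def mcp_prox(z, lam, a, eta):
    """Exact proximal map of eta * MCP, valid only when eta < a."""
    t = np.abs(z)
    shrunk = np.sign(z) * np.maximum(t - eta * lam, 0.0) / (1.0 - eta / a)
    return np.where(t > a * lam, z, shrunk)

def penalized_objective(X, y, beta, lam, a):
    """Least-squares empirical risk plus the MCP penalty."""
    n = X.shape[0]
    return 0.5 * np.sum((X @ beta - y) ** 2) / n + mcp_penalty(beta, lam, a)

def prox_gradient(X, y, beta0, prox, eta, n_iter=500):
    """Proximal gradient iterations on the least-squares loss."""
    n = X.shape[0]
    beta = beta0.copy()
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = prox(beta - eta * grad)
    return beta

# Synthetic sparse regression data (illustrative values only).
rng = np.random.default_rng(0)
n, p, s = 60, 200, 5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = 2.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)

lam, a = 0.3, 3.0
eta = n / np.linalg.norm(X, 2) ** 2  # 1 / (Lipschitz constant of the gradient)
assert eta < a                       # needed for the closed-form MCP prox

# Step 1: convex l1-penalized problem, started from zero.
beta_l1 = prox_gradient(X, y, np.zeros(p), lambda z: soft_threshold(z, eta * lam), eta)
# Step 2: local optimization of the MCP-penalized objective from beta_l1.
beta_fcp = prox_gradient(X, y, beta_l1, lambda z: mcp_prox(z, lam, a, eta), eta)

obj_l1 = penalized_objective(X, y, beta_l1, lam, a)
obj_fcp = penalized_objective(X, y, beta_fcp, lam, a)
```

With the step size $1/L$ and an exact proximal step, each iteration of the second stage does not increase the MCP-penalized objective, so `obj_fcp` is no larger than `obj_l1`. This mirrors the requirement in the second part of Proposition 1 that the local solution be at least as good as the initializer in penalized objective value.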

###### Remark 5

We may as well let and thus to satisfy the stipulation of Theorem 1.

###### Remark 6

For any feasible choice of , each of the two parts of Proposition 1 has already established the poly-logarithmic correlation between the dimensionality and the sample size ; polynomially increasing the sample size can compensate for exponential growth in the dimensionality. We may further pick a reasonable value for and obtain a more detailed result as in Theorem 1 below.

###### Theorem 1

Let and for the same in (16). Suppose that Assumptions 1 through 3 hold. Let satisfy the SONC to (\the@equationgroup@IDa) almost surely. The following statements hold:

• For any and some universal constants , if

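$$n > C_3 \cdot \left[\left(\frac{\Gamma + \hat{\varepsilon}}{\sigma}\right)^{3} + s \cdot \left(\ln(n^{1/3} p) + \tilde{\zeta}\right)\right], \tag{21}$$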

and almost surely, then the excess risk is bounded by

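$$L(\hat{\beta}) - L(\beta^*) \le C_3 \sigma \cdot \left[\frac{s \cdot (\ln(n^{1/3} p) + \tilde{\zeta})}{n^{2/3}} + \sqrt{\frac{s \cdot (\ln(n^{1/3} p) + \tilde{\zeta})}{n}} + \frac{1}{n^{1/3}}\right] + C_3 \cdot \sqrt{\frac{\sigma(\Gamma + \hat{\varepsilon})}{n^{1/3}}} + \Gamma + \hat{\varepsilon} \tag{22}$$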

with probability at least .

• For some universal constant , if

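$$n > C_4 \cdot \left(\frac{\hat{\varepsilon}}{\sigma}\right)^{3} + C_4 \cdot a^{-1} \cdot \left[\ln(n^{1/3} p) + \tilde{\zeta}\right] \cdot s^{3/2} R^{3/2}, \tag{23}$$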

and almost surely, then the excess risk is bounded by

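$$L(\hat{\beta}) - L(\beta^*) \le C_4 \cdot a^{-1/2} \cdot s \cdot \sigma \cdot \left[\frac{\ln(n^{1/3} p) + \tilde{\zeta}}{n^{2/3}} + \frac{R\sqrt{\ln(n^{1/3} p) + \tilde{\zeta}}}{n^{1/3}}\right] + C_4 \cdot \sqrt{\frac{\sigma \hat{\varepsilon}}{n^{1/3}}} + \hat{\varepsilon} \tag{24}$$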

with probability at least .

###### Proof.

Immediate from Proposition 1 with $\varrho = 1/3$. Q.E.D.

Our remarks concerning Proposition 1 also apply to Theorem 1, since the latter is a special case when . We would like to point out that, if , the excess risk in (24) is simplified into .
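To make the poly-logarithmic dependence on the dimensionality concrete, the sketch below evaluates the sample-size requirement (21) and the right-hand side of (22) numerically. Everything numeric here is an assumption made for illustration: the universal constant is set to 1, $\tilde{\zeta} = 1$, $\sigma = 1$, and $\Gamma = \hat{\varepsilon} = 0$ by default; the function names are hypothetical. Since $n$ appears on both sides of (21), the smallest admissible integer is found by a monotone fixed-point iteration.

```python
import math

def rhs_21(n, p, s, gamma=0.0, eps=0.0, sigma=1.0, zeta=1.0, C3=1.0):
    """Right-hand side of (21); ln(n^{1/3} p) is expanded as ln(n)/3 + ln(p)."""
    log_term = math.log(n) / 3.0 + math.log(p) + zeta
    return C3 * (((gamma + eps) / sigma) ** 3 + s * log_term)

def required_n(p, s, **kw):
    """Smallest integer n >= 1 with n > rhs_21(n, ...).

    rhs_21 is non-decreasing in n, so jumping to floor(rhs) + 1 never
    overshoots the smallest solution.
    """
    n = 1
    while n <= rhs_21(n, p, s, **kw):
        n = math.floor(rhs_21(n, p, s, **kw)) + 1
    return n

def risk_bound_22(n, p, s, gamma=0.0, eps=0.0, sigma=1.0, zeta=1.0, C3=1.0):
    """Right-hand side of (22) with illustrative constants."""
    log_term = math.log(n) / 3.0 + math.log(p) + zeta
    stat = s * log_term / n ** (2 / 3) + math.sqrt(s * log_term / n) + n ** (-1 / 3)
    opt = C3 * math.sqrt(sigma * (gamma + eps) / n ** (1 / 3)) + gamma + eps
    return C3 * sigma * stat + opt

# Squaring the dimensionality (10^3 -> 10^6) changes the requirement only mildly.
n_small = required_n(p=10**3, s=10)
n_large = required_n(p=10**6, s=10)
bound_at_1e4 = risk_bound_22(n=10**4, p=10**6, s=10)
bound_at_1e6 = risk_bound_22(n=10**6, p=10**6, s=10)
```

Under these illustrative constants the required sample size grows with $\ln p$ rather than with $p$, and the bound in (22) decreases as $n$ increases. The actual constants in Theorem 1 are not specified, so the numbers produced here indicate scaling only.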

Below we present two important theoretical applications of the HDSL theories under A-sparsity. Section id1 presents the discussion for a flexible class of high-dimensional nonsmooth learning. Section 7 then considers the generalization performance of a regularized (deep) neural network.

We consider the high-dimensional nonsmooth learning problems whose setups have been discussed by Section id1. The excess risk, as promised in (\the@equationgroup@IDi), is then provided by Theorem 2, in which we will let