# Statistical Analysis of Stationary Solutions of

Coupled Nonconvex Nonsmooth Empirical Risk Minimization

###### Abstract

This paper has two main goals: (a) establish several statistical properties—consistency, asymptotic distributions, and convergence rates—of stationary solutions and values of a class of coupled nonconvex and nonsmooth empirical risk minimization problems, and (b) validate these properties by a noisy amplitude-based phase retrieval problem, the latter being of much topical interest. Derived from available data via sampling, these empirical risk minimization problems are the computational workhorse of a population risk model which involves the minimization of an expected value of a random functional. When these minimization problems are nonconvex, the computation of their globally optimal solutions is elusive. Together with the fact that the expectation operator cannot be evaluated for general probability distributions, it becomes necessary to justify whether the stationary solutions of the empirical problems are practical approximations of the stationary solution of the population problem. When these two features, general distribution and nonconvexity, are coupled with nondifferentiability that often renders the problems “non-Clarke regular”, the task of the justification becomes challenging. Our work aims to address such a challenge within an algorithm-free setting. The resulting analysis is therefore different from the much of the analysis in the recent literature that is based on local search algorithms. Furthermore, supplementing the classical minimizer-centric analysis, our results offer a first step to close the gap between computational optimization and asymptotic analysis of coupled nonconvex nonsmooth statistical estimation problems, expanding the former with statistical properties of the practically obtained solution and providing the latter with a more practical focus pertaining to computational tractability.

KEY WORDS: Statistical analysis; Consistency; Convergence rates; Directional stationarity; Asymptotic distribution; Nonconvexity; Nonsmoothness; Phase retrieval problem.

## 1 Introduction

Given a probability space , where is the sample space, is the -field generated by , and is the corresponding probability measure, a parameterized random function , and a compact convex set , we consider the population risk minimization problem

(1) |

In this setting, is a random vector defined on the probability triple ; the tilde on signifies a random variable, whereas without the tilde will refer to a realization of the random variable. This convention of distinguishing a random variable and its realizations will be used throughout the paper. Subsequently, structure of will be imposed for the purpose of analysis. The expectation function in (1) often does not have a closed form expression so that algorithms for solving deterministic optimization problems may not be directly applicable. There are two classical Monte-Carlo sampling based approaches to solve the expected-value minimization problem (1): stochastic approximation (SA) and sample average approximation (SAA). The SA proposed by Robbins and Monro [42] in the 1950s is a stochastic (sub)gradient method that updates each iterate along the opposite (sub)gradient direction estimated from one or a small batch of samples. It has attracted a great attention recently in machine leaning and stochastic programming communities, partially due to its scalability and easy fitting to the online settings. Interested readers are referred to [8, 40, 41, 37] and the references therein for the development of the SA. The SAA method, on the other hand, takes independent and identically distributed (i.i.d) random samples with the same distribution as and estimate the expectation function with the sample average approximation, resulting in the empirical risk minimization or the M-estimation problem:

(2) |

There is a vast literature on the asymptotic analysis of the M-estimators/SAA solutions related to the optimal solution of the expectation problem (1) as the sample size goes to infinity. The first celebrated consistency result dates back to 1920s by R.A. Fisher in [19, 20] for the maximum likelihood estimation (MLE) problems. An proof of the consistency of MLE is given by Wald in [59]. Notice that the MLE is a special case of the problem (1) if we take the function as the negative logarithm of probability density/mass functions. Other important developments of the global optimal solutions of the M-estimation in the statistical literature include [24, 7, 28, 49]. The consistency and asymptotic distributions of the local optimal solutions for smooth optimization problems are studied by Geyer in [21]. Most recently, Royset et al. [45, 46] employed variational analysis to study statistical properties of M-estimators of non-parametric problems. In the field of stochastic programming, the study of the asymptotic behavior of the optimal solutions begins with the work of Wets [61], and is further developed in [15, 50] with inequality constraints and nonsmooth objective functions using the tools from nonsmooth analysis. Recently, the article [12] studies the statistical estimation of composite risk functionals and risk optimization problems and establishes a central limit formula of their optimal values when an estimator of the risk functional is used. Interested readers are referred to the monographs [54, Section 5.2] and [52, Section 5] for comprehensive treatment of the asymptotic analysis of the M-estimators/SAA solutions. However, all these results pertain to the global or local minimizers of the optimization problems or the (globally) optimal objective values, regardless of the possibility that the latter problems may be nonconvex. Since in general one cannot find a global or local optimal solution to the nonconvex optimization problems, any consistency results that are based on the global or local minimizers are at best ideal targets for such problems and have little practical significance. The situation becomes more serious when nondifferentiability is coupled with nonconvexity because there is a host of stationary solutions of the resulting optimization problems. Typically, the sharper the stationarity solution is (sharp in the sense of least relaxation in its definition), the more difficult it is to compute. It is thus important to understand whether in practice, the focus should be placed on computing sharp stationary solutions (which distinguish themselves as being the ones that must satisfy all other relaxed definitions of stationarity) that potentially require higher computational costs versus computing some less demanding solutions. Our derived results show that the sharpness of the stationarity at the empirical level is preserved at the population level, thus favoring the former. Furthermore, via a noisy amplitude-based phase retrieval problem that is of much topical interest, we demonstrate that a stationary point of a relaxed kind can have no bearings to a minimizer, both in the population and empirical problems. In short, there is presently a gap in the literature between the asymptotic minimizer-centric analysis of statistical estimation problems in the presence of (coupled) nonconvexity and nondifferentiability and the computational tractability of the solutions being analyzed. Our work offers a first step in closing this gap.

When the expected-value objective function in (1) is differentiable, the stationary points of problem (1) can be characterized by the solutions of the following stochastic generalized equation

where denotes the normal cone of at as in convex analysis, see, e.g., [43]. Similarly, a stationary point of the empirical risk minimization (2) satisfies

The consistency and asymptotic distributions of the solutions for such a stochastic generalized equation have been established in the literature such as [27, 23, 51]. See also [35] for the correspondence of stationary solutions between the empirical risk and the population risk when the sample size is sufficiently large.

While the consistency of the global optimal values and solutions is mainly due to the uniform law of large numbers for real-valued random functions, the consistency of the stationary solutions of nonconvex nonsmooth problems needs the uniform law of large numbers for set-valued subdifferentiable mappings. It is well known that Attouch’s celebrated theorem on the equivalence of the epiconvergence of a sequence of convex functions and the graphical convergence of the subdifferential [1] fails for general nonconvex functions, which makes the asymptotic analysis for the SAA a challenging task when applied to a nonconvex problem. For a special case where the function is Clarke regular [9, Section 2] for almost all , the uniform law of large numbers for random set-valued Clarke regular mappings is established in [53] and the consistency of Clarke stationary points is also provided therein.

Many modern statistical and machine learning problems consist of inherently coupled nonconvex and nonsmooth objective functions. More specifically, the objective functions therein cannot be decomposed into either the sum of a smooth nonconvex function and a nonsmooth convex function, or the composition of a convex function and a smooth function; see the examples in Section 2. Such functions often fail to satisfy the Clarke regularity so that the results in [53] are no longer valid. In particular, the inclusions (8) and (9) can be strict. Furthermore, the classical (let alone uniform) law of large numbers of random variables cannot be easily extended to such random functions. Adding to this difficulty, the discontinuity of results in the possible failure of the continuous convergence of the sample average functions. Back to the optimization problem in (2), a natural way to tackle the nondifferentiable objective function seems to be the smoothing approach. Xu and Zhang [58] show that the stationary point of the smoothed problem converges to a so-called weak (Clarke) stationary point of the original expectation problem. This is a very nice theoretical result. However, the Lipschitz constant of the gradient of the smoothed problem goes to infinity as the smoothing parameter goes to zero. This fact makes it difficult for the smoothed version of (2) to be solved efficiently by either gradient-type or Newton-type methods, thus weakening the practical significance of the mentioned convergence result.

There is an increasing literature that are focused on studying the convergence of a particular algorithm for nonconvex M-estimation problems with the guarantee of statistical accuracy. For example, relying on the restricted strong convexity, the references [30, 31, 32] show that gradient decent method with a proper initialization converges to the statistical “truth” for different regression models with nonconvex objective functions. Adding to these references, the paper [35] recently establishes a one-to-one correspondence of stationary solutions of non-convex M-estimation problems by analyzing the landscape of the empirical problem. However, existing literature relies heavily on the smoothness of M-estimation problems and their special structure such as restricted strong convexity, which limit their applications on analyzing a broad class of modern statistical and machine learning problems, such as the examples in Section 2.

In this work, we are taking a first step to establish the consistency of the stationary point for a class of coupled nonconvex and nonsmooth empirical risk minimization problems. Our focus is placed on the asymptotic behavior of the directional stationary points of problem (2), which distinguish themselves as being the sharpest kind among all stationary solutions of such objectives, such as the Clarke stationarity that defined in (7). We consider a class of composite functions that covers a wide range of practical applications spanning modern statistical estimation and machine learning. For problems in this class, it has been shown in [10] that their empirical directional stationary points are computationally tractable by iteratively solving convex subprograms. Our results demonstrate that the additional efforts as required by the algorithm in the latter reference for computing the empirical directional stationary point of a sharp kind pay off not only at the empirical level, but also at the population level. It should be noted that our general analysis is independent of particular algorithms and thus is broadly applicable. Finally, we apply our developed theory to the noisy amplitude-based phase retrieval problem and show that every empirical directional stationary point, which can be computed by an algorithm described in [10], is -consistent to a global minimizer of the corresponding population problem. As our approach is algorithm-free, the analysis is different from much of the existing literature such as [34] that requires algorithm-based local search.

To summarize, the contributions of this paper are as follows:

we directly address the asymptotic convergence of the SAA stationary solutions for nonconvex nondifferentiable problems without Clarke regularity of the objective function, and establish results that are not linked to particular algorithms;

we establish the consistency and derive the convergence rate of empirical local minimizers to population local minimizers for a class of composite nonconvex, nonsmooth, and non-Clarke regular functions;

we apply our derived results to a topical problem to support the value of this kind of algorithm-free statistical analysis which can be validated by a rigorous algorithm if needed.

## 2 Problem Structures and Examples

Many practical statistical estimation and machine learning problems, even though with nonconvex and nondifferentiable objective functions, often have special structures. Supervised learning is a class of machine learning problems that infers a function to map inputs to the outputs , jointly defined on the probability space , where . The objective function of the supervised learning takes the form of

(3) |

where is a univariate loss function measuring the error between a possibly nonconvex nondifferentiable statistical model with the input feature and the output response . In fact, the above function can also be interpreted as an unsupervised learning model when the random variable is absent. In the notation of (1), the pair constitutes the random variable . At this juncture, we should clarify our convention of the probability triple projected onto the input and output spaces and , especially when we want to discuss about properties of the function which involves the input variable only. Letting be the natural projection of the Cartesian product onto , an arbitrary subset can be associated with its inverse image in under ; a statement such as “ has measure one” then means that , which is a subset of , has measure one. A similar meaning holds if is a subset of . In the rest of this paper, this convention is applied to almost sure events in the spaces and . We say that a subset (of either , , or ) is a probability-one set if its probability measure is one.

We are particularly interested in a class of difference-of-max-convex parametric model with the form of

(4) |

where each and are convex differentiable functions from to . This model is pervasive in the contemporary fields of data science. Below we list two such applications.

###### Example 2.1 (Piecewise affine regression).

Linear regression is perhaps the simplest parametric model to estimate the relationship between the response variable and the covariate information . Piecewise linear regression is a generalization of the classical linear regression to enhance the model flexibility. It is known that every piecewise affine function can be written in the form of

with the parameter [48]. Obviously, this piecewise affine model is a special case of the model (4). Taking the quadratic function as the loss measure to estimate the parameter , we obtain the following optimization problem

Notice that the overall objective function in the above optimization problem is nonconvex. More seriously, the nonconvexity and nondifferentiability within the square bracket are coupled. In the special case of the ReLu function, which is basically the plus function (see Example 2.2 below), it was shown in [25, Lemma 57 and below] the expected-value function is not differentiable at the point .

Alternatively, we may take the least absolute deviation as the loss function and consider the robust piecewise affine regression problem

which is again a nonconvex and nonsmooth stochastic optimization problem.

###### Example 2.2 (2-layer neural network model with the ReLu activation function).

Consider a 2-layer neural network model with the rectified linear unit (ReLU) activation function that takes the form of

(5) |

where consists of the two vectors and each in , the matrix , and scalar . The two occurrences of the max ReLu functions indicate the action of 2 hidden layers, where the “max” operation of and is taken componentwise. Variation of the model where only the first layer is subject to the ReLu activation and extensions to more than 2 layers can be similarly treated, although the latter leads to much more complicated formulations. No matter what loss function takes, the square loss, the cross entropy function or the huber loss, the overall objective of admits a coupled nonconvex and nonsmooth structure that is challenging to handle. Nevertheless, we show below that the function (5) can be expressed as the difference of two convex piecewise continuously differentiable functions, thus reducing it to a special case of the model (4). For notational simplicity, we omit the vector since it can be absorbed in as an extra column with redefined by . With this simplification, we derive

Although the terms are not differentiable, they can each be represented as the pointwise maximum of finitely many convex differentiable functions. In fact, we have, with denoting the -th row of the matrix ,

where is a finite set of binary indicators. Substituting the above expression into the function , we see that this function can be written in the form of (4) for some positive integers and and convex functions and that involve the squared plus function: for ; it is easy to check that the latter univariate function is convex, once but not twice continuously differentiable.

## 3 Concepts of Stationarity

Our primary focus in this paper is on the consistency of a sharp kind of stationary solutions of the M-estimation problem (2), which we term a directional stationary point. Let be a locally Lipschitz continuous function defined on an open set . The one-sided directional derivative of at the vector along the direction is defined as

if the limit exists; is said to be directionally differentiable at if exists for all . Recalling that the set is assumed convex, we say is a d(irectional)-stationary point of the program if

The d(irectional)-stationary point, in its dual form, satisfies

where is the indicator function of the set ; i.e., and

is the regular subdifferential of an extended-value function [44, Section 8.B]. A d-stationary point is in contrast to a C(larke)-stationary point [9] which by definition satisfies

where the Clarke subdifferential is:

Unlike the Clarke subdifferential which is outer semicontinuous [9, Proposition 2.1.5]; the regular subdifferential mapping is not “robust”. This is one source of difficulty for analyzing the consistency of the d-stationarity for problem (2) in its general form. Yet, as we will demonstrate in Section 3 via a practical example, analyzing the consistency of a C-stationary point could be meaningless as far as a (local) minimizer is concerned. For evaluation purposes, we note that

(6) |

In the context of (1) with a convex , is a C-stationary point if

(7) |

where, as in standard convex analysis, is the normal cone of at . Similarly, we say is a C-stationary point of (2) if

where the Clarke subdifferential is taken with respect to the variable . Notice that in general we have the inclusions

(8) |

where is taking as the Aumann integration (also called the selection expectation) [36, Definition 1.12], and

(9) |

When both of the functions and are Clarke regular, the above two inclusions become equality. The consistency of C-stationary points under Clarke regularity is established in [53].

## 4 The Composite Difference-max Estimation Problem

In the rest of this paper, we focus on the coupled nonconvex nonsmooth program (2) with the loss function given by the composite function (3) where is a nonnegative convex function and the model is given by (4). The nonnegativity condition of is satisfied by practically all the interesting applications in machine learning and statistical estimation. The special form of the statistical model can be exploited to characterize d-stationarity in terms of certain convex programs. Specifically, we consider the empirical risk minimization problem:

(10) |

where is given by (4), as a sample average approximation of the population model

(11) |

Before proceeding to the mathematical analysis, we should highlight the main technical challenges associated with the above problems. Foremost among these is a workable understanding and characterization of d-stationarity to facilitate the analysis. It turns out that such a characterization (see Lemma 4.3) is available that involves (a) linearizations of the functions and , and (b) the maximizing index sets of the functions and (see below), both varying randomly due to the variable . When embedded in the expectation, such random variations, especially the index sets over which the linearizations are to be chosen, are not easy to treat. Our approach is to employ a notion of stationarity (see Subsection 4.1) that on one hand is computationally tractable and on the other hand is not overly relaxed as Clarke stationarity, which as illustrated by the phase retrieval problem, can be practically meaningless. This constitutes the main contribution of our work.

Throughout, several assumptions will be imposed; the first of which is the following finite mean assumption: for every ,

For any and any nonnegative scalar , we consider the “-argmax” indices of the pointwise max functions and in (4) as elements of the following two sets:

respectively. If , the above sets reduce to the “argmax” indices of and , for which we omit the subscript and write them as

(12) |

Notice the if for all , then for all . A similar remark applies to the family of -functions. In general, the above-defined index sets have the inclusion property stated in the lemma below wherein denotes the (closed) Euclidean ball with center at and radius .

###### Lemma 4.1.

Suppose that there exist positive constants , and and a probability-one subset of such that for all , , and for all and in ,

(13) |

Then, for every scalar , a scalar exists such that for all , all , and all pairs and in satisfying , it holds that and .

###### Proof.

In what follows, the random realization is restricted to be in the set . For any index , we have

similarly,

Thus, for , since , we deduce, for any positive and provided that ,

Hence ; thus . Similarly, we can establish the same inclusion for . ∎

### 4.1 Composite -strong d-stationarity

To facilitate the consistency analysis in the next section, we need to introduce a restriction of d-stationarity for the empirical problem (2) known as -strong d-stationarity that corresponds to a given scalar . The latter restricted concept of stationarity is more stable at the nondifferentiable points of the empirical risk objective.

Given convex functions and on and a convex set , one can equivalently define to be a d-stationary point of the difference-of-convex programming

(15) |

if for all satisfying ,

for an constant ; see, for example, [39, Proposition 5]. In a recent paper [33], the authors introduce a concept called -strong d-stationary solution, which pertains to a point satisfying the above inequality for all such that . Since our problem (10) does not have the dc decomposition as in (15) due to the composition of a convex function and a difference-of-convex function , we are led to the extended -strong d-stationarity concept that is the subject of this subsection.

We start from the following lemma that allows us to characterize a d-stationary point of (10) as an optimal solution of a (nonconvex) optimization problem; see Lemma 4.3.

###### Lemma 4.2.

([10, Lemma 3]) Any univariate convex function can be represented as the sum of a convex non-decreasing function and a convex non-increasing function. Moreover, if the given function is Lipschitz continuous, then so are the two decomposed functions with the same Lipschitz constant.

Applying the above lemma to the function , it follows that there exist a univariate convex non-decreasing function and a univariate convex non-increasing function , both of which are easy to construct from , such that the convex loss function in (3) can be decomposed as

Moreover, if is Lipschitz continuous (see Assumption 4.1(b)), then so are with the same Lipschitz constant. Based on the above decomposition of the latter function, we introduce the following notation for any given , , and in and a nonnegative scalar :

We further denote

(16) |

and their corresponding sum as

(17) |

where we assume all the expectations are finite. When , we will write , and for , and , respectively. Notice that and for all . Furthermore, for a piecewise affine given by (4) where each and are affine as in the piecewise affine regression problem, we have

so that and for all and in . Some of the technical challenges mentioned before in the analysis of the problems (11) and (10) are embodied in the expect-value function and its sampled approximation , which are the main conduits employed in the analysis. Namely, the index sets are varying with the random realization that affects the pointwise maximum selection of the linearizations of and ; upon taking expectations of the random functionals , the behavior of is difficult to pinpoint, which relies on a good understanding of the variations of these random index sets; see Lemmas 4.1 and 4.6.

The following lemma provides a key characterization of a d-stationary point of problem (10). Specifically, (18) characterizes such a point as an optimal solution of a (nonconvex) minimization problem defined by the given point, which is equivalent to finitely many convex programs (20) as demonstrated in the proof.

###### Lemma 4.3.

###### Proof.

It is known from [10, Lemma 5] that is d-stationary for problem (10) if and only if solves the problem

(20) |

Therefore, if the condition (18) holds, then for any and any pair satisfying the above inclusion,

showing that is a d-stationary point for problem (10). Conversely, if is a d-stationary point, then for all ,

which yields