Difference-of-convex (DC) functions cover a broad family of non-convex, possibly non-smooth and non-differentiable, functions and have wide applications in machine learning and statistics. Although deterministic algorithms for DC functions have been extensively studied, stochastic optimization, which is more suitable for learning with big data, remains under-explored. In this paper, we propose new stochastic optimization algorithms and study their first-order convergence theories for solving a broad family of DC functions. We improve the existing algorithms and theories of stochastic optimization for DC functions from both practical and theoretical perspectives. On the practical side, our algorithm is more user-friendly, requiring no large mini-batch size, and more efficient, saving unnecessary computations. On the theoretical side, our convergence analysis does not necessarily require the involved functions to be smooth with a Lipschitz continuous gradient. Instead, the convergence rate of the proposed stochastic algorithm is automatically adaptive to the Hölder continuity of the gradient of one component function. Moreover, we extend the proposed stochastic algorithms for DC functions to solve problems with a general non-convex non-differentiable regularizer, which does not necessarily have a DC decomposition but enjoys an efficient proximal mapping. To the best of our knowledge, this is the first work that gives a non-asymptotic convergence guarantee for solving non-convex optimization problems whose objective has a general non-convex non-differentiable regularizer.
Stochastic Optimization for DC Functions and Non-smooth Non-convex Regularizers with Non-asymptotic Convergence
First version: November 28, 2018
In this paper, we consider a family of non-convex non-smooth optimization problems that can be written in the following form:
where and are real-valued lower-semicontinuous convex functions, and is a proper lower-semicontinuous function. We include the component in order to capture non-differentiable functions that usually play the role of regularization, e.g., the indicator function of a convex set, which is zero on the set and infinity otherwise, and a non-differentiable regularizer such as the convex norm or the non-convex norm and norm with . We do not necessarily impose a smoothness condition on or , or a convexity condition on .
A special class of the problem (1) is the one with being a convex function - also known as difference-of-convex (DC) functions. We would like to mention that even the family of DC functions is broad enough to cover many interesting non-convex problems that are well studied, including the additive composition of a smooth non-convex function and a non-smooth convex function, weakly convex functions, etc. We postpone this discussion to Section 2, after we formally introduce the definitions of smooth functions and weakly convex functions.
In the literature, deterministic algorithms for DC problems have been studied extensively since their introduction by Pham Dinh Tao in 1985 and continuously receive attention from the community (Khamaru and Wainwright, 2018; Wen et al., 2018). Please refer to (Thi and Dinh, 2018) for a survey on this subject. Although stochastic optimization (SO) algorithms for the special cases of DC functions mentioned above (smooth non-convex functions, weakly convex functions) have been well studied recently (Davis and Grimmer, 2017; Davis and Drusvyatskiy, 2018b, a; Drusvyatskiy and Paquette, 2018; Chen et al., 2018c; Lan and Yang, 2018; Allen-Zhu, 2017; Chen and Yang, 2018; Allen-Zhu and Hazan, 2016; Reddi et al., 2016b, a; Zhang and He, 2018), a comprehensive study of SO algorithms with broader applicability to DC functions and to the problem (1) with a non-smooth non-convex regularizer still remains rare. The papers by Nitanda and Suzuki (2017) and Thi et al. (2017) are the most related works dedicated to the stochastic optimization of special DC functions. Thi et al. (2017) considered a special class of DC problems whose objective function consists of a large sum of non-convex smooth functions and a regularization term that can be written as a DC function. They reformulated the problem into (1) such that is a sum of convex functions, is a quadratic function, and is the first component of the DC decomposition of the regularizer. Regarding algorithm and convergence, they proposed a stochastic variant of the classical DCA (Difference-of-Convex Algorithm) and established an asymptotic convergence result for finding a critical point. To our knowledge, the paper by Nitanda and Suzuki (2017) is probably the first to give non-asymptotic convergence for finding an approximate critical point of a special class of DC problems, in which both and can be stochastic functions and .
Their algorithm consists of multiple stages of solving a convex objective that is constructed by linearizing and adding a quadratic regularization. However, their algorithm and convergence theory have the following drawbacks. First, at each stage, they need to compute an unbiased stochastic gradient denoted by of such that , where is the accuracy level imposed on the returned solution in terms of the gradient's norm. In reality, one has to resort to a mini-batching technique that uses a large number of samples to ensure this condition, which is impractical and not user-friendly: a user has to worry about the size of the mini-batch needed to find a sufficiently accurate solution while keeping the computational costs minimal. Second, for each constructed convex subproblem, their theory requires running a stochastic algorithm that solves each subproblem to the accuracy level of , which could waste a lot of computation at earlier stages. Third, their convergence analysis requires that and is a smooth function with a Lipschitz continuous gradient.
Our Contributions - I.
In Section 3, we propose new stochastic optimization algorithms and establish their convergence results for solving the DC class of the problem (1), improving the algorithm and theory in Nitanda and Suzuki (2017) from several perspectives. It is our intention to address the aforementioned drawbacks of their algorithm and theory. In particular, (i) our algorithm only requires unbiased stochastic (sub)-gradients of and without a requirement on the small variance of the used stochastic (sub)-gradients; (ii) we do not need to solve each constructed subproblem to the accuracy level of . Instead, we allow the accuracy for solving each constructed subproblem to grow slowly without sacrificing the overall convergence rate; (iii) we improve the convergence theory significantly. First, our convergence analysis does not require to be smooth with a Lipschitz continuous gradient. Instead, we only require either or to be differentiable with a Hölder continuous gradient; under the former condition can be a non-smooth non-differentiable function, and under the latter condition and can be non-smooth non-differentiable functions. Second, the convergence rate is automatically adaptive to the Hölder continuity of the involved function without requiring knowledge of the Hölder continuity to run the algorithm. Third, when an adaptive stochastic gradient method is employed to solve each subproblem, we establish an adaptive convergence result similar to the existing theory of AdaGrad for convex problems (Duchi et al., 2011; Chen et al., 2018a) and weakly convex problems (Chen et al., 2018c).
Our Contributions - II.
Moreover, in Section 4 we extend our algorithm and theory to the more general class of non-convex non-smooth problems (1), in which is a general non-convex non-differentiable regularizer that enjoys an efficient proximal mapping. Although this kind of non-smooth non-convex regularization has been considered in the literature (Attouch et al., 2013; Bolte et al., 2014; Bot et al., 2016; Li and Lin, 2015; Yu et al., 2015; Yang, 2018; Liu et al., 2018; An and Nam, 2017; Zhong and Kwok, 2014), existing results are restricted to deterministic optimization and asymptotic or local convergence analysis. In addition, most of them consider a special case of our problem with being a smooth non-convex function. To the best of our knowledge, this is the first work on stochastic optimization with a non-asymptotic first-order convergence result for tackling the non-convex objective (1) with a non-convex non-differentiable regularizer, a smooth function , and a possibly non-smooth function with a Hölder continuous gradient. Our algorithm and theory are based on using the Moreau envelope of , which can be written as a DC function; this reduces the problem to the one studied in Section 3. By using the algorithms and convergence results established in Section 3 and carefully controlling the approximation parameter, we establish the first non-asymptotic convergence of stochastic optimization for solving the original non-convex problem with a non-convex non-differentiable regularizer. This non-asymptotic convergence result can also be easily extended to deterministic optimization, which itself is novel and could be interesting to a broader community. A summary of our results is presented in Table 1.
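The reduction via the Moreau envelope rests on the fact that the envelope of a proper lower-semicontinuous function is itself a DC function. A standard way to see this (writing $r$ for the regularizer and $\mu > 0$ for the smoothing parameter, symbols chosen here for illustration, assuming $r$ is bounded below so the envelope is finite):

```latex
r_\mu(x) \;=\; \min_{y}\left\{ r(y) + \frac{1}{2\mu}\|x-y\|_2^2 \right\}
       \;=\; \frac{1}{2\mu}\|x\|_2^2
       \;-\; \sup_{y}\left\{ \frac{1}{\mu}\langle x, y\rangle
             - \frac{1}{2\mu}\|y\|_2^2 - r(y) \right\}.
```

The second term is convex in $x$ even when $r$ is non-convex, since it is a pointwise supremum of affine functions of $x$; hence $r_\mu$ admits an explicit DC decomposition.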
| | | | Algorithms for subproblems | Complexity |
| --- | --- | --- | --- | --- |
| HC | - | CX, HC | SPG, AdaGrad | |
| SM | HC | NC, NS, LP | SPG | |
| SM | HC | NC, NS, FV, LB | SPG | |
| SM | HC | NC, NS, LP | SVRG, AG | |
| SM | HC | NC, NS, FV, LB | SVRG, AG | |
| SM | HC | NC, NS, FVC | SVRG, AG | |
In this section, we present some preliminaries. Let denote the standard -norm with . For a non-convex function , let denote the Fréchet subgradient and denote the limiting subgradient, i.e.,
where the notation means that and . It is known that . If is differentiable at , then . Moreover, if is continuously differentiable on a neighborhood of , then . When is convex, the Fréchet and the limiting subgradients reduce to the subgradient in the sense of convex analysis: . For simplicity, we use to denote the Euclidean norm (a.k.a. the -norm) of a vector. Let denote the distance between two sets.
A function is smooth with an -Lipschitz continuous gradient if it is differentiable and the following inequality holds
A differentiable function has -Hölder continuous gradient if there exists such that
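For concreteness, the two smoothness notions above take the following standard forms (the symbols $f$, $L$, $H$, $\nu$ are generic placeholders, not notation fixed by the paper):

```latex
\text{($L$-smoothness):}\quad
  \|\nabla f(x) - \nabla f(y)\|_2 \le L\,\|x - y\|_2, \quad \forall\, x, y,
\qquad
\text{($\nu$-H\"older continuity):}\quad
  \|\nabla f(x) - \nabla f(y)\|_2 \le H\,\|x - y\|_2^{\nu}, \quad \nu \in (0, 1].
```

Setting $\nu = 1$ recovers Lipschitz continuity of the gradient, so Hölder continuity strictly generalizes smoothness; the adaptivity established in Section 3 is with respect to $\nu$.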
Next, let us characterize the critical points of the considered problem (1) that are standard in the literature (Hiriart-Urruty, 1985; Horst and Thoai, 1999; Thi and Dinh, 2018; An and Nam, 2017), and introduce the convergence measure for an iterative optimization algorithm. First, let us consider the DC problem:
where is a proper lower semicontinuous convex function and is convex. If is a local minimizer of , then . Any point that satisfies the condition is called a stationary point of (2), and any point such that is called a critical point of (2). If is further differentiable, the stationary points and the critical points coincide. For an iterative optimization algorithm, it is hard to find an exact critical point in a finite number of iterations. Therefore, one is usually concerned with finding an -critical point that satisfies
Similarly, we can extend the above definitions of stationary and critical points to the general problem (1) with being a proper and lower semi-continuous (possibly non-convex) function (An and Nam, 2017). In particular, is called a stationary point of the considered problem (1) if it satisfies , and any point such that is called a critical point of (1). When is differentiable, (Rockafellar and Wets, 1998)[Exercise 8.8], and when both and are convex and their domains cannot be separated (Rockafellar and Wets, 1998)[Corollary 10.9]. An -critical point of (1) is a point that satisfies . It is notable that when is non-differentiable, finding an -critical point could become a challenging task for an iterative algorithm even under the condition that is a convex function. Let us consider the example of . As long as , we have . To address this challenge when is non-differentiable, we introduce the notion of nearly -critical points. In particular, a point is called a nearly -critical point of the problem (1) if there exists such that
A similar notion of nearly critical points for non-smooth non-convex optimization problems has been utilized in several recent works (Davis and Grimmer, 2017; Davis and Drusvyatskiy, 2018b, a; Chen et al., 2018c).
Examples and Applications of DC functions.
Before ending this section, we present some examples of DC functions and their applications in machine learning and statistics.
Example 1: Additive composition of a smooth loss function and a non-smooth convex regularizer. Let us consider
where is a convex function and is an -smooth function. For an -smooth function, it is clear that is a convex function. Therefore, the above objective function can be written as - a DC function.
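Writing the objective of Example 1 as $\ell(x) + r(x)$ with $\ell$ an $L$-smooth (possibly non-convex) loss and $r$ convex (symbols chosen here for illustration), one explicit DC decomposition is:

```latex
\ell(x) + r(x) \;=\;
  \underbrace{\Big[\, \ell(x) + r(x) + \tfrac{L}{2}\|x\|_2^2 \,\Big]}_{\text{convex}}
  \;-\; \underbrace{\tfrac{L}{2}\|x\|_2^2}_{\text{convex}},
```

where the first bracket is convex because $\ell + \tfrac{L}{2}\|\cdot\|_2^2$ is convex whenever $\ell$ is $L$-smooth.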
Example 2: Weakly convex functions. Weakly convex functions have been recently studied in numerous papers (Davis and Grimmer, 2017; Davis and Drusvyatskiy, 2018b, a; Chen et al., 2018c; Zhang and He, 2018). A function is called -weakly convex if is a convex function. More generally, is called -relative convex with respect to a strongly convex function if is convex (Zhang and He, 2018). It is obvious that a weakly convex function is a DC function. Examples of weakly convex functions can be found in deep neural networks with a smooth activation function and a non-smooth loss function (Chen et al., 2018c), robust learning (Xu et al., 2018), robust phase retrieval (Davis and Drusvyatskiy, 2018a), etc.
Example 3: Non-Convex Sparsity-Promoting Regularizers. Many non-convex sparsity-promoting regularizers in statistics can be written as a DC function, including the log-sum penalty (LSP) (Candès et al., 2008), the minimax concave penalty (MCP) (Zhang, 2010a), the smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001), the capped penalty (Zhang, 2010b), and the transformed norm (Zhang and Xin, 2014). For the detailed DC decompositions of these regularizers, please refer to (Wen et al., 2018; Gong et al., 2013). It is notable that for LSP, MCP and SCAD, the second function in their DC decomposition can be a smooth function. In particular, if one considers regression or classification with the LSP, MCP, SCAD or transformed norm regularizer, the problem is a special case of (1) with being a convex function and being a smooth convex function. Here we give one example by considering learning with MCP as a regularizer, where the problem is
where denote a set of data points (feature vector and label pairs), is a convex loss function with respect to its first argument, is a constant and is a regularization parameter. We can write as a difference of two convex functions
where is continuously differentiable with an -Lipschitz continuous gradient .
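As a numerical sanity check on this decomposition, the following scalar sketch (our own implementation; the names `mcp`, `mcp_dc` and the parameters `lam`, `theta` are illustrative, not notation from the paper) verifies that the MCP penalty equals the difference of its two convex components:

```python
def mcp(w, lam, theta):
    """MCP penalty value at scalar w: lam*|w| - w^2/(2*theta) on
    |w| <= theta*lam, and the constant theta*lam^2/2 beyond it."""
    if abs(w) <= theta * lam:
        return lam * abs(w) - w * w / (2 * theta)
    return theta * lam * lam / 2

def mcp_dc(w, lam, theta):
    """DC decomposition r = r1 - r2: r1 is convex (scaled absolute value),
    r2 is convex and continuously differentiable."""
    r1 = lam * abs(w)
    if abs(w) <= theta * lam:
        r2 = w * w / (2 * theta)
    else:
        r2 = lam * abs(w) - theta * lam * lam / 2
    return r1, r2
```

The derivative of the second component is `w/theta` on the interval and `lam*sign(w)` outside, and the two match at the boundary `|w| = theta*lam`, so the second component is continuously differentiable with a `(1/theta)`-Lipschitz gradient, consistent with the statement above.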
Example 4: Least-squares Regression with Regularization. Recently, a non-convex regularization in the form of was proposed for least-squares regression or compressive sensing (Yin et al., 2015), which is naturally a DC function.
Example 5: Positive-Unlabeled (PU) Learning. A standard learning task is to find a model denoted by that minimizes the expected risk based on a convex surrogate loss , i.e.,
where denotes the feature vector of a random data point and denotes its corresponding label. In practice one observes a finite set of i.i.d. training data , which leads to the well-known empirical risk minimization (ERM) problem, i.e., . However, if only positive data are observed, ERM becomes problematic. A remedy to address this challenge is to use unlabeled data for computing an unbiased estimate of . In particular, the objective in the following problem is an unbiased risk (Kiryo et al., 2017):
where is a set of unlabeled data, and is the prior probability of the positive class. It is obvious that if is a convex loss function in terms of , the above objective function is a DC function. In practice, an estimate of is used.
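A minimal sketch of the empirical unbiased PU risk for a linear model with the logistic loss (a du Plessis-style estimator, as discussed by Kiryo et al., 2017; the function names and data layout here are our own assumptions, not the paper's notation):

```python
import math

def logistic_loss(z, y):
    """Convex surrogate loss log(1 + exp(-y*z)) for label y in {+1, -1}."""
    return math.log1p(math.exp(-y * z))

def pu_risk(w, pos, unl, pi):
    """Empirical unbiased PU risk for a linear model f(x) = <w, x>:
    pi * R_p^+  +  R_u^-  -  pi * R_p^-, where the subtracted term makes
    the objective a DC function when the loss is convex in w."""
    f = lambda x: sum(wi * xi for wi, xi in zip(w, x))
    r_p_pos = sum(logistic_loss(f(x), +1) for x in pos) / len(pos)
    r_u_neg = sum(logistic_loss(f(x), -1) for x in unl) / len(unl)
    r_p_neg = sum(logistic_loss(f(x), -1) for x in pos) / len(pos)
    return pi * r_p_pos + r_u_neg - pi * r_p_neg
```

With a convex loss, the first two terms are convex in `w` and the subtracted term is convex, so the empirical objective is a DC function, matching the discussion above.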
Examples of Non-Convex Non-Smooth Regularizers.
Finally, we present some examples of non-convex non-smooth regularizers that cannot be written as a DC function or whose DC decomposition is unknown. Thus, the algorithms and theories presented in Section 3 are not directly applicable, but the algorithms discussed in Section 4 are applicable when the proximal mapping of each component of is efficient to compute. Examples include the norm (i.e., the number of non-zero elements of a vector) and the norm regularization for (i.e., ), whose proximal mappings can be efficiently computed (Attouch et al., 2013; Bolte et al., 2014). For another example, let us consider a penalization approach to tackling non-convex constraints. Consider a non-convex optimization problem with a domain constraint , where is a non-convex set. Directly handling a non-convex constrained problem could be difficult. An alternative solution is to convert the domain constraint into a penalization with in the objective, where denotes the projection of a point onto the set . Note that when is a non-convex set, is a non-convex non-smooth function in general, and its proximal mapping enjoys a closed-form solution (Li and Pong, 2016).
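For instance, the proximal mapping of the cardinality (number-of-non-zeros) regularizer is the elementwise hard-thresholding operator, which a proximal method can evaluate in closed form. A sketch (our own naming; ties at the threshold may be resolved either way):

```python
import math

def prox_l0(v, lam):
    """Elementwise minimizer of lam*||x||_0 + 0.5*||x - v||^2
    (hard thresholding; one of the minimizers is returned at ties).
    A coordinate is kept when the cost lam of a non-zero entry is
    smaller than the cost v_i^2/2 of zeroing it out."""
    t = math.sqrt(2 * lam)
    return [vi if abs(vi) > t else 0.0 for vi in v]
```

This is the closed-form solution referenced above (cf. Attouch et al., 2013; Bolte et al., 2014); similar coordinate-wise formulas exist for the other non-convex regularizers discussed.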
As a final remark, it is worth mentioning that even if can be written as a DC function such that the two components in its DC decomposition are both non-smooth and non-differentiable (e.g., regularization, the capped norm ), the theory presented in Section 4 can still be useful for deriving a non-asymptotic first-order convergence result in terms of finding a close critical point, while the theory in Section 3 is not directly applicable.
3 New Stochastic Algorithms for DC Functions
In this section, we present new stochastic algorithms for solving the problem (1) when is a convex function, together with their convergence results. We assume both and have a large number of components such that computing a stochastic gradient is much more efficient than computing a deterministic gradient. Without loss of generality, we assume and , and consider the following problem:
where are real-valued lower-semicontinuous convex functions and is a proper lower-semicontinuous convex function. It is notable that a special case of this problem is the finite-sum form:
which allows us to develop faster algorithms for smooth functions by using variance reduction techniques.
Since we do not necessarily impose any smoothness assumptions on and , we postpone the particular assumptions on these functions to the statements of the later theorems. For all algorithms presented below, we assume that the proximal mapping of can be efficiently computed, i.e., the solution to the following problem can be easily computed for any :
A basic assumption that will be used in the analysis is the following.
For a given initial solution , assume that there exists such that .
The basic idea of the proposed algorithm is similar to the stochastic algorithm proposed in (Nitanda and Suzuki, 2017). The algorithm consists of multiple stages of solving convex problems. At the -th stage (), given a point , a convex majorant function is constructed as follows such that and :
where is a constant parameter. Then a stochastic algorithm is employed to optimize the convex majorant function. The key difference from the previous work lies at how to solve each convex majorant function. An important change introduced to our design is to make the proposed algorithms more efficient and more practical. Roughly speaking, we only require solving each function up to an accuracy level of for some constant , i.e., finding a solution such that
In contrast, the algorithm and analysis presented in (Nitanda and Suzuki, 2017) require solving each convex problem up to an accuracy level of , which is the accuracy level expected of the final solution. This change not only makes our algorithms more efficient, by saving a lot of unnecessary computation, but also more practical, without requiring in order to run the algorithm.
We present a meta algorithm in Algorithm 1, in which refers to an appropriate stochastic algorithm for solving each convex majorant function. Step 4 means that is employed for finding such that (9) is satisfied (or a more fine-grained condition is satisfied for a particular algorithm, as discussed later), where denotes the algorithm-dependent parameters (e.g., the number of iterations). There are three issues that deserve further discussion in order to fully understand the proposed algorithm. First, how many outer iterations are needed to ensure finding a (nearly) -stationary point of the original problem under the condition that (9) is satisfied for each subproblem. Second, how to ensure that the condition (9) is satisfied for a stochastic algorithm. Third, what the overall complexity (iteration complexity or gradient complexity) is, taking into account the complexity of the stochastic algorithm for solving each convex majorant function. Note that the last two issues are closely tied to the particular algorithm employed. We emphasize that the last two issues are important not only in theory but also in practice: related factors, such as how to initialize the algorithm , how to set the step size, and how many iterations suffice for each call of , have a dramatic effect on the practical performance. Next, we first present a general convergence analysis of Algorithm 1 under the condition that (9) is satisfied for each subproblem. Then, we present several representative stochastic algorithms for solving each convex majorant function and derive their overall iteration complexities.
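To make the multi-stage scheme concrete, here is a minimal toy sketch on a one-dimensional DC problem. It is not the paper's Algorithm 1: the function names, the step-size rule, and the growing inner-iteration schedule `inner0 * t` are our own illustrative choices under the assumptions above (strongly convex majorants, inner accuracy allowed to tighten slowly across stages rather than being fixed at the final accuracy).

```python
import random

def prox_dc_sgd(grad_g_stoch, subgrad_h, x0, gamma=2.0, stages=50,
                inner0=20, eta0=0.5, seed=0):
    """Multi-stage inexact proximal DC sketch for scalar x.
    Stage t builds the convex majorant
        phi_t(y) = g(y) - u_t * y + (gamma/2) * (y - x_t)^2,  u_t in dh(x_t),
    and runs plain SGD on it; the inner budget grows with t so the
    subproblem accuracy tightens slowly across stages."""
    rng = random.Random(seed)
    x = x0
    for t in range(1, stages + 1):
        u = subgrad_h(x)                 # linearize the concave part at x
        center, y = x, x
        for k in range(1, inner0 * t + 1):
            gk = grad_g_stoch(y, rng) - u + gamma * (y - center)
            y -= (eta0 / k) * gk         # 1/k steps: phi_t is strongly convex
        x = y
    return x

# toy DC objective f(x) = g(x) - h(x) = x^2 - 2|x|, minimized at |x| = 1
g_grad = lambda x, rng: 2 * x + rng.gauss(0, 0.1)   # noisy gradient of g
h_sub = lambda x: 2.0 if x > 0 else (-2.0 if x < 0 else 0.0)
```

Running `prox_dc_sgd(g_grad, h_sub, 3.0)` drives the iterate toward a critical point with `|x| = 1`; each stage only needs to solve its strongly convex majorant inexactly, mirroring the design discussed above.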
Our convergence analysis also has its merits compared with the previous work (Nitanda and Suzuki, 2017). We will divide our convergence analysis into three parts. First, in subsection 3.1 we introduce a general convergence measure without requiring any smoothness assumptions of involved functions and conduct a convergence analysis of the proposed algorithm. Second, we analyze different stochastic algorithms and their convergence results in subsection 3.2, including an adaptive convergence result for using AdaGrad. Finally, we discuss the implications of these convergence results for solving the original problem in terms of finding a (nearly) -stationary point in subsection 3.3.
3.1 A General Convergence Result
For any , define
It is notable that is well defined since the above problem is strongly convex. The following proposition shows that when , then is a critical point of the original problem.
If , then is a critical point of the problem .
Proof According to the first-order optimality condition, we have
Since , we have
which implies that is a critical point of the original minimization problem.
The above proposition implies that can serve as a measure of convergence of an algorithm for solving the considered minimization problem. In subsection 3.3, we will discuss how convergence in terms of implies the standard convergence measure in terms of the (sub)gradient norm of the original problem. The following theorems are the main results of this subsection.
Remark: It is clear that when , in expectation, implying the convergence to a critical point. Note that the factor will lead to an iteration complexity of for using stochastic (sub)gradient method, which is slightly worse than that presented in (Nitanda and Suzuki, 2017) by a logarithmic factor. Nevertheless, such a logarithmic factor can be removed by exploiting non-uniform sampling under a slightly stronger condition of the problem.
Suppose there exists a stochastic algorithm that, when applied to , can find a solution satisfying (9), and there exists such that for all ; then with a total of stages we have
where is sampled according to probabilities with .
Remark: Compared to Theorem 1, the condition for all is slightly stronger than Assumption 1. However, it can be easily satisfied if resides in a bounded set (e.g., when is the indicator function of a bounded set), or if is non-increasing (e.g., when using variance-reduction methods for the case that is smooth).
By the assumption of (9), we have . By the strong convexity of , we have . Thus we have
Rearranging the terms, we have
where the last inequality follows from the convexity of . Multiplying both sides by and taking the summation over , we have
The second term on the right-hand side of the above inequality can be easily bounded using simple calculus. For the first term, we use a similar analysis to that in the proof of Theorem 1 in (Chen et al., 2018c):
where we use . Taking expectation on both sides, we have
Then, we have
3.2 Convergence Results of Different Stochastic Algorithms
In this section, we will present the convergence results of Algorithm 1 for employing different stochastic algorithms for minimizing at each stage. In particular, we consider three representative algorithms, namely stochastic proximal subgradient (SPG) method (Duchi et al., 2010; Zhao and Zhang, 2015), adaptive stochastic gradient (AdaGrad) method (Duchi et al., 2011; Chen et al., 2018a), and proximal stochastic gradient method with variance reduction (SVRG) (Xiao and Zhang, 2014). SPG is a simple stochastic method, AdaGrad allows us to derive adaptive convergence to the history of learning, and SVRG allows us to leverage the finite-sum structure and the smoothness of the problem to improve the convergence rate.
Stochastic Proximal Subgradient Method.
We make the following additional assumptions about the problem for developing SPG.
Assume one of the following conditions hold:
is -smooth and there exists such that , where denotes a subgradient such that .
there exists such that , for , and either for a closed convex set or for .
Remark: The first assumption is typically used in the analysis of stochastic gradient method when the involved function is smooth (Zhao and Zhang, 2015), and the second assumption is typically used when the involved function is non-smooth (Duchi et al., 2010). Note that the condition is to capture the indicator function of a convex set. When is the indicator function of a convex set , we have and corresponds to the normal cone of , implying .
Denote by . We present the SPG algorithm in Algorithm 2 with two options to handle the smooth and non-smooth cases separately. The constraint at Step 5 is added to accommodate the proximal mapping of when is non-smooth. When the subgradient of is used instead of the proximal mapping of in the update, or is the indicator function of a bounded convex set, the constraint can be removed.
We present a proof of the above Proposition in the Appendix, which mostly follows existing analyses of SPG and related algorithms. By applying the above results (e.g., the second result in Proposition 2) to the -th stage, we have
where denotes the number of iterations used by SPG for the -th stage. One might directly use the above result to argue that the condition (9) holds by assuming that is bounded, which is true in the non-smooth case due to the domain constraint in the update. In the smooth case, the upper bound is not directly available for setting such that the condition (9) holds. Fortunately, when we apply the above result in the convergence analysis of Algorithm 1, we can utilize the strong convexity of to cancel the term by setting to be larger than a constant.
Let us summarize the convergence of Algorithm 1 when using SPG to solve each subproblem.
Suppose Assumption 2 (i) holds and Algorithm 2 is employed for solving with the parameters given in Proposition 2, with and with , and there exists such that for all ; then with a total of stages Algorithm 1 guarantees
Similarly, suppose Assumption 2 (ii) holds and Algorithm 2 is employed for solving with the parameters given in Proposition 2, with with , and there exists such that for all ; then with a total of stages Algorithm 1 guarantees
where is sampled according to probabilities with .
Remark: Let us consider the iteration complexity of using SPG for finding a solution that satisfies . For the non-smooth case, by setting and , we need a total number of stages and total iteration complexity . For the smooth case, by setting we have and total iteration complexity . One can also derive similar results for using the uniform sampling under Assumption 1, which are worse by a logarithmic factor.
AdaGrad (Duchi et al., 2011) is an important algorithm in the stochastic optimization literature, which uses an adaptive step size for each coordinate. It has the potential benefit of speeding up convergence when the cumulative growth of the stochastic gradients is slow. Next, we show that AdaGrad can be leveraged to solve each convex majorant function and yields adaptive convergence for the original problem. Similar to previous analyses of AdaGrad (Duchi et al., 2011; Chen et al., 2018a), we make the following assumption.
For any , there exists such that and , either or for .