Stochastic Gradient Descent for Stochastic Doubly-Nonconvex Composite Optimization
Abstract
Stochastic gradient descent has been widely used for solving composite optimization problems in big data analyses, and many algorithms and convergence properties have been developed. Early work treated convex composite functions; nonconvex composite functions have gradually been adopted to obtain more desirable properties. Convergence properties have been investigated, but only when at most one of the composite functions is nonconvex. There is no convergence property when both composite functions are nonconvex, which we name the doubly-nonconvex case. To overcome this difficulty, we assume a simple and weak condition, that the penalty function is quasiconvex, and then obtain convergence properties for the stochastic doubly-nonconvex composite optimization problem. The convergence rate obtained here is of the same order as in existing work. We analyze the convergence rate in depth for constant step size and minibatch size and give the optimal convergence rate with appropriate sizes, which is superior to the existing work. Experimental results illustrate that our method is superior to existing methods.
Takayuki Kawashima Department of Statistical Science, The Graduate University for Advanced Studies, Tokyo tkawa@ism.ac.jp Hironori Fujisawa The Institute of Statistical Mathematics, Tokyo fujisawa@ism.ac.jp
Preprint. Work in progress.
1 Introduction
Many optimization problems in machine learning can be written as the following composite optimization problem:

(1) $\min_{x \in \mathbb{R}^d} \Psi(x) := f(x) + h(x),$

where $f$ is a loss function and $h$ is a regularization (penalty) function.
Typically, $f$ is a convex loss function (e.g., the least squares loss) and $h$ is a convex and possibly nonsmooth regularization function (e.g., the $\ell_1$ penalty (tibshirani96regression, )). However, it is known that the resulting estimate has a bias. To overcome this problem, nonconvex regularizations such as SCAD (scad, ) and MCP (mcp, ) are becoming popular. Nonconvex loss functions are also becoming popular. One of the most successful nonconvex machine learning applications is deep learning (hinton2006reducing, ; lecun2015deep, ; goodfellow2016deep, ). For matrix completion, both loss and regularization functions have been extended to nonconvex cases (lu2014generalized, ; sun2016guaranteed, ). In robust statistics, it is known that nonconvex loss functions can show more desirable robustness properties than convex loss functions (rousseeuw2005robust, ).
Many algorithms have been developed for the composite optimization problem. To dramatically reduce the computational cost in big data analyses, stochastic gradient descent (SGD) and its variants have been adopted (xiao2010dual, ; duchi2010composite, ; defazio2014finito, ; NIPS2014_5258, ; AllenZhu:2017:KFD:3055399.3055448, ). Convergence properties of SGD have been intensively studied, but only when $f$ is nonconvex and $h$ is convex. There is no theoretical property when both composite functions are nonconvex, which we name the doubly-nonconvex case. To overcome this difficulty, we assume that $h$ is quasiconvex in addition to commonly used conditions, and then we obtain convergence properties for the stochastic doubly-nonconvex composite optimization problem.
Here, we note that most of the existing works focus on the empirical loss $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ (defazio2014finito, ; NIPS2014_5258, ; AllenZhu:2017:KFD:3055399.3055448, ). This case is called the finite-sum stochastic composite optimization problem. On the other hand, the purely stochastic composite optimization problem focuses on the expected loss $f(x) = \mathbb{E}_{\xi}[F(x, \xi)]$. These two problems are quite different (doi:10.1137/070704277, ). This paper mainly focuses on the purely stochastic composite optimization problem, but we also have some results for the finite-sum setting, derived from the theoretical results obtained in the purely stochastic setting.
Related Work: An asymptotic convergence of SGD was proved in the seminal work (robbins1951, ). This work was extended to nonasymptotic convergence rates for stochastic convex composite optimization problems, including the case $h \equiv 0$ (NIPS2009_3689, ; moulines2011non, ; Agarwal2012InformationTheoreticLB, ; rakhlin2012making, ; Shamir:2013:SGD:3042817.3042827, ). In particular, (Lan2012, ; doi:10.1137/110848864, ; doi:10.1137/110848876, ) focused on the stochastic mirror descent method, which includes SGD as a special case, and succeeded in obtaining the nonasymptotic optimal convergence rate in a wider class of algorithms than SGD. However, most of these results hold only when $f$ is convex and $h$ is convex or $h \equiv 0$. Recently, SGD and its variants for nonconvex composite optimization problems have been intensively studied. For the finite-sum stochastic setting, variance reduction techniques were proposed and convergence rates were shown (johnson2013accelerating, ; allen2016variance, ; reddi2016stochastic, ). For the purely stochastic setting, (ghadimi2013stochastic, ) investigated theoretical properties when $f$ was nonconvex and $h \equiv 0$. They adopted a new output selection scheme, named random selection, and succeeded in giving a nonasymptotic convergence rate for the stochastic mirror descent method. (ghadimi2016mini, ; ghadimi2016accelerated, ) adopted the minibatch scheme and extended the previous work (ghadimi2013stochastic, ) to the composite case where $f$ was nonconvex and $h$ was convex. In particular, (ghadimi2016accelerated, ) obtained a faster rate than that of (ghadimi2016mini, ) by virtue of an acceleration technique. (wang2017stochastic, ) adopted a different minibatch scheme, named minibatch-prox, and succeeded in proving a convergence property.
Note that $h$ is convex or $h \equiv 0$ in the existing works. This paper considers the case where $h$ is nonconvex as well as $f$. Here we reconsider an advantage of the convexity of $h$: it enables us to regard an update rule as a projection onto a convex set derived from a sublevel set of $h$, even when $f$ is nonconvex and $h$ is nonsmooth. Therefore, instead of the convexity of $h$, we assume that $h$ is quasiconvex, which implies that the sublevel sets of $h$ are convex. This is a broad class and includes many nonconvex penalties. Under the quasiconvexity of $h$, we show theoretical properties of SGD for the stochastic doubly-nonconvex composite optimization problem.
Our Contribution:

We show that SGD converges for the stochastic doubly-nonconvex composite optimization problem under a simple and weak condition, quasiconvexity, and achieves the same convergence rate as the existing work (ghadimi2016mini, ) up to a constant factor. To the best of our knowledge, this is the first work proving the convergence of SGD for the stochastic doubly-nonconvex composite optimization problem.

Our problem formulation is the purely stochastic setting. However, our theoretical results can easily be applied to the finite-sum stochastic setting.

We analyze the convergence rate in depth for constant step size and minibatch size and give the optimal convergence rate with appropriate sizes, which is superior to the existing work (ghadimi2016mini, ).
2 Preliminary
2.1 Notations and Definitions
We present some notations used in this paper. Let $\langle \cdot, \cdot \rangle$ be the standard inner product on $\mathbb{R}^d$ and $\|\cdot\|$ be the Euclidean norm. For any $p \ge 1$, $\|\cdot\|_p$ denotes the $\ell_p$ norm. For any real number $a$, $\lfloor a \rfloor$ and $\lceil a \rceil$ denote the floor function and the ceiling function, respectively.
Definition 2.1.
(Lipschitz smooth) A function $f : \mathbb{R}^d \to \mathbb{R}$ is said to be $L$-Lipschitz smooth for some $L > 0$ if

(2) $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d.$

From (2), the following inequality can be derived:

(3) $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2 \quad \text{for all } x, y \in \mathbb{R}^d.$
Next, we define a quasiconvex function, which plays a key role in this paper.
Definition 2.2.
(Quasiconvex) A function $h : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is said to be quasiconvex if its sublevel set $\{x : h(x) \le \alpha\}$ is convex for any $\alpha \in \mathbb{R}$.
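As a concrete illustration that quasiconvexity admits standard nonconvex penalties (our own example, not from the paper), the following Python sketch checks numerically that the one-dimensional MCP penalty (mcp, ) is quasiconvex (every sublevel set on a grid is an interval) while failing the midpoint convexity inequality. The parameter names `lam` and `b` are our own.

```python
import numpy as np

def mcp(x, lam=1.0, b=2.0):
    """Standard MCP penalty (Zhang, 2010): nonconvex but quasiconvex on R."""
    ax = np.abs(x)
    return np.where(ax <= b * lam, lam * ax - x ** 2 / (2 * b), b * lam ** 2 / 2)

# A function on R is quasiconvex iff every sublevel set {x : p(x) <= alpha}
# is an interval. Check this numerically on a fine grid.
grid = np.linspace(-10, 10, 20001)
vals = mcp(grid)
for alpha in np.linspace(0.05, 1.0, 20):
    idx = np.where(vals <= alpha)[0]
    # Indices of a sublevel set must be contiguous if the set is an interval.
    assert idx.size == 0 or np.all(np.diff(idx) == 1)

# MCP is NOT convex: the midpoint convexity inequality fails for some points.
assert mcp(3.0) > (mcp(1.0) + mcp(5.0)) / 2
```

The same grid check applied to SCAD or the capped-$\ell_1$ penalty behaves identically, since all of these penalties are nondecreasing in $|x|$.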
2.2 Problem Formulation
In problem (1), we suppose the following assumptions:
Assumption 1. The objective function is bounded below: $\Psi^* := \inf_x \{f(x) + h(x)\} > -\infty$.
Assumption 2. The loss function $f$ is $L$-Lipschitz smooth (possibly nonconvex).
Assumption 3. Let $\xi_1, \xi_2, \ldots$ be i.i.d. random variables. For any $k$, instead of the full gradient $\nabla f(x_k)$, we only have access to a noisy gradient $G(x_k, \xi_k)$, which satisfies

(4) $\mathbb{E}[G(x_k, \xi_k)] = \nabla f(x_k),$

(5) $\mathbb{E}[\|G(x_k, \xi_k) - \nabla f(x_k)\|^2] \le \sigma^2,$

where $x_k$ is the $k$th iterate and $\sigma$ is a positive parameter.
Assumption 4. We assume either (i) or (ii):
(i) The function $h$ is separable w.r.t. the parameter with nonnegative weights; more precisely, $h(x) = \sum_{i=1}^{d} c_i h_i(x_i)$ with $c_i \ge 0$,
where each function $h_i$ is proper lower semicontinuous (possibly nonsmooth) and quasiconvex.
(ii) The function $h$ is proper lower semicontinuous (possibly nonsmooth) and quasiconvex.
Discussion of Assumptions: Assumptions 1 and 2 are commonly used in the first-order nonstochastic and stochastic optimization literature; Nonstochastic: FISTA (doi:10.1137/080716542, ), GIST (gong2013general, ), mAPG (NIPS2015_5728, ), PALM (Bolte2014, ); Stochastic: SAG (roux2012stochastic, ; schmidt2017minimizing, ), SDCA (shalev2013stochastic, ), SVRG (johnson2013accelerating, ; reddi2016stochastic, ), SAGA (NIPS2014_5258, ), SCSG (lei2017non, ), Katyusha (AllenZhu:2017:KFD:3055399.3055448, ), RSG (ghadimi2013stochastic, ), RSPG (ghadimi2016mini, ), RSAG (ghadimi2016accelerated, ). Instead of Assumption 1, most of these methods assume that a global minimizer exists, which is stronger than Assumption 1. Assumption 3 is a general assumption in the first-order stochastic optimization literature; RSG (ghadimi2013stochastic, ), RSPG (ghadimi2016mini, ), RSAG (ghadimi2016accelerated, ), MP (wang2017stochastic, ). In particular, (4) is known as a first-order stochastic oracle. Assumption 4 is satisfied for well-known nonconvex examples of $h$, as will be shown later. It may seem that Assumption 4(i) implies Assumption 4(ii), but this does not hold in general. The assumptions except for Assumption 4 are the same as in (ghadimi2016mini, ).
These assumptions cover many applications in machine learning, signal processing and computer vision.
Examples of $f$: (Convex) Linear/logistic regression (McCullagh:1989, ), $t$-distribution loss (maronna2006robust, ). (Nonconvex) Tukey's biweight (maronna2006robust, ), matrix completion (lu2014generalized, ), total variation model (beck2009fast, ), PCA (garber2015fast, ).
Examples of $h$: (Convex) Ridge (hoerl1970ridge, ), lasso ($\ell_1$) (tibshirani96regression, ), elastic net (Zou05regularizationand, ), adaptive lasso (doi:10.1198/016214506000000735, ). (Nonconvex) SCAD (scad, ), MCP (mcp, ), log-sum penalty (FRIEDMAN2012722, ; candes2008enhancing, ), capped-$\ell_1$ penalty (zhang2010analysis, ), $\ell_0$ penalty (10.23071269656, ; FRIEDMAN2012722, ).
In what follows, we discuss our method under Assumption 4(i) instead of Assumption 4(ii), because most of the examples can be found under Assumption 4(i). The theoretical properties obtained in this paper can also be proved under Assumption 4(ii) in a similar way.
3 SGD for Stochastic Doubly-Nonconvex Composite Optimization
3.1 Algorithm
Update Rule: We consider the following standard update rule of the minibatch SGD:

(6) $x_{k+1} = \operatorname{argmin}_{x} \left\{ \langle \bar{g}_k, x \rangle + \frac{1}{2\gamma_k}\|x - x_k\|^2 + h(x) \right\}, \qquad \bar{g}_k := \frac{1}{m_k}\sum_{j=1}^{m_k} G(x_k, \xi_{k,j}),$

where $m_k$ is the size of the minibatch at the $k$th iteration, $\xi_{k,j}$ is the $j$th element of the minibatch, and $\gamma_k$ is the step size at the $k$th iteration. Under Assumption 4(i), the update rule (6) can be reduced to the coordinatewise update rule:

(7) $x_{k+1,i} = \operatorname{argmin}_{x_i} \left\{ \bar{g}_{k,i}\, x_i + \frac{1}{2\gamma_k}(x_i - x_{k,i})^2 + c_i h_i(x_i) \right\},$

where $x_{k,i}$ and $\bar{g}_{k,i}$ are the $i$th elements of $x_k$ and $\bar{g}_k$, respectively.
Proximal Operator: The update rule (7) is equivalent to the proximal operator problem given by
$\operatorname{prox}_{\gamma_k c_i h_i}(u) := \operatorname{argmin}_{x_i} \left\{ \frac{1}{2\gamma_k}(x_i - u)^2 + c_i h_i(x_i) \right\}$ with $u = x_{k,i} - \gamma_k \bar{g}_{k,i}$.
This problem is nonconvex and a minimizer does not always exist. However, (Bolte2014, ) pointed out that a proximal operator problem with a proper lower semicontinuous function always has a well-defined solution set, i.e., a nonempty and compact solution set. Therefore, a solution exists at every iterative step. Moreover, some important nonconvex examples, e.g., SCAD, MCP, the log-sum penalty, and the capped-$\ell_1$ penalty, have closed-form solutions (gong2013general, ). In addition, the $\ell_0$ penalty has the closed-form solution known as hard thresholding. We illustrate some examples and the corresponding closed-form solutions in Appendix A1.
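As a minimal sketch (our own Python, one coordinate with a unit weight $c_i = 1$; function and parameter names are ours), the $\ell_1$ prox is soft thresholding and the $\ell_0$ prox is the hard thresholding mentioned above:

```python
import math

def soft_threshold(u, t):
    """Prox of t*|x|: shrink u toward zero by t (the l1 / lasso case)."""
    return math.copysign(max(abs(u) - t, 0.0), u)

def hard_threshold(u, lam, gamma):
    """Prox of the l0 penalty lam*1[x != 0] with step size gamma:
    keep u if the quadratic cost of zeroing it, u^2/(2*gamma),
    exceeds the penalty lam saved by zeroing; otherwise set it to zero."""
    return u if u * u > 2.0 * gamma * lam else 0.0
```

For example, `soft_threshold(3.0, 1.0)` returns `2.0`, while `hard_threshold` either keeps a coordinate unchanged or zeroes it outright; this discontinuity is exactly what makes the $\ell_0$ prox nonconvex yet still explicit.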
Output Selection: SGD for a convex objective function generally uses the average of iterates as an output. However, for a nonconvex objective function, the iterates are not always gathered around a local minimum, and averaging the iterates does not work as well as in the convex case. Therefore, following existing methods such as (ghadimi2013stochastic, ; ghadimi2016mini, ; ghadimi2016accelerated, ; wang2017stochastic, ), we adopt randomized selection, i.e., we select an output randomly from the iterates according to a probability mass function $P_R$. Our method randomly selects only one output according to $P_R$. In order to reduce large deviations of the output, (ghadimi2016mini, ) proposed the two-phase scheme, which randomly selects multiple outputs, validates them, and then chooses the final output from the validated outputs. In our experiments, we adopted this two-phase scheme.
Finally, we give the pseudocode of our methods in Algorithms 1 and 2.
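Since Algorithms 1 and 2 appear only as pseudocode, the following Python sketch illustrates one phase of the method under the standard reading of update rule (6): average $m_k$ noisy gradients, take a proximal step, and return one iterate chosen at random. The quadratic loss, Gaussian gradient noise, $\ell_1$ prox, and uniform selection over the later iterates are illustrative assumptions of ours, not the paper's exact choices.

```python
import numpy as np

def minibatch_prox_sgd(grad_oracle, prox, x0, gammas, batch_sizes, rng):
    """Mini-batch proximal SGD with randomized output selection.

    grad_oracle(x) returns one noisy gradient sample (an Assumption 3 style
    oracle); prox(u, gamma) solves the proximal subproblem for the penalty h.
    """
    x, iterates = np.asarray(x0, dtype=float), []
    for gamma, m in zip(gammas, batch_sizes):
        g = np.mean([grad_oracle(x) for _ in range(m)], axis=0)  # averaged noisy gradient
        x = prox(x - gamma * g, gamma)                           # update rule (6)
        iterates.append(x)
    # Randomized output selection: draw one iterate (here uniformly over the
    # later half, a simple stand-in for the selection mass P_R).
    half = len(iterates) // 2
    return iterates[half + rng.integers(len(iterates) - half)]

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])
grad = lambda x: 2.0 * (x - target) + rng.normal(scale=0.1, size=2)  # noisy gradient
prox_l1 = lambda u, g: np.sign(u) * np.maximum(np.abs(u) - 0.01 * g, 0.0)
out = minibatch_prox_sgd(grad, prox_l1, np.zeros(2), [0.1] * 200, [8] * 200, rng)
assert np.linalg.norm(out - target) < 0.1  # lands near the (slightly shrunken) optimum
```

Swapping `prox_l1` for a nonconvex prox such as SCAD or hard thresholding gives the doubly-nonconvex variant studied in this paper.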
3.2 Characterization of Update Rule
The update rule (7) can be seen as a Lagrangian relaxation problem (also called the Lagrange form) with Lagrange multiplier $c_i > 0$ and level $\eta$:

(8) $\min_{x_i} \left\{ \bar{g}_{k,i}\, x_i + \frac{1}{2\gamma_k}(x_i - x_{k,i})^2 + c_i h_i(x_i) \right\}.$

Then, the original problem (also called the constrained form) of (8) may be given by

(9) $\min_{x_i} \left\{ \bar{g}_{k,i}\, x_i + \frac{1}{2\gamma_k}(x_i - x_{k,i})^2 \right\} \quad \text{subject to } h_i(x_i) \le \eta.$
If the objective function in the Lagrangian relaxation problem (8) is convex, a minimizer of (8) also minimizes the original problem (9) under some regularity conditions. This is known as a sufficient optimality condition for convex programming problems (boyd2004convex, ). However, it does not generally hold in nonconvex cases. For a global minimizer of (8), we provide the following sufficient optimality condition.
Proposition 3.1.
Proof.
Remark on Proposition 3.1: We can show that the constraint set is a nonempty closed convex set, because $h_i$ is a proper lower semicontinuous quasiconvex function, so that the update rule (7) is regarded as the Euclidean projection onto this convex set from the point of view of the original problem (9). Actually, Proposition 3.1 holds for any nonconvex function $h_i$, but the corresponding original problem cannot generally be regarded as the Euclidean projection onto a convex set unless $h_i$ is quasiconvex. Recall that important examples of $h$, e.g., SCAD, MCP, the log-sum penalty, the capped-$\ell_1$ penalty, and the $\ell_0$ penalty, have closed-form solutions and satisfy the assumptions in Proposition 3.1 when the level $\eta$ is set appropriately.
Another sufficient optimality condition, which focuses on a nonsmooth quasiconvex function, can be found in (IVANOV2008964, ). (IVANOV2008964, ) uses a variant of the directional derivative to characterize a nonsmooth stationarity condition. In particular, the sufficient optimality condition in (IVANOV2008964, ) holds even for a local minimizer. Our Lagrangian relaxation problem (8) and the corresponding original problem (9) satisfy the assumptions supposed in (IVANOV2008964, ). Therefore, we can adopt the sufficient optimality condition in (IVANOV2008964, ), which is discussed in detail in Appendix A2.
Relation between Full Gradient and Stochastic Gradient: The following lemma shows that the Euclidean projection is a nonexpansive mapping.
Lemma 3.1.
Let $C$ be a convex set. The Euclidean projection onto the convex set $C$ is defined by $\Pi_C(x) := \operatorname{argmin}_{y \in C} \|y - x\|^2$. Then, we have $\|\Pi_C(x) - \Pi_C(x')\| \le \|x - x'\|$ for any $x, x'$.
This is a classical, well-known lemma (see Corollary 12.20 in (rockafellar2009variational, )). Let the update rule based on the full gradient be defined by

(12) $\bar{x}_{k+1} = \operatorname{argmin}_{x} \left\{ \langle \nabla f(x_k), x \rangle + \frac{1}{2\gamma_k}\|x - x_k\|^2 + h(x) \right\}.$
We provide the following proposition.
Proposition 3.2.
Let $x_{k+1,i}$ and $\bar{x}_{k+1,i}$ be the $i$th elements of $x_{k+1}$ and $\bar{x}_{k+1}$, respectively. Then, we have

(13) $|x_{k+1,i} - \bar{x}_{k+1,i}| \le \gamma_k\, |\bar{g}_{k,i} - \nabla_i f(x_k)|.$
Proof.
We see from the Remark on Proposition 3.1 that $x_{k+1}$ is the Euclidean projection onto a convex set. The update rule (12) can be reduced to a coordinatewise update rule, and then it is regarded as the Euclidean projection onto a convex set in a similar manner to (7). Applying Lemma 3.1 to the stochastic and full gradient steps, we have (13). ∎
Let the objective function in (12) be denoted by $\bar{\Phi}_k$. Since $\bar{x}_{k+1}$ is a global minimizer of $\bar{\Phi}_k$, we have $\bar{\Phi}_k(\bar{x}_{k+1}) \le \bar{\Phi}_k(x_k)$. Then, it follows from this inequality and (3) that the target function decreases as $x$ is changed from $x_k$ to $\bar{x}_{k+1}$, under a suitable step size. The update rule (7) for the stochastic doubly-nonconvex composite optimization problem does not directly have such a desirable property. However, Proposition 3.2 implies that such a desirable property approximately holds, up to the accuracy of the approximation of the noisy gradient to the full gradient. In addition, Proposition 3.2 plays a key role in the proof of the convergence property.
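The non-expansiveness in Lemma 3.1 is easy to verify numerically; the box constraint below is our own illustrative convex set, not one taken from the paper:

```python
import numpy as np

def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^d, a closed convex set."""
    return np.clip(x, lo, hi)

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.normal(size=3) * 5.0, rng.normal(size=3) * 5.0
    px, py = project_box(x, -1.0, 1.0), project_box(y, -1.0, 1.0)
    # Lemma 3.1: projecting onto a convex set never increases distances.
    assert np.linalg.norm(px - py) <= np.linalg.norm(x - y) + 1e-12
```

The same inequality applied to the two projected gradient steps (stochastic and full) is what yields the coordinatewise bound (13).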
3.3 Convergence Analysis
Convergence Property: Let us define the generalized projected gradient

(14) $g_k := \frac{1}{\gamma_k}(x_k - x_{k+1}).$
We obtain the following convergence property. The proof is in Appendix A3.
Theorem 3.1.
Suppose that the step sizes $\gamma_k$ are chosen such that $\gamma_k \le 2/L$, with $\gamma_k < 2/L$ for at least one $k$, and the probability mass function $P_R$ is chosen such that, for each $k$,
(15) 
Then, we have
(16) 
where the expectation is taken with respect to $R$ and the $\xi_k$'s.
Remark on Theorem 3.1: (ghadimi2016mini, ) studied the convergence rate in the case of a nonconvex loss function $f$ with a convex penalty $h$. Even when $h$ is nonconvex, we have succeeded in attaining the same convergence rate as in Theorem 2(a) of (ghadimi2016mini, ) up to a constant factor. Actually, we can obtain a better convergence rate than that of (ghadimi2016mini, ) under a specific setting of some parameters, as will be shown later. Moreover, we can obtain the same convergence rate under the finite-sum stochastic setting in a similar manner.
Here, we consider the step size and minibatch size in depth. For a stochastic convex optimization problem, a decreasing step size, e.g., $\gamma_k \propto 1/\sqrt{k}$, is generally used, which can guarantee convergence in expectation (see, e.g., Chapter 6 in (bubeck2015convex, )). However, a decreasing step size is not suitable for our method, because the selection probability (15) with a decreasing step size implies that early iterates tend to be selected, although later iterates are expected to be more accurate than early iterates. Therefore, we consider a constant step size. The minibatch size is closely related to the accuracy of the approximation of the noisy gradient to the full gradient. This accuracy is important, as seen in Proposition 3.2. An increasing/decreasing minibatch size gives a better approximation at later/early iterates. It is not clear which idea is better, because it depends on the problem. Therefore, we consider a constant minibatch size. In the constant-size case, we obtain the following theorem. The proof is in Appendix A4.
Theorem 3.2.
Suppose that the step sizes and minibatch sizes are constant, i.e., $\gamma_k = \gamma$ and $m_k = m$ for all $k$, and the probability mass function $P_R$ is chosen as in (15). Then, we have
(17) 
where .
How to Select the Minibatch Size: The bound (17) depends on the iteration limit $T$ and the minibatch size $m$. These are closely related to the total number of stochastic gradient evaluations, say $N$. Here we consider the case $N = Tm$ for simplicity, because $Tm$ is at most $N$. In this case, the bound (17) is minimized at a particular value of $m$. If we know $L$, $\sigma$, and $D_\Psi := \Psi(x_1) - \Psi^*$, then we can use this optimal value of $m$ to attain the optimal convergence rate. The two values $L$ and $\sigma$ can be estimated, but it is difficult to estimate $D_\Psi$, so $D_\Psi$ is often replaced by an appropriate value we can set (see (ghadimi2016mini, ) for details). The resulting value of $m$ may not be a positive integer and may not be smaller than $N$. Therefore, we propose
(18) 
and then we obtain the following theorem. The proof is in Appendix A5.
Theorem 3.3.
Remark on Theorem 3.3: Suppose that $N$ is relatively large. When the estimate equals its ideal value $D_\Psi$, (19) reduces to
(20) 
In view of the convergence speed in $N$, we focus on the dominant term in the bound (20), i.e., the second term on the right-hand side of (20). We can easily show that this term attains its minimum at the minibatch size (18). Here we compare two convergence rates:
Focusing on the dominant term in terms of the convergence speed in $N$, we can easily see that the latter bound is smaller than the former, i.e., our convergence rate with the minibatch size (18) is better than that in the SGD case of (4.23) in (ghadimi2016mini, ).
4 Experiments
We present numerical experimental results on a representative machine learning task: classification. All results were obtained in R version 3.3.0 on an Intel Core i7-4790K machine.
Problem Formulation: We consider the following regularized logistic regression problem:
$\min_{x} \ \frac{1}{n}\sum_{i=1}^{n} \log\left(1 + \exp(-b_i \langle a_i, x \rangle)\right) + h(x),$
where $a_i$ is a feature vector and $b_i \in \{-1, +1\}$ represents a class label. For $h$, we use SCAD, the capped-$\ell_1$ penalty, and the $\ell_0$ penalty.
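A hedged Python sketch of the corresponding stochastic first-order oracle (the paper's experiments are in R; the function and parameter names here are our own, assuming the standard $\{-1,+1\}$-label logistic loss):

```python
import numpy as np

def logistic_loss_grad(w, A, b):
    """Gradient of the average logistic loss (1/n) sum_i log(1 + exp(-b_i <a_i, w>)).

    A is the (n, p) feature matrix and b holds labels in {-1, +1}.
    """
    z = b * (A @ w)
    s = -b / (1.0 + np.exp(z))      # d/dz log(1 + exp(-z)) = -1 / (1 + exp(z))
    return A.T @ s / len(b)

def minibatch_grad(w, A, b, m, rng):
    """Noisy gradient from a random mini-batch of size m: an unbiased oracle
    in the sense of Assumption 3, with variance shrinking like 1/m."""
    idx = rng.integers(len(b), size=m)
    return logistic_loss_grad(w, A[idx], b[idx])
```

Plugging `minibatch_grad` together with a nonconvex prox (e.g., SCAD) into a proximal SGD loop reproduces the experimental setup in outline.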
Dataset: We used several real-world datasets, which are available at the UCI Machine Learning Repository (Dua:2017, ). Table 1 shows the details of the datasets. All datasets were normalized and divided into training and test sets in advance.
Dataset  Training Size  Test Size  Features 

MAGIC Gamma Telescope (MGT)  15000  4020  11 
MiniBooNE particle identification (MBpi)  70000  60064  50 
Parameter Setting: The initial point was set randomly, and we generated several different initial points. We estimated $L$ and $\sigma$ by using a relatively small set of subsamples drawn from the training data, following the approach in Sect. 6 of (ghadimi2016mini, ); these subsamples were used only for estimating $L$ and $\sigma$. For the two-phase scheme, we used held-out samples for the validation and randomly selected multiple candidate outputs. The step size and minibatch size were set as in Theorem 3.3. The lower bound in Assumption 1 was set to zero, because the objective function is nonnegative. The remaining tuning parameters of $h$ were set to fixed values.
Convergence Criterion: To verify convergence, we used a modified criterion whose full gradient was approximated by using test data. For the comparative methods, the randomly selected output was replaced by the final output.
Comparative Methods: There is no existing SGD method that guarantees convergence for stochastic doubly-nonconvex composite optimization. Therefore, we compared our method with averaging SGD (ASGD) and polynomial-decay averaging SGD (PDSGD), which guarantee convergence for the stochastic convex composite optimization problem. ASGD and PDSGD use different averages of the iterates as the final output. Moreover, we incorporated the minibatch scheme into the comparative methods. The minibatch size was set to the same value as in our setting.
Result: Table 2 shows the average of the convergence criterion over the different initial points. Our method outperformed the comparative methods. As the tuning parameter became larger, our method tended to show smaller values, whereas the comparative methods showed rather larger ones. For the largest tuning parameter, our method was much better than the comparative methods. This would be because the nonconvex effect is larger as the tuning parameter is larger. The sample size of the MBpi dataset is larger than that of the MGT dataset, although the number of features of the former is also larger than that of the latter. Our method gave much smaller values for the MBpi dataset than for the MGT dataset. The comparative methods did not show such behavior and rather presented worse behavior in some cases.
Dataset  Tuning parameter  Methods  SCAD  Capped penalty  penalty 

MGT  Our method  0.204  0.188  0.173  
ASGD  0.338  0.343  0.644  
PDSGD  0.271  0.276  0.844  
Our method  0.184  0.267  0.174  
ASGD  0.338  0.475  2.79  
PDSGD  0.271  0.406  2.22  
Our method  0.173  0.125  0.173  
ASGD  0.512  2.8  9.85  
PDSGD  0.575  2.14  5.46  
MBpi  Our method  0.0297  0.0487  0.0224  
ASGD  0.151  0.168  1.12  
PDSGD  0.055  0.0808  1.05  
Our method  0.0259  0.0281  0.0221  
ASGD  0.154  0.648  3.87  
PDSGD  0.0579  0.557  3.1  
Our method  0.022  0.00781  0.02  
ASGD  1.05  4.32  15.1  
PDSGD  0.841  1.8  11.9 
References
 (1) R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), Vol. 58, pp. 267–288, 1996.
 (2) Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, Vol. 96, No. 456, pp. 1348–1360, 2001.
 (3) CunHui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, Vol. 38, No. 2, pp. 894–942, 2010.
 (4) Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, Vol. 313, No. 5786, pp. 504–507, 2006.
 (5) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, Vol. 521, No. 7553, p. 436, 2015.
 (6) Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, Vol. 1. MIT press Cambridge, 2016.
 (7) Canyi Lu, Jinhui Tang, Shuicheng Yan, and Zhouchen Lin. Generalized nonconvex nonsmooth lowrank minimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4130–4137, 2014.
 (8) Ruoyu Sun and ZhiQuan Luo. Guaranteed matrix completion via nonconvex factorization. IEEE Transactions on Information Theory, Vol. 62, No. 11, pp. 6535–6579, 2016.
 (9) Peter J Rousseeuw and Annick M Leroy. Robust regression and outlier detection, Vol. 589. John wiley & sons, 2005.
 (10) Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, Vol. 11, No. Oct, pp. 2543–2596, 2010.
 (11) John C Duchi, Shai ShalevShwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, pp. 14–26, 2010.
 (12) Aaron Defazio, Justin Domke, et al. Finito: A faster, permutable incremental gradient method for big data problems. In International Conference on Machine Learning, pp. 1125–1133, 2014.
 (13) Aaron Defazio, Francis Bach, and Simon LacosteJulien. Saga: A fast incremental gradient method with support for nonstrongly convex composite objectives. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pp. 1646–1654. Curran Associates, Inc., 2014.
 (14) Zeyuan AllenZhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pp. 1200–1205, New York, NY, USA, 2017. ACM.
 (15) A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, Vol. 19, No. 4, pp. 1574–1609, 2009.
 (16) Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statist., Vol. 22, No. 3, pp. 400–407, 1951.
 (17) Alekh Agarwal, Martin J Wainwright, Peter L. Bartlett, and Pradeep K. Ravikumar. Informationtheoretic lower bounds on the oracle complexity of convex optimization. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pp. 1–9. Curran Associates, Inc., 2009.
 (18) Eric Moulines and Francis R Bach. Nonasymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451–459, 2011.
 (19) Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Informationtheoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, Vol. 58, pp. 3235–3249, 2012.
 (20) Alexander Rakhlin, Ohad Shamir, Karthik Sridharan, et al. Making gradient descent optimal for strongly convex stochastic optimization. In ICML. Citeseer, 2012.
 (21) Ohad Shamir and Tong Zhang. Stochastic gradient descent for nonsmooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on International Conference on Machine Learning  Volume 28, ICML’13, pp. I–71–I–79. JMLR.org, 2013.
 (22) Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, Vol. 133, No. 1, pp. 365–397, Jun 2012.
 (23) Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, Vol. 22, No. 4, pp. 1469–1492, 2012.
 (24) Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, Vol. 23, No. 4, pp. 2061–2089, 2013.
 (25) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pp. 315–323, 2013.
 (26) Zeyuan AllenZhu and Elad Hazan. Variance reduction for faster nonconvex optimization. In International Conference on Machine Learning, pp. 699–707, 2016.
 (27) Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pp. 314–323, 2016.
 (28) Saeed Ghadimi and Guanghui Lan. Stochastic firstand zerothorder methods for nonconvex stochastic programming. SIAM Journal on Optimization, Vol. 23, No. 4, pp. 2341–2368, 2013.
 (29) Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Minibatch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, Vol. 155, No. 12, pp. 267–305, 2016.
 (30) Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, Vol. 156, No. 12, pp. 59–99, 2016.
 (31) Weiran Wang and Nathan Srebro. Stochastic nonconvex optimization with large minibatches. arXiv preprint arXiv:1709.08728, 2017.
 (32) Amir Beck and Marc Teboulle. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, Vol. 2, No. 1, pp. 183–202, 2009.
 (33) Pinghua Gong, Changshui Zhang, Zhaosong Lu, Jianhua Huang, and Jieping Ye. A general iterative shrinkage and thresholding algorithm for nonconvex regularized optimization problems. In International Conference on Machine Learning, pp. 37–45, 2013.
 (34) Huan Li and Zhouchen Lin. Accelerated proximal gradient methods for nonconvex programming. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pp. 379–387. Curran Associates, Inc., 2015.
 (35) Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, Vol. 146, No. 1, pp. 459–494, Aug 2014.
 (36) Nicolas L Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663–2671, 2012.
 (37) Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, Vol. 162, No. 12, pp. 83–112, 2017.
 (38) Shai ShalevShwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, Vol. 14, No. Feb, pp. 567–599, 2013.
 (39) Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Nonconvex finitesum optimization via scsg methods. In Advances in Neural Information Processing Systems, pp. 2345–2355, 2017.
 (40) P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall/CRC, London, 1989.
 (41) R.A. Maronna, D.R. Martin, and V.J. Yohai. Robust Statistics: Theory and Methods. Wiley Series in Probability and Statistics. Wiley, 2006.
 (42) Amir Beck and Marc Teboulle. Fast gradientbased algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, Vol. 18, No. 11, pp. 2419–2434, 2009.
 (43) Dan Garber and Elad Hazan. Fast and simple pca via convex optimization. arXiv preprint arXiv:1509.05647, 2015.
 (44) Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, Vol. 12, No. 1, pp. 55–67, 1970.
 (45) Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, Vol. 67, pp. 301–320, 2005.
 (46) Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, Vol. 101, No. 476, pp. 1418–1429, 2006.
 (47) Jerome H. Friedman. Fast sparse regression and classification. International Journal of Forecasting, Vol. 28, No. 3, pp. 722–738, 2012.
 (48) Emmanuel J Candes, Michael B Wakin, and Stephen P Boyd. Enhancing sparsity by reweighted $\ell_1$ minimization. Journal of Fourier Analysis and Applications, Vol. 14, No. 5-6, pp. 877–905, 2008.
 (49) Tong Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, Vol. 11, No. Mar, pp. 1081–1107, 2010.
 (50) Ildiko E. Frank and Jerome H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, Vol. 35, No. 2, pp. 109–135, 1993.
 (51) Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
 (52) Vsevolod I. Ivanov. On the functions with pseudoconvex sublevel sets and optimality conditions. Journal of Mathematical Analysis and Applications, Vol. 345, No. 2, pp. 964–974, 2008.
 (53) R Tyrrell Rockafellar and Roger JB Wets. Variational analysis, Vol. 317. Springer Science & Business Media, 2009.
 (54) Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, Vol. 8, No. 3-4, pp. 231–357, 2015.
 (55) Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
Appendix
A1: Nonconvex Examples of $g$
We show nonconvex examples of $g$ and their closed-form proximal solutions given by [33]. For ease of notation, we remove subscripts and denote the corresponding (coordinate-wise) proximal operator problem as $\min_{x} \frac{1}{2}(x - v)^2 + g(x)$. The tuning parameters satisfy $\lambda > 0$ and $\theta > 0$.
SCAD: It has the following form, with $\theta > 2$:
$$ g(x) = \begin{cases} \lambda |x| & (|x| \le \lambda), \\ \dfrac{-x^2 + 2\theta\lambda|x| - \lambda^2}{2(\theta - 1)} & (\lambda < |x| \le \theta\lambda), \\ \dfrac{(\theta + 1)\lambda^2}{2} & (|x| > \theta\lambda). \end{cases} $$
The closed-form solution is given by
$$ x^{\star} = \operatorname*{arg\,min}_{x \in \{x_1, x_2, x_3\}} \frac{1}{2}(x - v)^2 + g(x), $$
where
$$ x_1 = \operatorname{sign}(v)\min\{\lambda, \max(0, |v| - \lambda)\}, \quad x_2 = \operatorname{sign}(v)\min\Big\{\theta\lambda, \max\Big(\lambda, \frac{(\theta - 1)|v| - \theta\lambda}{\theta - 2}\Big)\Big\}, \quad x_3 = \operatorname{sign}(v)\max\{\theta\lambda, |v|\}. $$
MCP: It has the following form, with $\theta > 1$:
$$ g(x) = \begin{cases} \lambda|x| - \dfrac{x^2}{2\theta} & (|x| \le \theta\lambda), \\ \dfrac{\theta\lambda^2}{2} & (|x| > \theta\lambda). \end{cases} $$
The closed-form solution is given by
$$ x^{\star} = \operatorname*{arg\,min}_{x \in \{x_1, x_2\}} \frac{1}{2}(x - v)^2 + g(x), $$
where $x_1 = \operatorname{sign}(v)\min\big\{\theta\lambda, \max\big(0, \frac{\theta(|v| - \lambda)}{\theta - 1}\big)\big\}$ and $x_2 = \operatorname{sign}(v)\max\{\theta\lambda, |v|\}$.
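As a concrete illustration, the candidate-comparison structure of the MCP closed form can be sketched in Python. This is a minimal sketch, not the authors' code: the function name `prox_mcp` and the convention that $\lambda$ absorbs any step size are ours.

```python
def prox_mcp(v, lam, theta):
    """Scalar MCP proximal operator, minimizing 0.5*(x - v)**2 + g(x).

    Sketch only: compares the minimizer restricted to each piece of the
    MCP penalty g (theta > 1); `lam` is assumed to absorb the step size.
    """
    s = 1.0 if v >= 0 else -1.0
    a = abs(v)
    candidates = [
        # minimizer restricted to |x| <= theta*lam
        s * min(theta * lam, max(0.0, theta * (a - lam) / (theta - 1))),
        # minimizer restricted to |x| > theta*lam (penalty is constant there)
        s * max(theta * lam, a),
    ]

    def mcp(x):
        t = abs(x)
        if t <= theta * lam:
            return lam * t - t * t / (2 * theta)
        return theta * lam * lam / 2

    # the best candidate over all pieces is the global minimizer
    return min(candidates, key=lambda x: 0.5 * (x - v) ** 2 + mcp(x))
```

Small inputs are shrunk to zero, large inputs pass through unchanged, and intermediate inputs are rescaled, which is exactly the piecewise behavior of the MCP closed form above.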
Log-sum penalty: It has the following form:
$$ g(x) = \lambda \log\Big(1 + \frac{|x|}{\theta}\Big). $$
The closed-form solution is given by
$$ x^{\star} = \operatorname{sign}(v) \operatorname*{arg\,min}_{q \in \{0, q_1, q_2\} \cap [0, \infty)} \frac{1}{2}(q - |v|)^2 + \lambda \log\Big(1 + \frac{q}{\theta}\Big), $$
where $q_1 = \frac{|v| - \theta + \sqrt{(|v| + \theta)^2 - 4\lambda}}{2}$, $q_2 = \frac{|v| - \theta - \sqrt{(|v| + \theta)^2 - 4\lambda}}{2}$ are the roots of the stationarity equation $q^2 + (\theta - |v|)q + \lambda - \theta|v| = 0$,
and they are taken as candidates only when $(|v| + \theta)^2 \ge 4\lambda$.
Capped-$\ell_1$ penalty: It has the following form:
$$ g(x) = \lambda \min\{|x|, \theta\}. $$
The closed-form solution is given by
$$ x^{\star} = \operatorname*{arg\,min}_{x \in \{x_1, x_2\}} \frac{1}{2}(x - v)^2 + g(x), $$
where $x_1 = \operatorname{sign}(v)\max\{\theta, |v|\}$ and $x_2 = \operatorname{sign}(v)\min\{\theta, \max(0, |v| - \lambda)\}$.
$\ell_0$ penalty: It has the following form:
$$ g(x) = \lambda \mathbb{1}\{x \neq 0\}. $$
The closed-form solution is given by the hard-thresholding rule
$$ x^{\star} = \begin{cases} v & (|v| > \sqrt{2\lambda}), \\ 0 & (|v| \le \sqrt{2\lambda}), \end{cases} $$
where both $x^{\star} = v$ and $x^{\star} = 0$ are optimal when $|v| = \sqrt{2\lambda}$.
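The remaining closed forms have the same flavor: hard thresholding for the $\ell_0$ penalty, and one candidate per penalty piece for SCAD. The sketch below illustrates both; the function names and the convention that $\lambda$ absorbs the step size are our own, not the notation of [33].

```python
def prox_l0(v, lam):
    """Scalar l0 prox (hard thresholding): minimizes 0.5*(x - v)**2 + lam*(x != 0).

    Sketch only; `lam` is assumed to absorb the step size.
    """
    return v if 0.5 * v * v > lam else 0.0


def prox_scad(v, lam, theta):
    """Scalar SCAD proximal operator by candidate comparison (theta > 2).

    Each candidate is the minimizer of 0.5*(x - v)**2 + g(x) restricted
    to one piece of the SCAD penalty g; the best candidate is global.
    """
    s = 1.0 if v >= 0 else -1.0
    a = abs(v)
    candidates = [
        s * min(lam, max(0.0, a - lam)),                    # |x| <= lam
        s * min(theta * lam,
                max(lam, ((theta - 1) * a - theta * lam) / (theta - 2))),  # middle piece
        s * max(theta * lam, a),                            # |x| > theta*lam
    ]

    def scad(x):
        t = abs(x)
        if t <= lam:
            return lam * t
        if t <= theta * lam:
            return (-t * t + 2 * theta * lam * t - lam * lam) / (2 * (theta - 1))
        return (theta + 1) * lam * lam / 2

    return min(candidates, key=lambda x: 0.5 * (x - v) ** 2 + scad(x))
```

For $|v| \le 2\lambda$ the SCAD prox behaves like soft thresholding, while large $|v|$ passes through unchanged (no bias), matching the motivation for nonconvex penalties in the introduction.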
A2: Sufficient Optimality Condition in [52]
In this section, we modify the sufficient optimality condition in [52] so that it applies to our method. We follow the notation of [52]; for more detailed descriptions and proofs, we refer the reader to [52].
Let $S$ be a set in the Euclidean space $\mathbb{R}^n$, and let $\mathrm{cl}\,S$ denote the closed hull of $S$. We denote a neighbourhood of a point $x$ by $N(x)$, and the sublevel set of the function $f$ at $x_0$ by $L_f(x_0) = \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$.
Definition .1 (Bouligand tangent cone). The Bouligand tangent cone of the set $S$ at $x \in \mathrm{cl}\,S$ is defined by
$$ T(S, x) = \{ d \in \mathbb{R}^n : \exists\, t_k \downarrow 0,\ \exists\, d_k \to d \ \text{such that}\ x + t_k d_k \in S \ \text{for all}\ k \}. $$
Definition .2 (Lower Hadamard directional derivative). The lower Hadamard directional derivative of the function $f$ at $x$ in direction $d$ is defined by
$$ f'_{-}(x, d) = \liminf_{t \downarrow 0,\ d' \to d} \frac{f(x + t d') - f(x)}{t}. $$
If the function $f$ is differentiable, it reduces to $f'_{-}(x, d) = \nabla f(x)^{\top} d$.
Definition .3 (Strongly pseudoconvex sublevel set). Let $x_0 \in \mathbb{R}^n$ and $x \in L_f(x_0)$. The sublevel set $L_f(x_0)$ is said to be strongly pseudoconvex if, for all $x \in L_f(x_0)$ and all $d \in T(L_f(x_0), x)$, there are a number $\delta > 0$ and sequences $t_k \downarrow 0$, $d_k \to d$ satisfying the approximation property stated in [52].
In fact, any differentiable strictly convex function satisfies this definition. For example, the squared $\ell_2$ norm $\frac{1}{2}\|x\|^2$ satisfies it because it is differentiable and strongly convex.
We consider the following optimization problem (P): $\min_{x} f(x)$ subject to $x \in S$.
We then state the following modification of the sufficient optimality condition in [52].
Theorem .1 (The modified sufficient optimality condition of Theorem 10 in [52]). Let $\bar{x}$ be a feasible point of (P). Assume that $f$ is differentiable, that its sublevel set is strongly pseudoconvex, that $g$ is quasiconvex, and that the remaining regularity conditions of Theorem 10 in [52] hold. If the first-order condition $\nabla f(\bar{x})^{\top} d \ge 0$ is satisfied for every feasible direction $d$, then $\bar{x}$ is the unique global minimizer of (P).
In our problem formulation, $f$ corresponds to the loss function and $g$ corresponds to the penalty function. When we set the reference point appropriately, the feasibility assumption is satisfied. As shown before, $f$ has a strongly pseudoconvex sublevel set. If $\bar{x}$ is a local minimizer, the first-order optimality condition holds at $\bar{x}$, and hence $\nabla f(\bar{x})^{\top} d \ge 0$ for any feasible direction $d$. Therefore, our problem formulation with a local minimizer satisfies the conditions of the modified sufficient optimality condition of Theorem 10 in [52].
A3: Proof of Theorem 3.1
Proof.
Replacing the corresponding quantities in (3) and using (14), we see that
(21) 
Let the objective function in (7) be denoted by . Since is a global minimizer of , we have