Zeroth-Order Stochastic Alternating Direction Method of Multipliers
for Nonconvex Nonsmooth Optimization
Abstract
The alternating direction method of multipliers (ADMM) is a popular optimization tool for composite and constrained problems in machine learning. However, in many machine learning problems, such as black-box attacks and bandit feedback, ADMM can fail because explicit gradients are difficult or infeasible to obtain. Zeroth-order (gradient-free) methods can effectively solve these problems because they require only objective function values. Although a few zeroth-order ADMM methods exist, they rely on convexity of the objective function, which limits their use in many applications. In this paper, we therefore propose a class of fast zeroth-order stochastic ADMM methods (i.e., ZO-SVRG-ADMM and ZO-SAGA-ADMM) for solving nonconvex problems with multiple nonsmooth penalties, based on the coordinate smoothing gradient estimator. Moreover, we prove that both ZO-SVRG-ADMM and ZO-SAGA-ADMM have a convergence rate of $O(1/T)$, where $T$ denotes the number of iterations. In particular, our methods not only reach the best known convergence rate for nonconvex optimization, but can also effectively solve many complex machine learning problems with multiple regularized penalties and constraints. Finally, we conduct experiments on black-box binary classification and structured adversarial attacks on black-box deep neural networks to validate the efficiency of our algorithms.
Feihu Huang, Shangqian Gao, Songcan Chen, Heng Huang*
Department of Electrical & Computer Engineering, University of Pittsburgh, USA
College of Computer Science & Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis & Machine Intelligence, China
JD Finance America Corporation
feh23@pitt.edu, shg84@pitt.edu, s.chen@nuaa.edu.cn, heng.huang@pitt.edu
1 Introduction
The alternating direction method of multipliers (ADMM [?; ?]) is a popular optimization tool for solving composite and constrained problems in machine learning. In particular, ADMM can efficiently optimize problems with complicated structured regularization, such as the graph-guided fused lasso [?], which are too complicated for other popular optimization methods such as proximal gradient methods [?]. Thus, ADMM has been widely studied in recent years [?]. For large-scale optimization, the stochastic ADMM method [?] has been proposed. Due to the variance of the stochastic gradient, however, these methods suffer from a slow convergence rate. To speed up convergence, faster stochastic ADMM methods [?; ?] have recently been proposed using variance reduced (VR) techniques such as SVRG [?]. ADMM is also highly successful in solving various nonconvex problems, such as tensor decomposition [?] and training neural networks [?]. Thus, some fast nonconvex stochastic ADMM methods have been developed in [?].
Currently, most ADMM methods need to compute gradients of the loss functions at each iteration. However, in many machine learning problems, an explicit expression for the gradient of the objective function is difficult or infeasible to obtain. For example, in black-box situations, only prediction results (i.e., function values) are provided [?; ?]. In bandit settings [?], the player only receives partial feedback in terms of loss function values, so it is impossible to obtain an explicit gradient of the loss function. Clearly, classic optimization methods based on first-order gradient or second-order information are not suited to these problems. Thus, zeroth-order optimization methods [?; ?] have been developed, which use only function values in the optimization.
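Table 1: Convergence properties of the proposed methods and other related ones.
Algorithm  Reference  Gradient Estimator  Problem  Convergence Rate
ZOO-ADMM  [?]  GauSGE  C(S) + C(NS)  
ZO-GADM  [?]  UniSGE  C(S) + C(NS)  
RSPGF  [?]  GauSGE  NC(S) + C(NS)  
ZO-ProxSVRG  [?]  CooSGE  NC(S) + C(NS)  
ZO-ProxSAGA  [?]  CooSGE  NC(S) + C(NS)  
ZO-SVRG-ADMM  Ours  CooSGE  NC(S) + C(mNS)  
ZO-SAGA-ADMM  Ours  CooSGE  NC(S) + C(mNS)  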
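In this paper, we focus on using zeroth-order methods to solve the following nonconvex nonsmooth problem:
$$\min_{x, \{y_j\}_{j=1}^k} F(x, y_{[k]}) := \frac{1}{n}\sum_{i=1}^n f_i(x) + \sum_{j=1}^k \psi_j(y_j) \tag{1}$$
$$\text{s.t.} \quad Ax + \sum_{j=1}^k B_j y_j = c,$$
where $A \in \mathbb{R}^{p \times d}$, $B_j \in \mathbb{R}^{p \times q}$ for all $j \in [k]$, $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x): \mathbb{R}^d \to \mathbb{R}$ is a nonconvex and smooth function, and each $\psi_j(y_j): \mathbb{R}^q \to \mathbb{R}$ is a convex and nonsmooth function. In machine learning, the function $f(x)$ can be used for the empirical loss, $\sum_{j=1}^k \psi_j(y_j)$ for multiple structure penalties (e.g., sparse + group sparse), and the constraint for encoding the structure pattern of model parameters, such as graph structure. Owing to the flexibility in splitting the objective function into the loss $f(x)$ and each penalty $\psi_j(y_j)$, ADMM is an efficient method for solving the above constrained problem. However, in problem (1), we can only access objective values rather than an explicit expression of the objective function, so the classic ADMM methods are unsuitable for this problem.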
Recently, [?; ?] proposed zeroth-order stochastic ADMM methods, which use only objective values to optimize. However, these zeroth-order ADMM-based methods rely on convexity of the objective function, which limits their use in many applications, such as adversarial attacks on black-box deep neural networks (DNNs). Moreover, because problem (1) includes multiple nonsmooth regularization functions and a constraint, the existing nonconvex zeroth-order algorithms [?; ?; ?] are not suitable for this problem.
In this paper, we therefore propose a class of fast zeroth-order stochastic ADMM methods (i.e., ZO-SVRG-ADMM and ZO-SAGA-ADMM) to solve problem (1), based on the coordinate smoothing gradient estimator [?]. In particular, the ZO-SVRG-ADMM and ZO-SAGA-ADMM methods build on SVRG [?] and SAGA [?], respectively. Moreover, we study the convergence properties of the proposed methods. Table 1 shows the convergence properties of the proposed methods and other related ones.
1.1 Challenges and Contributions
Although both SVRG and SAGA perform well in first-order and second-order methods, applying these techniques to the nonconvex zeroth-order ADMM method is not trivial. There exist at least two main challenges:

• Due to the failure of Fejér monotonicity of the iterates, the convergence analysis of nonconvex ADMM is generally quite difficult [?]. Using an inexact zeroth-order estimated gradient makes this difficulty even greater in nonconvex zeroth-order ADMM methods.
• To guarantee convergence of our zeroth-order ADMM methods, we need to design a new effective Lyapunov function, which cannot simply follow those of the existing nonconvex (stochastic) ADMM methods [?; ?].
Thus, we carefully establish the Lyapunov functions in the following theoretical analysis to ensure convergence of the proposed methods. In summary, our major contributions are given below:

• We propose a class of fast zeroth-order stochastic ADMM methods (i.e., ZO-SVRG-ADMM and ZO-SAGA-ADMM) to solve problem (1).
• We prove that both ZO-SVRG-ADMM and ZO-SAGA-ADMM have a convergence rate of $O(1/T)$ for nonconvex nonsmooth optimization. In particular, our methods not only reach the best existing convergence rate for nonconvex optimization, but can also effectively solve many machine learning problems with multiple complex regularized penalties.
• Extensive experiments on black-box classification and structured adversarial attacks on black-box DNNs validate the efficiency of the proposed algorithms.
2 Related Works
Zeroth-order (gradient-free) optimization is a powerful tool for solving many machine learning problems in which the gradient of the objective function is unavailable or computationally prohibitive. Zeroth-order optimization methods have recently been widely applied and studied. For example, they have been applied to bandit feedback analysis [?] and black-box attacks on DNNs [?; ?]. [?] proposed several random zeroth-order methods using the Gaussian smoothing gradient estimator. To deal with nonsmooth regularization, [?; ?] proposed zeroth-order online/stochastic ADMM-based methods.
So far, the above algorithms mainly rely on convexity of the problems. In fact, zeroth-order methods are also highly successful in solving various nonconvex problems, such as adversarial attacks on black-box DNNs [?]. Thus, [?; ?; ?] have begun to study zeroth-order stochastic methods for nonconvex optimization. To deal with nonsmooth regularization, [?; ?] proposed some nonconvex zeroth-order proximal stochastic gradient methods. However, these methods are still not well suited to some complex machine learning problems, such as the structured adversarial attack on black-box DNNs described in the experiments below.
2.1 Notations
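Let $y_{[k]} = \{y_1, \dots, y_k\}$ and $y_{[j:k]} = \{y_j, \dots, y_k\}$ for $j \in [k] = \{1, 2, \dots, k\}$. Given a positive definite matrix $G$, $\|x\|_G^2 = x^T G x$; $\sigma_{\max}(G)$ and $\sigma_{\min}(G)$ denote the largest and smallest eigenvalues of $G$, respectively, and $\kappa_G = \sigma_{\max}(G)/\sigma_{\min}(G)$. $\sigma^A_{\max}$ and $\sigma^A_{\min}$ denote the largest and smallest eigenvalues of the matrix $A^T A$.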
3 Preliminaries
In this section, we begin by restating the standard notion of an approximate stationary point of problem (1), as in [?; ?].
Definition 1.
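Given $\epsilon > 0$, the point $(x^*, y^*_{[k]}, \lambda^*)$ is said to be an $\epsilon$-approximate stationary point of problem (1) if it holds that
$$\mathbb{E}\big[\mathrm{dist}\big(0, \partial L(x^*, y^*_{[k]}, \lambda^*)\big)^2\big] \le \epsilon, \tag{2}$$
where $L(x, y_{[k]}, \lambda) = f(x) + \sum_{j=1}^k \psi_j(y_j) - \langle \lambda, Ax + \sum_{j=1}^k B_j y_j - c \rangle$, $\partial L$ collects the (sub)gradients of $L$ with respect to $x$, $y_1, \dots, y_k$ and $\lambda$, and $\mathrm{dist}(0, \partial L) = \inf_{L' \in \partial L} \|0 - L'\|$.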
Next, we make some mild assumptions regarding problem (1) as follows:
Assumption 1.
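Each function $f_i(x)$ is $L$-smooth for all $i \in \{1, 2, \dots, n\}$, such that
$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathbb{R}^d,$$
which implies
$$f_i(x) \le f_i(y) + \nabla f_i(y)^T (x - y) + \frac{L}{2}\|x - y\|^2.$$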
Assumption 2.
The gradient of each function $f_i(x)$ is bounded, i.e., there exists a constant $\delta > 0$ such that for all $x$, it follows that $\|\nabla f_i(x)\|^2 \le \delta^2$.
Assumption 3.
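$f(x)$ and $\psi_j(y_j)$ for all $j \in [k]$ are all lower bounded, and we denote $f^* = \inf_x f(x)$ and $\psi_j^* = \inf_{y_j} \psi_j(y_j)$ for $j \in [k]$.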
Assumption 4.
$A$ is a full row or column rank matrix.
Assumption 1 has been commonly used in the convergence analysis of nonconvex algorithms [?]. Assumption 2 is widely used for stochastic gradient-based and ADMM-type methods [?]. Assumptions 3 and 4 are usually used in the convergence analysis of ADMM methods [?; ?]. Without loss of generality, we will assume that the matrix $A$ has full column rank in the rest of this paper.
4 Fast Zeroth-Order Stochastic ADMMs
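In this section, we propose a class of zeroth-order stochastic ADMM methods to solve problem (1). First, we define an augmented Lagrangian function of problem (1) as follows:
$$\mathcal{L}_\rho(x, y_{[k]}, \lambda) = f(x) + \sum_{j=1}^k \psi_j(y_j) - \Big\langle \lambda, Ax + \sum_{j=1}^k B_j y_j - c \Big\rangle + \frac{\rho}{2}\Big\|Ax + \sum_{j=1}^k B_j y_j - c\Big\|^2, \tag{3}$$
where $\lambda$ and $\rho$ denote the dual variable and penalty parameter, respectively.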
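As a minimal sketch of how an augmented Lagrangian of the form (3) can be evaluated, the NumPy snippet below assumes the constraint residual $Ax + \sum_j B_j y_j - c$ and a minus sign on the multiplier term. The function name `augmented_lagrangian`, the toy loss and the $\ell_1$ penalty are illustrative assumptions, not part of the original method.

```python
import numpy as np

def augmented_lagrangian(f, psis, x, ys, lam, A, Bs, c, rho):
    """Evaluate f(x) + sum_j psi_j(y_j) - <lam, r> + (rho/2)||r||^2,
    where r = A x + sum_j B_j y_j - c is the constraint residual."""
    residual = A @ x + sum(B @ y for B, y in zip(Bs, ys)) - c
    penalties = sum(psi(y) for psi, y in zip(psis, ys))
    return f(x) + penalties - lam @ residual + 0.5 * rho * (residual @ residual)

# Hypothetical toy instance: one penalty block (k = 1) with constraint x - y = 0.
toy_loss = lambda v: float(np.sum(np.cos(v)))   # stand-in nonconvex smooth loss
l1_norm = lambda v: float(np.sum(np.abs(v)))    # convex nonsmooth penalty
A = np.eye(2)
B = -np.eye(2)
c = np.zeros(2)
x = np.array([1.0, -2.0])
lam = np.ones(2)

# At a feasible point the multiplier and penalty terms vanish.
value_feasible = augmented_lagrangian(toy_loss, [l1_norm], x, [x.copy()], lam, A, [B], c, rho=2.0)
# At an infeasible point both terms contribute.
value_infeasible = augmented_lagrangian(toy_loss, [l1_norm], x, [np.zeros(2)], lam, A, [B], c, rho=2.0)
```

At a feasible point the value reduces to the plain objective, which is a quick sanity check when wiring up the splitting.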
In problem (1), the explicit expression of the objective function $f_i(x)$ is not available, and only the function value of $f_i(x)$ is available. To avoid computing explicit gradients, we use the coordinate smoothing gradient estimator [?] to estimate gradients: for $i \in [n]$,
$$\hat{\nabla} f_i(x) = \sum_{j=1}^d \frac{1}{2\mu_j}\big(f_i(x + \mu_j e_j) - f_i(x - \mu_j e_j)\big) e_j, \tag{4}$$
where $\mu_j$ is a coordinate-wise smoothing parameter, and $e_j$ is a standard basis vector with 1 at its $j$-th coordinate and 0 elsewhere.
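The estimator in (4) can be sketched in a few lines of NumPy. The snippet assumes the usual central-difference form of the coordinate smoothing gradient estimator, using two function queries per coordinate. The name `coord_grad_estimate` and the quadratic test function are hypothetical and serve only as illustration.

```python
import numpy as np

def coord_grad_estimate(f, x, mu):
    """Coordinate smoothing gradient estimate of a black-box f at a 1-D point x.

    mu may be a scalar or a vector of per-coordinate smoothing parameters.
    Uses 2 * d function queries and no gradient information.
    """
    x = np.asarray(x, dtype=float)
    mu = np.broadcast_to(np.asarray(mu, dtype=float), x.shape)
    grad = np.zeros_like(x)
    for j in range(x.size):
        e_j = np.zeros_like(x)
        e_j[j] = 1.0  # standard basis vector
        grad[j] = (f(x + mu[j] * e_j) - f(x - mu[j] * e_j)) / (2.0 * mu[j])
    return grad

# Illustration on a quadratic, where central differences are exact up to rounding.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
quadratic = lambda v: 0.5 * float(v @ Q @ v)
x0 = np.array([1.0, -3.0])
estimate = coord_grad_estimate(quadratic, x0, mu=1e-3)
true_gradient = Q @ x0
```

The cost grows linearly with the dimension $d$, which is the price paid for the lower variance of this estimator compared with random-direction estimators.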
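Based on the above estimated gradients, we propose a zeroth-order ADMM (ZO-ADMM) method to solve problem (1) by executing the following iterations, for $t = 1, 2, \dots$:
$$\begin{cases} y_j^{t+1} = \arg\min_{y_j} \big\{ \mathcal{L}_\rho(x_t, y_{[j-1]}^{t+1}, y_j, y_{[j+1:k]}^{t}, \lambda_t) + \frac{1}{2}\|y_j - y_j^t\|_{H_j}^2 \big\}, \quad \forall j \in [k], \\ x_{t+1} = \arg\min_x \hat{\mathcal{L}}_\rho\big(x, y_{[k]}^{t+1}, \lambda_t, \hat{\nabla} f(x_t)\big), \\ \lambda_{t+1} = \lambda_t - \rho\big(A x_{t+1} + \sum_{j=1}^k B_j y_j^{t+1} - c\big), \end{cases} \tag{5}$$
where the term $\frac{1}{2}\|y_j - y_j^t\|_{H_j}^2$ with $H_j \succ 0$ is used to linearize the term $\|Ax + \sum_{j=1}^k B_j y_j - c\|^2$. Here, because an inexact zeroth-order gradient is used to update $x$, we define an approximate function over $x_t$ as follows:
$$\hat{\mathcal{L}}_\rho\big(x, y_{[k]}, \lambda_t, \hat{\nabla} f(x_t)\big) = f(x_t) + \hat{\nabla} f(x_t)^T (x - x_t) + \frac{1}{2\eta}\|x - x_t\|_G^2 + \sum_{j=1}^k \psi_j(y_j) - \lambda_t^T\Big(Ax + \sum_{j=1}^k B_j y_j - c\Big) + \frac{\rho}{2}\Big\|Ax + \sum_{j=1}^k B_j y_j - c\Big\|^2, \tag{6}$$
where $G \succ 0$, $\hat{\nabla} f(x_t)$ is the zeroth-order gradient and $\eta > 0$ is a step size. Considering that the matrix $A^T A$ is large, we set $G = rI - \rho\eta A^T A \succeq I$ with $r \ge \rho\eta\sigma_{\max}(A^T A) + 1$ to linearize the term $\|Ax + \sum_{j=1}^k B_j y_j - c\|^2$.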
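To make the iterations (5) and (6) concrete, the sketch below works through one ZO-ADMM step on a simplified instance. Everything in it is an assumption made for illustration: a single penalty block ($k = 1$) with $\psi(y) = \tau\|y\|_1$, the constraint $Ax - y = 0$, $H = hI$, the choice $G = rI - \rho\eta A^T A$, and the sign convention $\lambda_{t+1} = \lambda_t - \rho(Ax_{t+1} - y_{t+1})$. Under these assumptions both subproblems have closed forms: the $y$-update is a soft-thresholding step, and the $x$-update becomes $x_{t+1} = x_t - \frac{\eta}{r}\big(\hat{g} - A^T\lambda_t + \rho A^T(Ax_t - y_{t+1})\big)$.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def zo_admm_step(grad_est, x, y, lam, A, tau, rho, eta, h=1.0):
    """One linearized ADMM step for min f(x) + tau*||y||_1 s.t. A x - y = 0.

    grad_est(x) returns a (zeroth-order) gradient estimate of f at x;
    h is the weight of the proximal term H = h * I in the y-update.
    """
    # y-update: minimize tau*||y||_1 + lam^T y + (rho/2)||A x - y||^2 + (h/2)||y - y_t||^2
    v = (rho * (A @ x) + h * y - lam) / (rho + h)
    y_new = soft_threshold(v, tau / (rho + h))
    # x-update: closed form obtained with G = r*I - rho*eta*A^T A
    r = rho * eta * np.linalg.norm(A, 2) ** 2 + 1.0
    direction = grad_est(x) - A.T @ lam + rho * A.T @ (A @ x - y_new)
    x_new = x - (eta / r) * direction
    # dual update
    lam_new = lam - rho * (A @ x_new - y_new)
    return x_new, y_new, lam_new

# Hypothetical toy data; np.sin stands in for a zeroth-order gradient estimate.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
x, y, lam = rng.standard_normal(4), rng.standard_normal(3), rng.standard_normal(3)
rho, eta, tau = 2.0, 0.1, 0.05
grad_est = lambda v: np.sin(v)
x_new, y_new, lam_new = zo_admm_step(grad_est, x, y, lam, A, tau, rho, eta)
```

The point of choosing $G$ this way is that the $x$-update needs only matrix-vector products with $A$ and $A^T$, never a linear solve involving $A^T A$.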
In problem (1), not only is the gradient of each $f_i(x)$ unavailable, but the sample size $n$ is also very large. Thus, we propose the fast ZO-SVRG-ADMM and ZO-SAGA-ADMM methods to solve problem (1), based on SVRG and SAGA, respectively.
Algorithm 1 shows the algorithmic framework of ZO-SVRG-ADMM. In Algorithm 1, we use the estimated stochastic gradient $\hat{g}_t^{s+1} = \hat{\nabla} f_{\mathcal{I}_t}(x_t^{s+1}) - \hat{\nabla} f_{\mathcal{I}_t}(\tilde{x}^s) + \hat{\nabla} f(\tilde{x}^s)$ with $\hat{\nabla} f_{\mathcal{I}_t}(x) = \frac{1}{b}\sum_{i \in \mathcal{I}_t} \hat{\nabla} f_i(x)$. We have $\mathbb{E}_{\mathcal{I}_t}[\hat{g}_t^{s+1}] = \hat{\nabla} f(x_t^{s+1}) \neq \nabla f(x_t^{s+1})$, i.e., this stochastic gradient is a biased estimate of the true full gradient. Although SVRG has shown great promise, it relies on the assumption that the stochastic gradient is an unbiased estimate of the true full gradient. Thus, adapting the ideas of SVRG to zeroth-order ADMM optimization is not a trivial task. To handle this challenge, we choose an appropriate step size $\eta$, penalty parameter $\rho$ and smoothing parameter $\mu$ to guarantee the convergence of our algorithms, as discussed in the following convergence analysis.
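The variance-reduced estimate described above can be sketched as follows. The snippet assumes the standard SVRG-style combination of a mini-batch estimate at the current point, the same mini-batch estimate at a snapshot point, and a full estimate at the snapshot. The helper names `coord_grad` and `zo_svrg_gradient`, and the toy component functions, are hypothetical.

```python
import numpy as np

def coord_grad(f, x, mu):
    """Central-difference coordinate gradient estimate of f at x."""
    g = np.zeros(x.size)
    for j in range(x.size):
        step = np.zeros(x.size)
        step[j] = mu
        g[j] = (f(x + step) - f(x - step)) / (2.0 * mu)
    return g

def zo_svrg_gradient(fs, x, x_snap, full_snap_grad, batch, mu):
    """Mini-batch estimate at x, minus the same mini-batch estimate at the
    snapshot x_snap, plus the full zeroth-order estimate at the snapshot."""
    g_x = np.mean([coord_grad(fs[i], x, mu) for i in batch], axis=0)
    g_snap = np.mean([coord_grad(fs[i], x_snap, mu) for i in batch], axis=0)
    return g_x - g_snap + full_snap_grad

# Hypothetical component functions f_i(x) = sin(a_i^T x).
rng = np.random.default_rng(0)
data = rng.standard_normal((6, 3))
fs = [lambda v, a=a: float(np.sin(a @ v)) for a in data]
mu = 1e-3

x_snap = rng.standard_normal(3)                      # snapshot point
full_snap_grad = np.mean([coord_grad(f, x_snap, mu) for f in fs], axis=0)
x = x_snap + 0.1 * rng.standard_normal(3)            # current inner iterate
batch = rng.choice(len(fs), size=2, replace=False)   # sampled mini-batch
g = zo_svrg_gradient(fs, x, x_snap, full_snap_grad, batch, mu)
```

Averaged over the mini-batch, `g` matches the full zeroth-order estimate at `x` rather than the true gradient, which is the bias discussed in the text.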
5 Convergence Analysis
In this section, we study the convergence properties of the proposed algorithms (ZO-SVRG-ADMM and ZO-SAGA-ADMM). For notational simplicity, let
5.1 Convergence Analysis of ZO-SVRG-ADMM
In this subsection, we analyze the convergence properties of ZO-SVRG-ADMM.
Given that the sequence is generated by Algorithm 1, we define a Lyapunov function:
where the positive sequence satisfies
In addition, we define a useful variable .
Theorem 1.
Remark 1.
Theorem 1 shows that given , , , and , the ZO-SVRG-ADMM has convergence rate of . Specifically, when , given , the ZO-SVRG-ADMM has convergence rate of ; when , given , it has convergence rate of ; when , given , it has convergence rate of . Thus, the ZO-SVRG-ADMM has the optimal function query complexity of for finding an approximate local solution.
5.2 Convergence Analysis of ZO-SAGA-ADMM
In this subsection, we provide the convergence analysis of ZO-SAGA-ADMM.
Given that the sequence is generated by Algorithm 2, we define a Lyapunov function
where the positive sequence satisfies
In addition, we define a useful variable .
Theorem 2.
Remark 2.
Theorem 2 shows that given , , and , the ZO-SAGA-ADMM has convergence rate of . Specifically, when , given , the ZO-SAGA-ADMM has convergence rate of ; when , given , it has convergence rate of ; when , given , it has convergence rate of . Thus, the ZO-SAGA-ADMM has the optimal function query complexity of for finding an approximate local solution.
6 Experiments
In this section, we compare our algorithms (ZO-SVRG-ADMM, ZO-SAGA-ADMM) with ZO-ProxSVRG and ZO-ProxSAGA [?], the deterministic zeroth-order ADMM (ZO-ADMM), and the zeroth-order stochastic ADMM without variance reduction (ZO-SGD-ADMM) on two applications: 1) robust black-box binary classification, and 2) structured adversarial attacks on black-box DNNs.
datasets  #samples  #features  #classes 

20news  16,242  100  2 
a9a  32,561  123  2 
w8a  64,700  300  2 
covtype.binary  581,012  54  2 
6.1 Robust Black-Box Binary Classification
In this subsection, we focus on a robust black-box binary classification task with graph-guided fused lasso. Given a set of training samples , where and , we find the optimal parameter by solving the problem:
(7) 
where is the black-box loss function, which only returns the function value for a given input. Here, we specify the loss function , which is the nonconvex robust correntropy-induced loss [?]. Matrix decodes the sparsity pattern of the graph obtained by sparse inverse covariance selection, as in [?]. In the experiment, we set the mini-batch size , smoothing parameter and penalty parameters .
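For readers unfamiliar with the correntropy-induced loss, the sketch below uses one common form, $\frac{\sigma^2}{2}\big(1 - \exp(-(l_i - a_i^T x)^2/\sigma^2)\big)$, where $a_i$ is a feature vector and $l_i$ its label. This form, the bandwidth $\sigma$, and the function names are assumptions made for illustration; the exact loss used in the experiment may differ. The loss is zero at a perfect fit and saturates at $\sigma^2/2$ for large residuals, which is what makes it robust to outliers and also nonconvex.

```python
import numpy as np

def correntropy_loss(x, a, label, sigma=1.0):
    """Assumed correntropy-induced loss for one sample (a, label)."""
    residual = label - a @ x
    return 0.5 * sigma**2 * (1.0 - np.exp(-(residual**2) / sigma**2))

def average_loss(x, features, labels, sigma=1.0):
    """Average of the per-sample losses; only function values are exposed,
    which is how a black-box oracle would be queried."""
    residuals = labels - features @ x
    return float(np.mean(0.5 * sigma**2 * (1.0 - np.exp(-(residuals**2) / sigma**2))))

# Hypothetical samples: a perfect fit and a gross outlier.
a = np.array([1.0, 0.0])
loss_fit = correntropy_loss(np.array([1.0, 5.0]), a, label=1.0)
loss_outlier = correntropy_loss(np.array([-100.0, 0.0]), a, label=1.0)
```

A function such as `average_loss` can be passed directly to a zeroth-order gradient estimator, since it needs no derivative information.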
In the experiment, we use several public real datasets (20news is from https://cs.nyu.edu/~roweis/data.html; the others are from www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), which are summarized in Table 2. For each dataset, we use half of the samples as training data and the rest as testing data. Figure 1 shows that the objective values of our algorithms decrease faster than those of the other algorithms as the CPU time increases. In particular, our algorithms perform better than the zeroth-order proximal algorithms. It is relatively difficult for these zeroth-order proximal methods to handle the nonsmooth penalties in problem (7), so an iterative method (such as the classic ADMM method) has to be used to evaluate the proximal operator in these proximal methods.
6.2 Structured Attacks on Black-Box DNNs
In this subsection, we use our algorithms to generate adversarial examples to attack pre-trained DNN models, whose parameters are hidden from us and only whose outputs are accessible. Moreover, we consider an interesting problem: “What possible structures could adversarial perturbations have to fool black-box DNNs?” Thus, we use the zeroth-order algorithms to find a universal structured adversarial perturbation that could fool the samples , which can be regarded as the following problem:
(8) 
where represents the final layer output before softmax of the neural network, and ensures the validity of the created adversarial examples. Specifically, if for all and , otherwise . Following [?], we use the overlapping lasso to obtain structured perturbations. Here, the overlapping groups are generated by dividing an image into subgroups of pixels.
In the experiment, we use pre-trained DNN models on MNIST and CIFAR-10 as the target black-box models, which attain and test accuracy, respectively. For MNIST, we select 20 samples from a target class and set batch size ; for CIFAR-10, we select 30 samples and set . In the experiment, we set , where and for MNIST and CIFAR-10, respectively. At the same time, we set the parameters , , and . For both datasets, the kernel size for the overlapping group lasso is set to and the stride is one.
Figure 3 shows that the attack losses (i.e., the first term of problem (8)) of our methods decrease faster than those of the other methods as the number of iterations increases. Figure 2 shows that our algorithms can learn some structured perturbations and can successfully attack the corresponding DNNs.
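The construction of overlapping pixel groups with a sliding kernel can be sketched as below. The window size of 3 used in the example is a placeholder assumption, since the kernel size is not recoverable here. The function names and the plain sum-of-group-norms penalty are likewise illustrative.

```python
import numpy as np

def overlapping_groups(height, width, kernel, stride=1):
    """Return flat pixel indices of every kernel x kernel window of an image."""
    idx = np.arange(height * width).reshape(height, width)
    groups = []
    for r in range(0, height - kernel + 1, stride):
        for c in range(0, width - kernel + 1, stride):
            groups.append(idx[r:r + kernel, c:c + kernel].ravel())
    return groups

def group_lasso_penalty(delta, groups):
    """Sum of Euclidean norms of the perturbation restricted to each group."""
    flat = np.asarray(delta, dtype=float).ravel()
    return float(sum(np.linalg.norm(flat[g]) for g in groups))

# Hypothetical 4 x 4 image with a 3 x 3 kernel and stride one.
groups = overlapping_groups(4, 4, kernel=3, stride=1)
penalty = group_lasso_penalty(np.ones((4, 4)), groups)
```

With stride one, neighbouring windows share most of their pixels, so the penalty encourages perturbations that are supported on a few contiguous patches rather than scattered pixels.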
7 Conclusions
In this paper, we proposed the fast ZO-SVRG-ADMM and ZO-SAGA-ADMM methods based on the coordinate smoothing gradient estimator, which use only objective function values to optimize. Moreover, we proved that the proposed methods have a convergence rate of $O(1/T)$. In particular, our methods not only reach the best existing convergence rate for nonconvex optimization, but can also effectively solve many machine learning problems with complex nonsmooth regularizations.
Acknowledgments
F.H., S.G., H.H. were partially supported by U.S. NSF IIS 1836945, IIS 1836938, DBI 1836866, IIS 1845666, IIS 1852606, IIS 1838627, IIS 1837956. S.C. was partially supported by the NSFC under Grant No. 61806093 and No. 61682281, and the Key Program of NSFC under Grant No. 61732006.
References
 [Agarwal et al., 2010] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28–40. Citeseer, 2010.
 [Beck and Teboulle, 2009] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
 [Boyd et al., 2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
 [Chen et al., 2017] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
 [Defazio et al., 2014] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.
 [Duchi et al., 2015] John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE TIT, 61(5):2788–2806, 2015.
 [Gabay and Mercier, 1976] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
 [Gao et al., 2018] Xiang Gao, Bo Jiang, and Shuzhong Zhang. On the information-adaptive variants of the ADMM: an iteration complexity perspective. Journal of Scientific Computing, 76(1):327–363, 2018.
 [Ghadimi and Lan, 2013] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
 [Ghadimi et al., 2016] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
 [Gu et al., 2018] Bin Gu, Zhouyuan Huo, Cheng Deng, and Heng Huang. Faster derivative-free stochastic algorithm for shared memory machines. In ICML, pages 1807–1816, 2018.
 [He et al., 2011] Ran He, Wei-Shi Zheng, and Bao-Gang Hu. Maximum correntropy criterion for robust face recognition. IEEE TPAMI, 33(8):1561–1576, 2011.
 [Huang et al., 2016] Feihu Huang, Songcan Chen, and Zhaosong Lu. Stochastic alternating direction method of multipliers with variance reduction for nonconvex optimization. arXiv preprint arXiv:1610.02758, 2016.
 [Huang et al., 2019] Feihu Huang, Bin Gu, Zhouyuan Huo, Songcan Chen, and Heng Huang. Faster gradient-free proximal stochastic methods for nonconvex nonsmooth optimization. In AAAI, 2019.
 [Jiang et al., 2019] Bo Jiang, Tianyi Lin, Shiqian Ma, and Shuzhong Zhang. Structured nonconvex and nonsmooth optimization: algorithms and iteration complexity analysis. Computational Optimization and Applications, 72(1):115–157, 2019.
 [Johnson and Zhang, 2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
 [Kim et al., 2009] Seyoung Kim, Kyung-Ah Sohn, and Eric P Xing. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics, 25(12):i204–i212, 2009.
 [Liu et al., 2018a] Sijia Liu, Jie Chen, Pin-Yu Chen, and Alfred Hero. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. In AISTATS, volume 84, pages 288–297, 2018.
 [Liu et al., 2018b] Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In NIPS, pages 3731–3741, 2018.
 [Nesterov and Spokoiny, 2017] Yurii Nesterov and Vladimir G. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566, 2017.
 [Ouyang et al., 2013] Hua Ouyang, Niao He, Long Tran, and Alexander G Gray. Stochastic alternating direction method of multipliers. ICML, 28:80–88, 2013.
 [Suzuki, 2014] Taiji Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In ICML, pages 736–744, 2014.
 [Taylor et al., 2016] Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein. Training neural networks without gradients: a scalable ADMM approach. In ICML, pages 2722–2731, 2016.
 [Wang et al., 2015] Fenghui Wang, Wenfei Cao, and Zongben Xu. Convergence of multi-block Bregman ADMM for nonconvex composite problems. arXiv preprint arXiv:1505.03063, 2015.
 [Xu et al., 2018] Kaidi Xu, Sijia Liu, Pu Zhao, Pin-Yu Chen, Huan Zhang, Deniz Erdogmus, Yanzhi Wang, and Xue Lin. Structured adversarial attack: Towards general implementation and better interpretability. arXiv preprint arXiv:1808.01664, 2018.
 [Zheng and Kwok, 2016] Shuai Zheng and James T Kwok. Fast-and-light stochastic ADMM. In IJCAI, pages 2407–2613, 2016.