Multiconvex Inequality-constrained Alternating Direction Method of Multipliers
Abstract
In recent years, the Alternating Direction Method of Multipliers (ADMM) has been applied empirically to many multiconvex problems, delivering impressive performance in areas such as adversarial learning and nonnegative matrix factorization; nevertheless, there remains a dearth of generic work on multiconvex ADMM with theoretical guarantees under mild conditions. In this paper, we propose a novel generic framework of multiconvex inequality-constrained ADMM (miADMM) with multiple coupled variables in both the objective and the constraints. Theoretical properties such as convergence conditions and convergence guarantees are discussed and proven. Several important applications, drawn from a wide variety of topical machine learning problems, are discussed as special cases under our miADMM framework. Extensive experiments on one synthetic dataset and ten real-world datasets covering multiple applications demonstrate the proposed framework's effectiveness, scalability, and convergence properties.
1 Introduction
Due to the advantages and popularity of nondifferentiable regularization and distributed computing for complex optimization problems, the Alternating Direction Method of Multipliers (ADMM) has received a great deal of attention in recent years [4]. The standard ADMM was originally proposed to solve the following separable convex optimization problem:

min_{x,z} f(x) + g(z), subject to Ax + Bz = c,

where f and g are closed convex functions, A and B are matrices, and c is a vector. There are extensive reports in the literature exploring the theoretical properties of ADMM and its variants for convex optimization problems, including multi-block ADMM [11], Bregman ADMM [32], fast ADMM [13, 18], and stochastic ADMM [25]. ADMM has now been extended to cover a wide range of nonconvex problems and has achieved significant performance in many practical applications [37].
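To make the standard two-block scheme above concrete, the following minimal NumPy sketch applies it to a lasso instance, min_x 0.5||Ax − b||^2 + lam||z||_1 subject to x − z = 0. The problem data, function names, and parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def soft_threshold(v, k):
    # proximal operator of k * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam=0.05, rho=1.0, iters=300):
    """Standard two-block ADMM (scaled dual form) for
    min_x 0.5*||A x - b||^2 + lam*||z||_1  s.t.  x - z = 0."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    # cache the Cholesky factor of (A^T A + rho*I) for the x-update
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))
    Atb = A.T @ b
    for _ in range(iters):
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))  # x-minimization
        z = soft_threshold(x + u, lam / rho)               # z-minimization
        u = u + x - z                                      # dual update
    return z

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true
x_hat = admm_lasso(A, b)
```

Each iteration alternates one convex subproblem per block plus a cheap dual step, which is the pattern miADMM generalizes to multiple blocks.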
Unlike convex problems, nonconvex optimization based on ADMM is much more difficult, and the behavior of ADMM for nonconvex problems has been largely a mystery [37]. Current theoretical analyses of nonconvex ADMM typically focus on special nonconvex problems under strict conditions. Most of the existing work provides theoretical guarantees only under the assumption that the variables are either decoupled or drawn from convex sets. Recently, however, there have been an increasing number of real-world applications where the objective functions are multiconvex (i.e., nonconvex in all the variables jointly but convex in each variable when all the others are fixed). For example, a discriminative model and a generative model may be optimized alternately in an adversarial learning framework: the discriminative model trains a classifier while the generative model maximizes the probability of the classifier making mistakes [14]. Similarly, a dictionary learning application may learn the dictionary and the coefficients simultaneously [23]. Nonnegative matrix factorization, which aims to decompose a matrix into a product of two matrices, has been applied widely in computer vision, machine learning, and various other fields [20], and bilinear matrix inequality problems have been designed for the analysis of linear and nonlinear uncertain systems [15]. All of these can be considered special cases of the following problem, which is our focus in this paper:
Problem 1:

min_{x_1, …, x_n, y}  f(x_1, …, x_n) + Σ_{i=1}^n g_i(x_i) + h(y),
s.t.  Σ_{i=1}^n A_i x_i = y,  U(x_1, …, x_n) ≤ 0,

where f and U are proper, continuous, multiconvex and possibly nonsmooth functions, the g_i are proper, continuous, convex and possibly nonsmooth functions, and h is a proper, differentiable and convex function. The A_i are matrices with full column rank.
However, Problem 1 is very difficult to solve. Firstly, the objective function is nonconvex: the coupled function is nonconvex, and the tightly coupled variables lie in a nonconvex set. This type of problem has not yet been rigorously and systematically investigated. Secondly, Problem 1 has multiple constraints: aside from the equality constraint, the inequality constraint involves a coupled and nonsmooth function. No existing ADMM framework addresses optimization problems with coupled inequality constraints like Problem 1. Moreover, the convergence properties of an ADMM designed to solve Problem 1 remain unknown. In order to address these challenges simultaneously, we propose a novel multiconvex inequality-constrained Alternating Direction Method of Multipliers (miADMM) to solve Problem 1. Our proposed method splits the complex Problem 1 into multiple smaller subproblems, each of which is projected onto a convex set and thus can be solved exactly. These solvable subproblems support the convergence guarantee of the miADMM. Furthermore, we propose novel mild conditions that ensure the global convergence of miADMM, so that it always converges to a critical point for any initialization [19]. Our contributions in this paper include:

We propose a novel generic framework for multiconvex inequality-constrained ADMM (miADMM) to solve Problem 1. The miADMM breaks the nonconvex Problem 1 into small local convex subproblems, which are then coordinated to find a solution to Problem 1. The standard ADMM is a special case of our miADMM.

We investigate the convergence properties of the new miADMM. Specifically, we prove that the variables in Problem 1 and their gradients are bounded during the iterations, and that the objective value decreases monotonically. Moreover, miADMM is guaranteed to converge to a critical point, with a convergence rate of o(1/k).

We demonstrate several important and promising applications that are special cases of our proposed miADMM framework, and benefit from its theoretical properties. Specifically, we present five applications in the fields of machine learning and control, and give concrete algorithms to solve them using our miADMM framework.

We conduct extensive experiments to validate our proposed miADMM. Experiments on a synthetic dataset and ten realworld datasets demonstrate its effectiveness, scalability, and convergence properties.
The rest of this paper is organized as follows: Section 2 summarizes previous work related to this paper. Section 3 introduces the new miADMM algorithm and its convergence properties. In Section 4, the miADMM algorithm is applied to several important applications. The extensive experiments that have been conducted are described in Section 5. The paper concludes with a summary of the work in Section 6.
2 Related Work
Multiconvex optimization problem: Several works have studied multiconvex problems. The earliest work required the objective function to be continuously differentiable and strictly convex [35]. Various conditions on the separability and regularity of the objective functions have been discussed in [29, 30]. In the most recent work, Xu and Yin presented three types of multiconvex algorithms and analyzed convergence under either a Lipschitz-differentiability or a strong-convexity assumption [36]. For a comprehensive survey, see [28]. However, to the best of our knowledge, few of these works allow the objective function to be both nonsmooth and coupled at the same time.
Nonconvex ADMM: Despite the outstanding empirical performance of nonconvex ADMM, theoretical research on it remains limited due to the complexity introduced by both multiple coupled variables and various (inequality and equality) constraints.
Specifically, Hong et al. [17] and Cui et al. [10] proposed majorized ADMMs and gave convergence guarantees when the step length was either sufficiently small or sufficiently large. Gao and Zhang discussed the convergence properties when the coupled objective function is jointly convex [12]. Wang et al. presented convergence conditions for coupled objective functions that are nonconvex and nonsmooth [34]. Chen et al. discussed quadratic coupling terms [7].
3 Multiconvex Inequality-constrained ADMM (miADMM)
In this section, we present the framework of the new miADMM. Section 3.1 presents the formulation of miADMM, and in Section 3.2 we prove the theoretical convergence of the miADMM under several mild assumptions.
3.1 The miADMM algorithm
In Problem 1, the variables in the inequality constraint are coupled and difficult to handle directly. To overcome this challenge, we absorb the inequality constraint into an indicator function, so that the augmented Lagrangian function can be reformulated as follows:
(1) 
where the indicator function equals 0 if the inequality constraint is satisfied and +∞ otherwise, a dual variable is attached to the equality constraint, and a penalty parameter weights the augmented term. The miADMM optimizes the following subproblems alternately.
(2)  
The first n subproblems can be written equivalently in the following form, for i = 1, …, n:
(3) 
Algorithm 1 presents the procedure for Problem 1. Concretely, Lines 3–5 and Line 6 update the primal blocks and the auxiliary variable, respectively, and Line 7 updates the dual variable, following the routine of the standard ADMM. Each subproblem is convex and solvable.
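To make the alternating block structure concrete, the toy sketch below performs the exact convex block updates on a simple biconvex function (the constraint and dual machinery of Algorithm 1 is omitted; the function and its values are illustrative, not from the paper). Each one-dimensional subproblem is a convex quadratic with a closed-form minimizer, mirroring why miADMM's subproblems can be solved exactly.

```python
def block_minimize(iters=100):
    """Alternating exact minimization of the biconvex function
    f(x, y) = (x*y - 1)**2 + 0.5*x**2 + 0.5*y**2.
    With one variable fixed, f is a 1-D convex quadratic in the other,
    so each block update has a closed form."""
    x, y = 1.0, 1.0
    for _ in range(iters):
        x = 2 * y / (2 * y ** 2 + 1)  # argmin_x f(x, y): solve df/dx = 0
        y = 2 * x / (2 * x ** 2 + 1)  # argmin_y f(x, y): solve df/dy = 0
    return x, y

x, y = block_minimize()
# the iterates approach the stationary point x = y = 1/sqrt(2)
f_val = (x * y - 1) ** 2 + 0.5 * x ** 2 + 0.5 * y ** 2
```

Monotone descent of f along the iterates is exactly the behavior that Property 2 below formalizes for the full augmented Lagrangian.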
3.2 Convergence Analysis
In this section, we analyze the conditions and properties required for the global convergence of miADMM. We first present the necessary definitions and assumptions, then prove several key properties that lead to global convergence.
3.2.1 Definitions and assumptions
First, recall the definition of Lipschitz differentiability [6]:
Definition 1 (Lipschitz differentiability).
A differentiable function f is Lipschitz differentiable if for any x and y,

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖,

where L ≥ 0 is a constant and ∇f denotes the gradient of f.
This can be generalized to a new definition of Lipschitz subdifferentiability as follows:
Definition 2 (Lipschitz Subdifferentiability).
A function g is Lipschitz subdifferentiable if for any x and y, there exist two subgradients s_x ∈ ∂g(x) and s_y ∈ ∂g(y) such that

‖s_x − s_y‖ ≤ L‖x − y‖,

where L ≥ 0 is a constant and ∂g denotes the subdifferential of g.
It is easy to see that Lipschitz subdifferentiability generalizes Lipschitz differentiability [3], as all Lipschitz differentiable functions are also Lipschitz subdifferentiable. Moreover, the indicator function in Equation (1) is not Lipschitz differentiable everywhere, but it satisfies Lipschitz subdifferentiability on its domain. This property is crucial in proving Property 3, as discussed later.
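As a small numerical sanity check of Definition 1, the sketch below verifies the Lipschitz inequality empirically for the least-squares loss f(x) = 0.5‖Ax − b‖², whose gradient is Lipschitz with constant equal to the largest eigenvalue of AᵀA. The problem data are illustrative and not tied to the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

def grad(x):
    # gradient of the least-squares loss f(x) = 0.5 * ||A @ x - b||^2
    return A.T @ (A @ x - b)

# f is Lipschitz differentiable with constant L = lambda_max(A^T A),
# since grad(x) - grad(y) = A^T A (x - y)
L = np.linalg.eigvalsh(A.T @ A).max()

violations = 0
for _ in range(200):
    x, y = rng.standard_normal(10), rng.standard_normal(10)
    if np.linalg.norm(grad(x) - grad(y)) > L * np.linalg.norm(x - y) + 1e-9:
        violations += 1
```

For this quadratic loss the bound is tight in the direction of the leading eigenvector, so no smaller constant works uniformly.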
Next, several mild assumptions are imposed to ensure global convergence of the new method:
Assumption 1 (Coercivity).
The objective of Problem 1 is coercive over the nonempty feasible set; in other words, it tends to +∞ whenever a feasible sequence of iterates tends to infinity in norm.
Coercivity is such a weak condition for the objective function that many applications satisfy this assumption. For example, most common loss functions, including log loss, hinge loss and square loss, do so.
Assumption 2 (Lipschitz Differentiability and Subdifferentiability).
The differentiable part of the objective is Lipschitz differentiable with some constant, and the nonsmooth part is Lipschitz subdifferentiable with some constant.
Many problems can be reformulated into an equivalent miADMM form by introducing an auxiliary variable, as discussed in Section 4. The resulting differentiable term is Lipschitz differentiable, so this assumption is satisfied; by Definition 2, it is also Lipschitz subdifferentiable.
3.2.2 Key Properties
This section focuses on the global convergence of the miADMM algorithm. Specifically, if Assumptions 1–2 are satisfied, then Properties 1–3 hold, as shown below. These are the key properties that ensure the convergence of the miADMM: as long as they hold, the miADMM is guaranteed to converge globally to a critical point.
Property 1 (Boundedness).
If the penalty parameter is sufficiently large, then, starting from any feasible initialization, the sequence generated by miADMM is bounded, and the augmented Lagrangian defined in Equation (1) is lower bounded.
Property 1 confirms that all variables are bounded and that the augmented Lagrangian is lower bounded. It is proven under Assumptions 1 and 2, and its proof can be found in Theorem 4 in the supplementary materials.
Property 2 (Sufficient Descent).
If the penalty parameter is sufficiently large, then there exists a constant C > 0 such that
(4) 
Property 2 guarantees that the value of the augmented Lagrangian decreases monotonically provided the penalty parameter is sufficiently large. Property 2 holds under Assumptions 1 and 2, and its proof can be found in Theorem 5 in the supplementary materials.
Property 3 (Subgradient Bound).
There exist constants such that
(5) 
Property 3 states that the subgradient of the augmented Lagrangian has an upper bound, which requires Assumption 2. Its proof can be found in Theorem 6 in the supplementary materials. The following three theorems summarize the convergence of the miADMM. The first theorem confirms that the three properties are satisfied by miADMM.
Theorem 1 (Convergence Properties).
If Assumptions 1–2 hold, then Properties 1–3 hold for the sequence generated by miADMM.
The second theorem ensures that the miADMM converges to a critical point for any initial point.
Theorem 2 (Global Convergence).
For the variables in Problem 1, starting from any feasible initial point, the sequence generated by miADMM has at least one limit point, and every limit point is a critical point; that is, the subgradient of the augmented Lagrangian vanishes at the limit point.
Proof.
Since the sequence of iterates is bounded, there exists a convergent subsequence with a limit point. By Properties 1 and 2, the augmented Lagrangian is nonincreasing and lower bounded, so the successive differences of the iterates vanish as the iteration number grows. By Property 3, we infer that there exists a sequence of subgradients converging to zero along the subsequence. According to the definition of the general subgradient (Definition 8.3 in [27]), the limit point is a critical point. ∎
The third theorem proves that our proposed miADMM achieves a convergence rate of o(1/k), despite the nonconvex and complex nature of Problem 1. This rate matches the state of the art even compared with methods for simpler convex problems. The theorem is as follows:
Theorem 3 (Convergence Rate).
For the sequence generated by miADMM, define c_k as the minimum, over the first k iterations, of the squared successive-iterate difference; then c_k converges at a rate of o(1/k).
The proof of this theorem is in Appendix C in the supplementary materials. The convergence rate of miADMM is consistent with much existing work analyzing convex ADMM, including [16, 22, 11]. Our contribution in terms of convergence rate is that we extend the o(1/k) guarantee to multiconvex problems (Problem 1).
4 Applications
In this section, we apply our proposed miADMM to several real-world applications, all of which conform to Problem 1 and benefit from the convergence properties of the miADMM. The formulation of Problem 1 is widely applicable, covering nonnegative matrix factorization, nonnegative tensor completion, and dictionary learning [28, 36]. In the following sections, five novel applications are introduced in turn: weakly-constrained multitask learning, learning with signed-network constraints, the bilinear matrix inequality problem, sparse dictionary learning, and nonnegative matrix factorization.
4.1 Weakly-constrained Multitask Learning
In multitask learning problems, multiple tasks are learned jointly to achieve better performance than learning the tasks independently [38]. Most work on multitask learning has tended to enforce an assumption of similarity among the feature weight values across tasks [2, 8, 33, 38, 41], because this makes it possible to use convex regularization terms such as norms [33] and Graph Laplacians [41]. However, this assumption is usually too strong and is seldom satisfied by real-world data. Instead of requiring feature weights to be similar in magnitude, a more conservative but arguably more reasonable assumption is that multiple tasks share similar polarities for the same feature: if a feature is positively relevant to the output of one task, its weight will also be positive for related tasks. This assumption is appropriate for many applications. For example, the feature 'number of clinic visits' will be positively related to flu outbreaks, while the feature 'popularity of vaccination' will be negatively related to them, even though their feature weights can vary dramatically across different countries (namely, the tasks here). This is achieved by requiring every pair of tasks with neighboring indices to have the same weight signs. The optimization objective is as follows:
(6)  
where the two index ranges run over tasks and features, respectively, and each task has its own weight vector, loss function, and regularization term. The inequality constraint implies that tasks with neighboring indices share the same sign for each feature weight.
However, Equation (6) is nonconvex and thus difficult for existing frameworks to optimize. Fortunately, our miADMM can address this issue by rewriting Equation (6) in the following form:
(7)  
where an auxiliary variable is introduced to make this problem compatible with Problem 1. The miADMM algorithm for this case is shown in Appendix D.1 in the supplementary materials.
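To illustrate the same-sign requirement geometrically, the helper below computes the Euclidean projection of a pair of weights onto the set {(a, b) : a·b ≥ 0} that the weak constraint encodes. This is an illustrative sketch only; the paper's full algorithm is deferred to its Appendix D.1, and the function name is our own.

```python
def project_same_sign(a, b):
    """Euclidean projection of the pair (a, b) onto {(a, b): a*b >= 0},
    the same-sign set encoded by the weak multitask constraint.
    If the signs already agree (or either entry is zero), nothing changes;
    otherwise the entry of smaller magnitude is set to zero, which is the
    closest point of the set."""
    if a * b >= 0:
        return a, b
    # signs disagree: zero out the entry of smaller magnitude
    return (a, 0.0) if abs(a) >= abs(b) else (0.0, b)
```

Because the set is a union of two quadrants, the projection only ever moves one coordinate, so the constraint perturbs the weights far less than a magnitude-similarity penalty would.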
4.2 Learning with Signed-Network Constraints
The application of network models to social network analysis has attracted the attention of a number of researchers [5]. For example, influential societal events often spread across many social networking sites and are expressed in different languages. Such multilingual indicators usually transmit similar semantic information through networks and have thus been utilized to facilitate social event forecasting [39]. The problem with network constraints is formulated as follows:
where each node has a weight vector, a loss function, and a regularization term. Two edge sets represent two opposite relationships: an edge in the first set requires the corresponding elements of the two incident weight vectors to share the same sign, while an edge in the second set requires them to have opposite signs. This problem can be reformulated equivalently as follows:
(8)  
where an auxiliary variable is introduced to fit this problem into Problem 1. The miADMM algorithm for this case is also shown in Appendix D.2 in the supplementary materials.
4.3 Bilinear Matrix Inequality Problem
The Bilinear Matrix Inequality (BMI) problem has broad applications in many system and control designs [31, 9]. Consider the following BMI formulation:
where the coefficient matrices are symmetric and the constraint denotes positive semidefiniteness. Minimizing over the two variable blocks alternately is a popular method for dealing with the BMI problem because of its simplicity and effectiveness [9], as each subproblem is then a linear matrix inequality (LMI) problem and can thus be solved efficiently. However, this method does not necessarily converge. Instead, applying our miADMM ensures global convergence, as the problem can be reformulated as follows:
(9)  
where an auxiliary variable is introduced to fit this problem into Problem 1. The miADMM algorithm for this example is shown in Appendix D.3 in the supplementary materials.
4.4 Sparse Dictionary Learning
The sparse dictionary learning problem aims to decompose the data matrix Y into a product of a dictionary D and a sparse coefficient matrix Z [28], which is formulated as follows:

min_{D,Z} (1/2)‖Y − DZ‖_F² + λ‖Z‖₁,  s.t. ‖d_i‖₂ ≤ 1 for each dictionary column d_i,

where λ ≥ 0 is a penalty parameter. It is reformulated mathematically below:
(10)  
where an auxiliary variable is introduced to fit this problem into Problem 1. The miADMM algorithm for this problem is shown in Appendix D.4 in the supplementary materials.
4.5 Nonnegative Matrix Factorization
Nonnegative matrix factorization is a classical problem that is broadly applicable to a number of different applications [4, 20]. The goal is to decompose a matrix X into a product of two nonnegative matrices U and V. The problem is formulated as:

min_{U,V} (1/2)‖X − UV‖_F²,  s.t. U ≥ 0, V ≥ 0.
Unlike the solution suggested by [4], our proposed miADMM, which includes a convergence guarantee, reformulates the problem as follows:
(11) 
where an auxiliary variable is incorporated to fit this problem into Problem 1. The miADMM algorithm for this factorization is shown in Appendix D.5 in the supplementary materials.
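For reference, the sketch below solves a small NMF instance with the classical multiplicative updates of Lee and Seung. This is a standard baseline used only to illustrate the problem, not the paper's miADMM solver (which is in its Appendix D.5); the sizes and iteration count are illustrative.

```python
import numpy as np

def nmf(X, r, iters=1000, eps=1e-9):
    """Multiplicative-update NMF: X ≈ U @ V with U, V >= 0
    (Lee-Seung updates for the Frobenius objective)."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    U = rng.random((m, r)) + 0.1  # positive initialization keeps the
    V = rng.random((r, n)) + 0.1  # multiplicative iterates nonnegative
    for _ in range(iters):
        V *= (U.T @ X) / (U.T @ U @ V + eps)
        U *= (X @ V.T) / (U @ V @ V.T + eps)
    return U, V

# an exactly low-rank nonnegative matrix: the fit error should shrink
rng = np.random.default_rng(0)
X = rng.random((20, 4)) @ rng.random((4, 15))
U, V = nmf(X, 4)
err = np.linalg.norm(X - U @ V) / np.linalg.norm(X)
```

The updates decrease the Frobenius objective monotonically but carry no critical-point guarantee of the kind Theorem 2 provides for miADMM.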
5 Experiments
In this section, we validate the miADMM using a synthetic dataset and ten real-world datasets across several applications. Scalability, effectiveness, and convergence properties are compared with several existing state-of-the-art methods on many real datasets. All the experiments were conducted on a 64-bit machine with an Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz and 16.0GB of memory.
5.1 Experiment I: Synthetic Dataset
A very straightforward numerical application of our miADMM framework is to solve the following regularized linear regression problem with biconvex constraints:
(12)  
where y_i is the response of the i-th sample and x_ij denotes the j-th feature of the i-th sample; w1 and w2 represent the coefficients of the first m features and the second m features, respectively, and a penalty parameter controls the regularization. Hence, n and 2m are the numbers of samples and features, respectively.
Data Generation and Parameter Settings. The true w1 and w2 were generated from a uniform distribution, and the features were generated from two uniform distributions. The response y was generated from the linear regression model, where the error term follows a Gaussian distribution. The penalty and regularization parameters were held fixed across runs.
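The generation protocol above can be sketched as follows. The distribution bounds, noise level, and problem sizes are illustrative assumptions; the paper's exact values are not reproduced here.

```python
import numpy as np

def make_synthetic(n=100, m=10, sigma=0.1, seed=0):
    """Synthetic data for the regularized regression test problem.
    Bounds of the uniform distributions and the noise level sigma are
    assumed for illustration, not taken from the paper."""
    rng = np.random.default_rng(seed)
    w1 = rng.uniform(-1, 1, m)          # true coefficients, first block
    w2 = rng.uniform(-1, 1, m)          # true coefficients, second block
    X = rng.uniform(-1, 1, (n, 2 * m))  # n samples, 2m features
    noise = rng.normal(0.0, sigma, n)   # Gaussian error term
    y = X[:, :m] @ w1 + X[:, m:] @ w2 + noise
    return X, y, w1, w2

X, y, w1, w2 = make_synthetic()
```

Varying n and m in this generator is how the scalability curves in Figures 1(c) and 1(d) can be reproduced.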
Baselines. In order to test the scalability of miADMM, two baselines were utilized for comparison: 1) Block Coordinate Descent (BCD) [36]. BCD is an intuitive method to solve multiconvex problems, which optimizes each variable alternately. 2) Interior Point Method (IPM) [24]. IPM is a classic barrier method to solve nonlinear optimization problems.
Performance on Convergence and Scalability. The problem in Equation (12) satisfies the convergence conditions and is thus guaranteed to converge under our miADMM. This is further demonstrated by Figure 1(a), which illustrates the change of the residual over the iterations and shows its convergence. Additionally, Figure 1(b) shows that the objective value also converges. Moreover, Figures 1(c) and (d) show the scalability of our miADMM and the comparison methods in n (the number of samples) and m (half the number of features). The results show that the time cost increases linearly in both n and m, and miADMM generally costs the least time among all these methods, especially compared with IPM. This is because our miADMM splits the biconvex constraints into two subproblems that are much easier to solve.
5.2 Experiment II: Weakly-constrained Multitask Learning
To evaluate the effectiveness of our method on the weakly-constrained multitask learning application described in Equation (7), a real-world school dataset is used. It consists of the examination scores over three years of 15,362 students from 139 secondary schools, which are treated as tasks for examination-score prediction based on 27 input features such as the year of the examination, school-specific features, and student-specific features. The dataset is publicly available and a detailed description can be found in the original paper [21]. The penalty parameter was held fixed for miADMM.
Metrics. In this experiment, five metrics were utilized to evaluate model performance. Mean Squared Error (MSE) measures the average of the squared differences between observations and estimates. Unlike MSE, Mean Squared Logarithmic Error (MSLE) penalizes the ratio between observation and estimate by working on a logarithmic scale. Mean Absolute Error (MAE) is also an error measurement, but it is computed from absolute differences. The lower these three metrics, the better the regression model. Explained Variance (EV) computes one minus the ratio of the variance of the error to that of the observation. The coefficient of determination, or R2 score, is the proportion of the variance in the dependent variable that is predictable from the independent variables. The higher EV and R2, the better the regression model.
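The five metrics can be computed directly with NumPy as below. The definitions follow the common conventions (e.g., those of scikit-learn) and are assumed rather than copied from the paper; MSLE additionally assumes nonnegative targets.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, MSLE, MAE, Explained Variance, and R2, as commonly defined.
    MSLE applies log1p to both arrays, so targets must be nonnegative."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
    mae = np.mean(np.abs(err))
    ev = 1.0 - np.var(err) / np.var(y_true)          # explained variance
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)
    return mse, msle, mae, ev, r2

# toy example with hypothetical scores
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])
mse, msle, mae, ev, r2 = regression_metrics(y_true, y_pred)
```

Note that EV and R2 coincide whenever the residuals have zero mean, which is why the two columns in Table 1 are nearly identical.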
Baselines.
In order to validate the effectiveness of miADMM, five benchmark multitask learning models serve as comparison methods. The loss functions were set to least-square errors. All parameters were set based on 5-fold cross-validation on the training set.
1. Multitask learning with Joint Feature Selection (JFS) [2, 41]. JFS is one of the most commonly used strategies in multitask learning. It captures the relatedness of multiple tasks by constraining the weight matrix to share a common set of features.
2. Clustered Multi-Task Learning (CMTL) [40, 41]. CMTL assumes that multiple tasks are clustered into several groups, and that tasks in the same group are similar to each other.
3. Multitask Lasso (mtLasso) [41]. mtLasso extends the classic Lasso model to the multitask learning setting.
4. A convex relaxation of Alternating Structure Optimization (cASO) [41, 1]. cASO decomposes each task into two components: a task-specific feature mapping and a task-shared feature mapping.
5. Robust Multi-Task Learning (RMTL) [8, 41]. RMTL aims to detect irrelevant tasks (outliers) among multiple tasks. One way to achieve this is to decompose the model into two parts: a low-rank structure to capture task relatedness and a group-sparse structure to detect outliers.
Performance.
As discussed in Section 4.1, the convergence of our miADMM is guaranteed by our theoretical framework. To verify this, Figures 2(a) and 2(b) illustrate the dual residuals and objective values over the iterations, which clearly demonstrates the convergence of the miADMM on this nonconvex problem. The performance of examination-score prediction on this dataset is then reported in Table 1. It shows that the weakly-constrained multitask learning model optimized by miADMM achieves the best performance on all the metrics compared with the other five methods. This is because our method enforces only that the signs of the feature weights agree across tasks, while the comparison methods typically impose overly aggressive assumptions on the similarity among tasks. For example, CMTL enforces that correlated tasks must have similar feature weights via squared regularization on the difference between feature weights. JFS, mtLasso, and RMTL still tend to enforce similar feature weights across tasks through norm regularization; because their enforcement is weaker than CMTL's, they obtain better performance. Finally, cASO yields relatively weak performance because it optimizes an approximation of a nonconvex problem, so its solution may be distant from the optimum of the original problem.
Scalability. To investigate the scalability of the miADMM compared with all baselines in Experiment II, we measured their training time on the school dataset as the number of features varies. The training time was averaged over 20 runs.
Figure 3 shows the training time of all methods when the number of features ranges from 10 to 28. The training time of all methods increased linearly with the number of features. cASO was the most efficient of all methods, with the miADMM ranked second. mtLasso, JFS, and RMTL also trained a model within 5 seconds on average. CMTL was the most time-consuming, requiring more than 10 seconds for training.
Table 1: Examination-score prediction performance on the school dataset.
Method  MSE  MSLE  MAE  EV  R2
JFS  114.1583  0.4457  8.4560  0.2945  0.2945 
CMTL  115.5530  0.4517  8.5067  0.2859  0.2859 
mtLasso  115.2800  0.4522  8.4874  0.2876  0.2876 
cASO  157.9920  0.5235  9.4062  0.1472  0.1472 
RMTL  114.1846  0.4478  8.4513  0.2944  0.2943 
miADMM  113.6600  0.4457  8.4168  0.2976  0.2976 
Table 2: AUC of event forecasting on the nine country datasets.
Method  BR  CL  CO  EC  EL  MX  PY  UY  VE

LogReg  0.686  0.677  0.644  0.599  0.618  0.661  0.616  0.628  0.667 
LASSO  0.685  0.677  0.648  0.603  0.636  0.665  0.615  0.666  0.669 
MTL  0.722  0.669  0.810  0.617  0.772  0.795  0.600  0.811  0.771 
MREF  0.714  0.563  0.515  0.784  0.612  0.693  0.658  0.681  0.588 
DHML  0.845  0.683  0.846  0.839  0.780  0.793  0.737  0.835  0.835 
miADMM  0.847  0.691  0.851  0.838  0.774  0.800  0.736  0.836  0.859 
Table 3: Training time (in seconds) on the nine country datasets.
Method  BR  CL  CO  EC  EL  MX  PY  UY  VE

LogReg  30,193  2,981  8,060  312  551  17,712  7,297  748  5,563 
LASSO  1,535  242  780  295  261  2,043  527  336  1,008 
MTL  233  35  108  17  17  853  40  20  49 
MREF  25,889  6,521  14,714  4,332  4,669  31,349  9,495  5,305  5,769 
DHML  332  852  87  46  33  175  242  82  179 
miADMM  20  12  17  7  3  30  6  4  22 
5.3 Experiment III: Event Forecasting with Multilingual Indicators
Datasets. To evaluate the performance of our miADMM on the application in Section 4.2, extensive experiments on nine real-world datasets have been performed. The dataset was obtained by randomly sampling 10% (by volume) of the Twitter data from Jan 2013 to Dec 2014. The data from the first and second years are used as the training and test sets, respectively. For the topic of interest (i.e., social unrest), we used 1,806 keywords in the three major languages of Latin America, namely English, Spanish, and Portuguese, as provided by [39]. Their translation relationships have also been labeled as semantic links among them, such as "protest" in English, "protesta" in Spanish, and "protesto" in Portuguese. The event forecasting results were validated against a labeled event set, known as the gold standard report (GSR), which is publicly available [26].
Metric and Baselines.
The metric used to evaluate the performance is the Area Under the Receiver Operating characteristic Curve (AUC). Five comparison methods were used, including the state-of-the-art Multitask Learning (MTL), Multi-resolution Event Forecasting (MREF), and Distant-supervision of Heterogeneous Multitask Learning (DHML), as well as the classic logistic regression (LogReg) and Lasso. The penalty parameter was set to 1 for miADMM. All the hyperparameters were tuned by 5-fold cross-validation.
Performance. As shown in Table 2, miADMM generally performs the best among all the methods, with DHML the second-best performer. Both of them typically outperform the others by at least 5%–10%. This is because both leverage the multilingual correlation among the features to boost model generalizability. Thanks to the multitask learning framework, MTL and MREF obtained competitive performance with AUC typically over 0.7, outperforming simple methods like LogReg and LASSO by 5% on average.
Efficiency. In Experiment III, we also compared the training time of the miADMM with all baselines on the nine datasets. The training time was averaged over 5 runs.
The training time is shown in Table 3. Overall, the miADMM was the most efficient of all the methods on every dataset, consuming no more than 30 seconds on any of them. MTL was ranked second, but it spent hundreds of seconds on some datasets, such as BR and MX. As the most time-consuming baselines, LogReg and MREF took thousands of seconds or more to train a model.
6 Conclusions
We propose a novel generic framework for multiconvex inequality-constrained optimization with multiple coupled variables, a new variant of ADMM named miADMM. miADMM not only inherits the merits of general ADMMs but also provides convergence guarantees under mild conditions. In addition, several machine learning applications of recent interest are presented as special cases of our proposed miADMM. Extensive experiments have been conducted on a synthetic dataset and ten real-world datasets, demonstrating the effectiveness, scalability, and convergence properties of our proposed miADMM. In the future, we may explore milder conditions than Lipschitz subdifferentiability, because some nonsmooth functions do not satisfy it.
References
 [1] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
 [2] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multitask feature learning. In Advances in neural information processing systems, pages 41–48, 2007.
 [3] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
 [4] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
 [5] Peter J Carrington, John Scott, and Stanley Wasserman. Models and methods in social network analysis, volume 28. Cambridge university press, 2005.
 [6] Fabio Cavalletti and Tapio Rajala. Tangent lines and lipschitz differentiability spaces. Analysis and Geometry in Metric Spaces, 4(1), 2016.
 [7] Caihua Chen, Min Li, Xin Liu, and Yinyu Ye. Extended admm and bcd for nonseparable convex minimization models with quadratic coupling terms: convergence analysis and insights. Mathematical Programming, pages 1–41, 2015.
 [8] Jianhui Chen, Jiayu Zhou, and Jieping Ye. Integrating low-rank and group-sparse structures for robust multi-task learning. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 42–50. ACM, 2011.
 [9] Wei-Yu Chiu. Method of reduction of variables for bilinear matrix inequality problems in system and control designs. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(7):1241–1256, 2017.
 [10] Ying Cui, Xudong Li, Defeng Sun, and Kim-Chuan Toh. On the convergence properties of a majorized admm for linearly constrained convex optimization problems with coupled objective functions. arXiv preprint arXiv:1502.00098, 2015.
 [11] Wei Deng, Ming-Jun Lai, Zhimin Peng, and Wotao Yin. Parallel multi-block admm with o(1/k) convergence. Journal of Scientific Computing, 71(2):712–736, 2017.
 [12] Xiang Gao and Shu-Zhong Zhang. First-order algorithms for convex optimization with nonseparable objective and coupled constraints. Journal of the Operations Research Society of China, 5(2):131–159, 2017.
 [13] Tom Goldstein, Brendan O’Donoghue, Simon Setzer, and Richard Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588–1623, 2014.
 [14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [15] Arash Hassibi, Jonathan How, and Stephen Boyd. A path-following method for solving bmi problems in control. In American Control Conference, 1999. Proceedings of the 1999, volume 2, pages 1385–1389. IEEE, 1999.
 [16] Bingsheng He and Xiaoming Yuan. On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
 [17] Mingyi Hong, Tsung-Hui Chang, Xiangfeng Wang, Meisam Razaviyayn, Shiqian Ma, and Zhi-Quan Luo. A block successive upper bound minimization method of multipliers for linearly constrained convex optimization. arXiv preprint arXiv:1401.7079, 2014.
 [18] Mojtaba Kadkhodaie, Konstantina Christakopoulou, Maziar Sanjabi, and Arindam Banerjee. Accelerated alternating direction method of multipliers. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 497–506. ACM, 2015.
 [19] Gert R Lanckriet and Bharath K Sriperumbudur. On the convergence of the concave-convex procedure. In Advances in neural information processing systems, pages 1759–1767, 2009.
 [20] Daniel D Lee and H Sebastian Seung. Algorithms for nonnegative matrix factorization. In Advances in neural information processing systems, pages 556–562, 2001.
 [21] Ya Li, Xinmei Tian, Tongliang Liu, and Dacheng Tao. Multi-task model and feature joint learning. In IJCAI, pages 3643–3649, 2015.
 [22] Tian-Yi Lin, Shi-Qian Ma, and Shu-Zhong Zhang. On the sublinear convergence rate of multi-block admm. Journal of the Operations Research Society of China, 3(3):251–274, 2015.
 [23] Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, and Francis R Bach. Supervised dictionary learning. In Advances in neural information processing systems, pages 1033–1040, 2009.
 [24] Sanjay Mehrotra. On the implementation of a primal-dual interior point method. SIAM Journal on optimization, 2(4):575–601, 1992.
 [25] Hua Ouyang, Niao He, Long Tran, and Alexander G Gray. Stochastic alternating direction method of multipliers. ICML (1), 28:80–88, 2013.
 [26] Terry Reed. Open source indicators project: https://doi.org/10.7910/DVN/EN8FUW, 2017.
 [27] R Tyrrell Rockafellar and Roger JB Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
 [28] Xinyue Shen, Steven Diamond, Madeleine Udell, Yuantao Gu, and Stephen Boyd. Disciplined multiconvex programming. In Control And Decision Conference (CCDC), 2017 29th Chinese, pages 895–900. IEEE, 2017.
 [29] Paul Tseng. Dual coordinate ascent methods for non-strictly convex minimization. Mathematical programming, 59(1–3):231–247, 1993.
 [30] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications, 109(3):475–494, 2001.
 [31] Jeremy G VanAntwerp and Richard D Braatz. A tutorial on linear and bilinear matrix inequalities. Journal of process control, 10(4):363–385, 2000.
 [32] Huahua Wang and Arindam Banerjee. Bregman alternating direction method of multipliers. In Advances in Neural Information Processing Systems, pages 2816–2824, 2014.
 [33] Lu Wang, Yan Li, Jiayu Zhou, Dongxiao Zhu, and Jieping Ye. Multi-task survival analysis. In 2017 IEEE International Conference on Data Mining (ICDM), pages 485–494. IEEE, 2017.
 [34] Yu Wang, Wotao Yin, and Jinshan Zeng. Global convergence of admm in nonconvex nonsmooth optimization. Journal of Scientific Computing, pages 1–35, 2015.
 [35] Jack Warga. Minimizing certain convex functions. Journal of the Society for Industrial and Applied Mathematics, 11(3):588–593, 1963.
 [36] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on imaging sciences, 6(3):1758–1789, 2013.
 [37] Zheng Xu, Soham De, Mario Figueiredo, Christoph Studer, and Tom Goldstein. An empirical study of admm for nonconvex problems. arXiv preprint arXiv:1612.03349, 2016.
 [38] Yu Zhang and Qiang Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114, 2017.
 [39] Liang Zhao, Junxiang Wang, and Xiaojie Guo. Distantsupervision of heterogeneous multitask learning for social event forecasting with multilingual indicators. 2018.
 [40] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Clustered multi-task learning via alternating structure optimization. In Advances in neural information processing systems, pages 702–710, 2011.
 [41] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Malsar: Multi-task learning via structural regularization. Arizona State University, 21, 2011.
Appendix
Appendix A Preliminary Lemmas for Proving Three Properties
In this section, we present preliminary lemmas that are useful for the proofs of the three properties. While Lemmas 2 and 3 depend on the optimality conditions of the subproblems, Lemmas 1 and 4 require Assumption 2.
Lemma 1.
It holds that ,
Proof.
Lemma 2.
It holds that for all .
Proof.
The optimality condition of gives rise to
Because , we have . ∎
Lemma 3.
It holds that for ,
(13) 
Proof.
where the second equality follows from the cosine rule: with , and .
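For completeness, the cosine rule invoked above is the standard three-point identity; since the particular instantiation of its arguments is elided in the text, we state it with generic vectors $a, b$ (equivalently $u, v, w$):

```latex
\|a + b\|^2 = \|a\|^2 + \|b\|^2 + 2\langle a,\, b\rangle,
\qquad\text{equivalently}\qquad
2\langle u - v,\, v - w\rangle = \|u - w\|^2 - \|u - v\|^2 - \|v - w\|^2 .
```

The second form follows from the first by setting $a = u - v$ and $b = v - w$, so that $a + b = u - w$.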
Because , we have the following result according to the definition of subgradient
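The definition of subgradient referred to here is the standard one for a convex function $f$: a vector $g$ is a subgradient of $f$ at $x$ if

```latex
f(y) \;\ge\; f(x) + \langle g,\; y - x\rangle \qquad \text{for all } y,
```

and the set of all such $g$ is the subdifferential $\partial f(x)$.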