Multi-convex Inequality-constrained Alternating Direction Method of Multipliers


Junxiang Wang (Department of Information Science and Technology, George Mason University), Liang Zhao (Department of Information Science and Technology, George Mason University), Lingfei Wu (Thomas J. Watson Research Center)
Abstract

In recent years, although the Alternating Direction Method of Multipliers (ADMM) has been empirically applied widely for many multi-convex applications, delivering an impressive performance in areas such as adversarial learning and nonnegative matrix factorization, there remains a dearth of generic work on multi-convex ADMM with a theoretical guarantee under mild conditions. In this paper, we propose a novel generic framework of multi-convex inequality-constrained ADMM (miADMM) with multiple coupled variables in both objective and constraints. Theoretical properties, including convergence conditions and guarantees, are discussed and proven. Several important applications are discussed as special cases under our miADMM framework. These cases are from a wide variety of topical machine learning problems. Extensive experiments on one synthetic dataset and ten real-world datasets related to multiple applications demonstrate the proposed framework’s effectiveness, scalability, and convergence properties.

1 Introduction

Due to its ability to handle non-differentiable regularizers and its suitability for distributed computing in complex optimization problems, the Alternating Direction Method of Multipliers (ADMM) has received a great deal of attention in recent years [4]. The standard ADMM was originally proposed to solve the following separable convex optimization problem:

$$\min_{x,z}\; f(x) + g(z) \quad \text{s.t.}\quad Ax + Bz = c,$$

where $f$ and $g$ are closed convex functions, $A$ and $B$ are matrices, and $c$ is a vector. There are extensive reports in the literature exploring the theoretical properties of ADMM and its variants for convex optimization problems, including multi-block ADMM [11], Bregman ADMM [32], fast ADMM [13, 18], and stochastic ADMM [25]. ADMM has now been extended to cover a wide range of nonconvex problems and has achieved significant performance in many practical applications [37].
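For concreteness, the textbook scaled-form ADMM [4] for the separable problem above alternates the following well-known updates, where $\rho > 0$ is a penalty parameter and $\lambda$ the dual variable (shown only as background; the notation here is the textbook one, not necessarily the notation used later in this paper):

$$
\begin{aligned}
x^{k+1} &= \operatorname*{arg\,min}_{x}\; f(x) + \tfrac{\rho}{2}\big\|Ax + Bz^{k} - c + \lambda^{k}/\rho\big\|_2^2,\\
z^{k+1} &= \operatorname*{arg\,min}_{z}\; g(z) + \tfrac{\rho}{2}\big\|Ax^{k+1} + Bz - c + \lambda^{k}/\rho\big\|_2^2,\\
\lambda^{k+1} &= \lambda^{k} + \rho\big(Ax^{k+1} + Bz^{k+1} - c\big).
\end{aligned}
$$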

Unlike convex problems, nonconvex optimization with ADMM is much more difficult, and the behavior of ADMM for nonconvex problems has been largely a mystery [37]. Current theoretical analyses of nonconvex ADMM typically focus on special nonconvex problems under strict conditions. Most of the existing work provides theoretical guarantees only under the assumption that the variables are either decoupled or drawn from convex sets. Recently, however, there have been an increasing number of real-world applications where the objective functions are multi-convex (i.e., nonconvex in all the variables jointly but convex in each variable when all the others are fixed). For example, a discriminative model and a generative model may be optimized alternately in an adversarial learning framework, where the discriminative model trains a classifier while the generative model maximizes the probability of the classifier making mistakes [14], and a dictionary learning application may learn the dictionary and the coefficients simultaneously [23]. Nonnegative matrix factorization, which aims to decompose a matrix into a product of two matrices, has been applied widely in computer vision, machine learning and various other fields [20], and the bilinear matrix inequality problem has been designed for the analysis of linear and nonlinear uncertain systems [15]. All of these can be considered special cases of the following problem, which is our focus in this paper:

Problem 1:

where , , and are proper, continuous, multi-convex and possibly nonsmooth functions, are proper, continuous, convex and possibly nonsmooth functions and is a proper, differentiable and convex function. are matrices with full column ranks.
However, Problem 1 is very difficult to solve. Firstly, the objective function is nonconvex: the coupled function is nonconvex, and the tightly coupled variables lie in a nonconvex set. This type of problem has not yet been rigorously and systematically investigated. Secondly, Problem 1 has multiple constraints: aside from the equality constraint, the inequality constraint involves a coupled and nonsmooth function. No existing ADMM framework addresses optimization problems with coupled inequality constraints like Problem 1. Moreover, the convergence properties of an ADMM designed to solve Problem 1 remain unknown. In order to address these challenges simultaneously, we propose a novel multi-convex inequality-constrained Alternating Direction Method of Multipliers (miADMM) to solve Problem 1. Our proposed method, miADMM, splits the complex Problem 1 into multiple smaller subproblems, each of which is projected onto a convex set and thus can be solved exactly. These solvable subproblems support the convergence guarantee of miADMM. Furthermore, we propose novel mild conditions to ensure the global convergence of miADMM, so it always converges to a critical point for any initialization [19]. Our contributions in this paper include:

  • We propose a novel generic framework for multi-convex inequality constrained ADMM (miADMM) to solve Problem 1. The miADMM breaks the nonconvex Problem 1 into small local convex subproblems, which are then coordinated to find a solution to Problem 1. The standard ADMM is a special case of our miADMM.

  • We investigate the convergence properties of the new miADMM. Specifically, we prove that the variables in Problem 1 and their gradients are bounded during the iterations and that the objective value decreases monotonically. Moreover, miADMM is guaranteed to converge to a critical point, with a convergence rate of o(1/k).

  • We demonstrate several important and promising applications that are special cases of our proposed miADMM framework, and benefit from its theoretical properties. Specifically, we present five applications in the fields of machine learning and control, and give concrete algorithms to solve them using our miADMM framework.

  • We conduct extensive experiments to validate our proposed miADMM. Experiments on a synthetic dataset and ten real-world datasets demonstrate its effectiveness, scalability, and convergence properties.

The rest of this paper is organized as follows: Section 2 summarizes previous work related to this paper. Section 3 introduces the new miADMM algorithm and its convergence properties. In Section 4, the miADMM algorithm is applied to several important applications. The extensive experiments that have been conducted are described in Section 5. The paper concludes with a summary of the work in Section 6.

2 Related Work

Multi-convex optimization problem: Several works have studied multi-convex problems. The earliest work required the objective function to be differentiable, continuous and strictly convex [35]. Various conditions on the separability and regularity of the objective functions have been discussed in [29, 30]. In the most recent work, Xu and Yin presented three types of multi-convex algorithms and analyzed their convergence under either Lipschitz differentiability or strong convexity assumptions [36]. For a comprehensive survey, see [28]. However, to the best of our knowledge, few of these works allow the objective function to be nonsmooth and coupled at the same time.
Nonconvex ADMM: Despite the outstanding empirical performance of nonconvex ADMM, theoretical research on it remains limited due to the complexity introduced by multiple coupled variables and various (inequality and equality) constraints. Specifically, Hong et al. [17] and Cui et al. [10] proposed majorized ADMMs and gave convergence guarantees when the step length was either small or large. Gao and Zhang discussed the convergence properties when the coupled objective function was jointly convex [12]. Wang et al. presented convergence conditions for the case where the coupled objective function is nonconvex and nonsmooth [34]. Chen et al. discussed quadratic coupled terms [7].

3 Multi-convex Inequality-constrained ADMM (miADMM)

In this section, we present the framework of the new miADMM. Section 3.1 shows the formulation of miADMM and in Section 3.2 we prove the theoretical convergence of the miADMM based on several mild assumptions.

3.1 The miADMM algorithm

In Problem 1, the variables in the inequality constraint are coupled and difficult to handle directly. To overcome this challenge, we absorb the inequality constraint into an indicator function, so that the augmented Lagrangian function can be reformulated mathematically as follows:

(1)

where the indicator function equals 0 if the inequality constraint is satisfied and $+\infty$ otherwise, the dual variable is attached to the equality constraint, and the penalty parameter weights the quadratic penalty term. The miADMM aims to optimize the following subproblems alternately.
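As a generic illustration only (the notation here, $x_1,\dots,x_n$ for the blocks, $y$ for the remaining variable, $U(x_1,\dots,x_n)\le 0$ for the inequality constraint, and $\sum_i A_i x_i + By = b$ for the equality constraint, is hypothetical and need not match Problem 1 exactly), an augmented Lagrangian of this type takes the form

$$
L_\rho = F(x_1,\dots,x_n,y) + \iota_{\{U(x_1,\dots,x_n)\le 0\}} + \lambda^{\top}\Big(\sum_i A_i x_i + By - b\Big) + \frac{\rho}{2}\Big\|\sum_i A_i x_i + By - b\Big\|_2^2,
$$

where $\iota_S$ equals $0$ on $S$ and $+\infty$ otherwise. Minimizing such an augmented Lagrangian one block at a time yields the subproblems in (2) below.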

(2)

The first n subproblems can be written equivalently in the following form, for i = 1, …, n:

(3)

Algorithm 1 summarizes the procedure for solving Problem 1. Concretely, Lines 3-5 and Line 6 update the primal blocks and the remaining primal variable, respectively. Line 7 updates the dual variable, which follows the routine of the standard ADMM. Each subproblem is convex and solvable.

Input: .
Output: .
1:   Initialize , .
2:   repeat
3:       for i=1 to n do
4:           Update in Equation (3).
5:       end for
6:       Update in Equation (2).
7:       .
8:       .
9:   until convergence.
10:   Output .
Algorithm 1 miADMM Algorithm to Solve Problem 1
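Algorithm 1 can be organized as the minimal Python skeleton below. It only mirrors the control flow of the listing; the subproblem solvers `solve_xi_subproblem` and `solve_y_subproblem` (standing in for Equations (3) and (2)), the residual map `residual`, and the stopping tolerance are hypothetical placeholders that must be supplied for each application:

```python
import numpy as np

def miadmm(x0, y0, solve_xi_subproblem, solve_y_subproblem, residual,
           rho=1.0, tol=1e-6, max_iter=500):
    """Structural sketch of Algorithm 1 (miADMM); not the paper's reference code.

    solve_xi_subproblem(i, x, y, lam, rho) -> new block x[i]   (cf. Equation (3))
    solve_y_subproblem(x, y, lam, rho)     -> new y            (cf. Equation (2))
    residual(x, y) -> equality-constraint residual vector
    """
    x = [xi.copy() for xi in x0]                 # blocks x_1, ..., x_n
    y = y0.copy()
    lam = np.zeros_like(residual(x, y))          # dual variable, initialized at zero

    for _ in range(max_iter):
        for i in range(len(x)):                  # Lines 3-5: convex block updates
            x[i] = solve_xi_subproblem(i, x, y, lam, rho)
        y = solve_y_subproblem(x, y, lam, rho)   # Line 6
        r = residual(x, y)
        lam = lam + rho * r                      # Line 7: dual ascent step
        if np.linalg.norm(r) < tol:              # Line 9: convergence check on the residual
            break
    return x, y, lam
```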

3.2 Convergence Analysis

In this section, we analyze the conditions and properties required for the global convergence of miADMM. We first present the necessary definitions and assumptions, and then prove several key properties that lead to global convergence.

3.2.1 Definitions and assumptions

First, recall the definition of Lipschitz differentiability [6]:

Definition 1 (Lipschitz differentiability).

A differentiable function $f$ is Lipschitz differentiable if, for any $x_1$ and $x_2$,

$$\|\nabla f(x_1) - \nabla f(x_2)\| \le L\,\|x_1 - x_2\|,$$

where $L > 0$ is a constant and $\nabla f$ denotes the gradient of $f$.

This can be generalized to a new definition of Lipschitz subdifferentiability as follows:

Definition 2 (Lipschitz Subdifferentiability).

A function $f$ is Lipschitz subdifferentiable if, for any $x_1$ and $x_2$, there exist two subgradients $g_1 \in \partial f(x_1)$ and $g_2 \in \partial f(x_2)$ such that

$$\|g_1 - g_2\| \le L\,\|x_1 - x_2\|,$$

where $L > 0$ is a constant and $\partial f$ denotes the set of subgradients of $f$.

It is easy to see that Lipschitz subdifferentiability is a generalization of Lipschitz differentiability [3], as all Lipschitz differentiable functions are also Lipschitz subdifferentiable. Moreover, the indicator function is not Lipschitz differentiable at , but it satisfies Lipschitz subdifferentiability when . This property is crucial in proving Property 3, as discussed later.
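As a quick numerical illustration (not taken from the paper), the quadratic $f(x)=x^2$ satisfies Definition 1 with $L=2$, whereas $|x|$ violates the subgradient bound across the kink at $0$, consistent with the remark in the Conclusions that some nonsmooth functions are not Lipschitz subdifferentiable:

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = x^2 is Lipschitz differentiable: |f'(a) - f'(b)| = 2|a - b|, so L = 2 works.
a, b = rng.normal(size=1000), rng.normal(size=1000)
assert np.all(np.abs(2 * a - 2 * b) <= 2.0 * np.abs(a - b) + 1e-12)

# g(x) = |x| is not Lipschitz subdifferentiable across 0: at +eps and -eps the only
# subgradients are +1 and -1, so |g1 - g2| = 2 while |x1 - x2| = 2 * eps -> 0.
eps = 1e-6
g1, g2 = 1.0, -1.0
print(abs(g1 - g2) / (2 * eps))   # ratio ~ 1e6; no finite constant L can work
```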

Next, several mild assumptions are imposed to ensure global convergence of the new method:

Assumption 1 (Coercivity).

The objective is coercive over the nonempty feasible set; in other words, it tends to infinity whenever the variables remain in the feasible set while their norm tends to infinity.

Coercivity is such a weak condition for the objective function that many applications satisfy this assumption. For example, most common loss functions, including log loss, hinge loss and square loss, do so.

Assumption 2 (Lipschitz Differentiability and Subdifferentiability).

is Lipschitz differentiable with constant , is Lipschitz subdifferentiable with constant .

Many problems can be reformulated to an equivalent miADMM formulation by introducing and making , as discussed below. Since is Lipschitz differentiable with , this assumption is satisfied. Based on Definition 2, is also Lipschitz subdifferentiable with .

3.2.2 Key Properties

This section establishes the global convergence of the miADMM algorithm. Specifically, if Assumptions 1-2 are satisfied, then Properties 1-3 hold, as shown below. These key properties ensure convergence: as long as they hold, miADMM is guaranteed to converge to a critical point globally.

Property 1 (Boundedness).

If , then starting from any initialization such that , the generated sequence is bounded, and the augmented Lagrangian defined in Equation (1) is lower bounded.

Property 1 confirms that all variables are bounded and that the value of the augmented Lagrangian defined in Equation (1) has a lower bound. It is proven under Assumptions 1 and 2, and its proof can be found in Theorem 4 in the supplementary materials.

Property 2 (Sufficient Descent).

If so that , then there exists such that

(4)

By Property 2, the value of the augmented Lagrangian is guaranteed to decrease monotonically if the penalty parameter is sufficiently large. Property 2 holds under Assumptions 1 and 2, and its proof can be found in Theorem 5 in the supplementary materials.
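The exact inequality (4) is given in the supplementary materials; as an illustrative template only (with hypothetical block notation and an unspecified constant $C_1>0$), sufficient-descent conditions of this kind typically read

$$
L_\rho(x^{k}, y^{k}, \lambda^{k}) - L_\rho(x^{k+1}, y^{k+1}, \lambda^{k+1}) \;\ge\; C_1\Big(\sum_{i=1}^{n}\|x_i^{k+1}-x_i^{k}\|_2^2 + \|y^{k+1}-y^{k}\|_2^2\Big),
$$

which is what forces the successive iterate differences to vanish in the convergence proof.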

Property 3 (Subgradient Bound).

There exists and such that

(5)

Property 3 states that the subgradient of the augmented Lagrangian has an upper bound, which requires Assumption 2. Its proof can be found in Theorem 6 in the supplementary materials. The following three theorems summarize the convergence of the miADMM. The first theorem confirms that the three properties are satisfied by miADMM.

Theorem 1 (Convergence Properties).

If and Assumptions 1 and 2 hold, then miADMM satisfies Properties 1, 2 and 3.

Proof.

This follows from Theorems 4, 5 and 6 in the supplementary materials. ∎

The second theorem ensures that the miADMM converges to a critical point for any initial point.

Theorem 2 (Global Convergence).

For the variables in Problem 1, starting from any initialization such that , the sequence generated by miADMM has at least one limit point, and any limit point is a critical point; that is, 0 belongs to the general subgradient of the augmented Lagrangian at any limit point.

Proof.

Since the generated sequence is bounded, there exists a convergent subsequence whose limit is a limit point. By Properties 1 and 2, the augmented Lagrangian is non-increasing and lower bounded, so the successive differences of the iterates vanish as the iteration number goes to infinity. Based on Property 3, we infer that there exist subgradients that also vanish along this subsequence. According to the definition of the general subgradient (Definition 8.3 in [27]), the limit point is a critical point. ∎

The third theorem proves that our proposed miADMM achieves a convergence rate of o(1/k), despite the nonconvex and complex nature of Problem 1. Such a rate matches the state of the art even compared with methods for simpler convex problems. The theorem is stated as follows:

Theorem 3 (Convergence Rate).

For a sequence generated by miADMM, define as in Appendix C; then its convergence rate is o(1/k).

The proof of this theorem can be found in Appendix C in the supplementary materials. The convergence rate of miADMM is consistent with much existing work analyzing convex ADMM, including [16, 22, 11]. Our contribution in terms of convergence rate is that we extend this o(1/k) guarantee to multi-convex problems (Problem 1).

4 Applications

In this section, we apply our proposed miADMM to several real-world applications, all of which conform to Problem 1 and benefit from the convergence properties of the miADMM. The formulation of Problem 1 is widely applied in many applications, including nonnegative matrix factorization, nonnegative tensor completion and dictionary learning [28, 36]. In the following sections, five novel applications are introduced in turn: weakly constrained multi-task learning, learning with sign-network constraints, the bilinear matrix inequality problem, sparse dictionary learning, and nonnegative matrix factorization.

4.1 Weakly-constrained Multi-task Learning

In multi-task learning problems, multiple tasks are learned jointly to achieve a better performance compared with learning tasks independently [38]. Most work on multi-task learning has tended to enforce the assumption of similarity among the feature weight values across tasks [2, 8, 33, 38, 41] because this makes it possible to use convex regularization terms like norms [33] and Graph Laplacians [41]. However, this assumption is usually too strong and is seldom satisfied by the real-world data. Instead of requiring feature weights to be similar in magnitude, a more conservative but probably more reasonable assumption is that multiple tasks share similar polarities for the same feature, which means that if a feature is positively relevant to the output of a task, then its weight will also be positive for other related tasks. This assumption is appropriate for many applications. For example, the feature ‘number of clinic visits’ will be positively related to flu outbreaks, while the feature ‘popularity of vaccination’ will be negatively related to them, even though their feature weights can vary dramatically for different countries (namely tasks here). This is achieved by enforcing the requirement for every pair of tasks with neighboring indices to have the same weight sign. This optimization objective is shown as follows:

(6)

where the two scalar parameters denote the number of tasks and the number of features, respectively; each entry of the weight matrix is the weight of one feature in one task, and each task has its own weight vector, loss function and regularization term. The inequality constraint requires that every pair of tasks with neighboring indices shares the same sign for each feature weight.
However, Equation (6) is nonconvex and thus difficult for existing frameworks to optimize. Fortunately, our miADMM can address this issue by rewriting Equation (6) in the following form:

(7)

where is an auxiliary variable that is applied to make this problem compatible with Problem 1. The miADMM algorithm for this case is shown in Appendix D.1 in the supplementary materials.
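To make the multi-convex structure concrete, the sketch below (a hypothetical illustration, not the Appendix D.1 updates) assumes a least-squares loss with $\ell_2$ regularization and shows that, once the neighboring task's weights are fixed, the sign-consistency constraint reduces to per-coordinate bounds, so each task update is a convex bound-constrained least-squares problem:

```python
import numpy as np
from scipy.optimize import lsq_linear

def update_task_weights(X, y, w_neighbor, alpha=0.1):
    """One block update for a single task (illustrative only).

    With the neighboring task's weights w_neighbor held fixed, the constraint
    w_j * w_neighbor_j >= 0 becomes a per-coordinate bound on w_j, so the
    subproblem is a convex bound-constrained least-squares problem.
    """
    n_samples, d = X.shape
    # Fold the ell_2 (ridge) term into an augmented least-squares system.
    A = np.vstack([X, np.sqrt(alpha) * np.eye(d)])
    b = np.concatenate([y, np.zeros(d)])
    lb = np.where(w_neighbor > 0, 0.0, -np.inf)   # neighbor positive -> w_j >= 0
    ub = np.where(w_neighbor < 0, 0.0, np.inf)    # neighbor negative -> w_j <= 0
    return lsq_linear(A, b, bounds=(lb, ub)).x
```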

4.2 Learning with Signed-Network Constraints

The application of network models for social network analysis has attracted the attention of a number of researchers [5]. For example, influential societal events often spread across many social networking sites and are expressed by different languages. Such multi-lingual indicators usually transmit similar semantic information through networks, and have thus been utilized to facilitate social event forecasting [39]. The problem with network constraints is formulated as follows:

where is the weight of the -th node, is a loss function and is a regularization term for the -th node. and are two edge sets that represent two opposite relationships: means that there exist and such that , while means that there exist and such that , where and denote the -th and -th elements of and , respectively. This problem can be reformulated equivalently as follows:

(8)

where is an auxiliary variable to fit this problem into Problem 1. The miADMM algorithm for this case is also shown in Appendix D.2 in the supplementary materials.

4.3 Bilinear Matrix Inequality Problem

The Bilinear Matrix Inequality (BMI) problem has a broad application across many system and control designs [31, 9]. Consider the following BMI formulation:

where and are symmetric matrices, , , and are vectors, and denotes positive semi-definiteness. Minimizing over the two blocks of variables alternately is a popular method for dealing with the BMI problem because of its simplicity and effectiveness [9], as each subproblem is then a linear matrix inequality (LMI) problem and can thus be solved efficiently. However, this method does not necessarily converge. Instead, applying our miADMM ensures global convergence, as the problem can be reformulated as follows:

(9)

where is an auxiliary variable to fit this problem into Problem 1. The miADMM algorithm for this example is shown in Appendix D.3 in the supplementary materials.
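To illustrate why fixing one block makes the BMI tractable, the sketch below uses CVXPY and assumes the common BMI form $F_0 + \sum_i x_i F_i + \sum_j y_j G_j + \sum_{i,j} x_i y_j H_{ij} \succeq 0$ with a linear cost, which may differ from the paper's exact formulation. With $x$ fixed, the constraint is a linear matrix inequality in $y$ and can be solved directly:

```python
import cvxpy as cp
import numpy as np

def solve_y_given_x(x, F0, F, G, H, c2):
    """LMI subproblem in y for a fixed x (hypothetical standard BMI form).

    F0, F[i], G[j], H[i][j]: symmetric numpy matrices; c2: linear cost on y.
    Requires an SDP-capable solver such as SCS (bundled with CVXPY).
    """
    m = len(G)
    y = cp.Variable(m)
    expr = F0 + sum(x[i] * F[i] for i in range(len(F)))
    expr = expr + sum(y[j] * (G[j] + sum(x[i] * H[i][j] for i in range(len(F))))
                      for j in range(m))
    expr = (expr + expr.T) / 2            # enforce symmetry for the PSD constraint
    prob = cp.Problem(cp.Minimize(c2 @ y), [expr >> 0])
    prob.solve()
    return y.value
```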

4.4 Sparse Dictionary Learning

The sparse dictionary learning problem aims to decompose the data matrix into a product of a dictionary and a sparse matrix [28], which is formulated as follows:

where is a penalty parameter. It is reformulated mathematically below:

(10)

where is an auxiliary variable to fit this problem into Problem 1. The miADMM algorithm for this problem is shown in Appendix D.4 in the supplementary materials.
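As an illustration of the multi-convex structure (not the exact updates of Appendix D.4), assume a squared reconstruction loss with an $\ell_1$ penalty on the codes and no dictionary-norm constraint. Fixing the dictionary makes the code update a Lasso-type proximal-gradient step, and fixing the codes makes the dictionary update a least-squares problem:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def dictionary_learning_pass(Y, D, Z, lam=0.1):
    """One alternating pass for (1/2)||Y - D Z||_F^2 + lam * ||Z||_1 (illustrative)."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-12)   # 1/L for the smooth part
    grad_Z = D.T @ (D @ Z - Y)
    Z = soft_threshold(Z - step * grad_Z, step * lam)  # convex prox step in Z
    D = Y @ np.linalg.pinv(Z)                          # exact least-squares update of D
    return D, Z
```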

4.5 Nonnegative Matrix Factorization

Nonnegative matrix factorization is a classical problem that is broadly applicable to a number of different applications [4, 20]. The goal of the nonnegative matrix factorization problem is to decompose a given matrix into a product of two nonnegative matrices of compatible dimensions. The problem is formulated as:

Unlike the solution suggested by [4], our proposed miADMM, which includes a convergence guarantee, reformulates the problem as follows:

(11)

where is an auxiliary variable that is incorporated to fit this problem into Problem 1. The miADMM algorithm for this factorization is shown in Appendix D.5 in the supplementary materials.
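As a hedged illustration of the alternating structure (not the miADMM updates of Appendix D.5), each factor update of $\min_{W,H\ge 0}\|Y - WH\|_F^2$ is a nonnegative least-squares problem once the other factor is fixed:

```python
import numpy as np
from scipy.optimize import nnls

def alternating_nnls(Y, rank, n_iter=50, seed=0):
    """Alternating nonnegative least squares for Y ~ W @ H (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(n_iter):
        # H-update: each column j solves min ||Y[:, j] - W h||_2 s.t. h >= 0
        H = np.column_stack([nnls(W, Y[:, j])[0] for j in range(n)])
        # W-update: each row i solves min ||Y[i, :] - H.T w||_2 s.t. w >= 0
        W = np.vstack([nnls(H.T, Y[i, :])[0] for i in range(m)])
    return W, H
```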

5 Experiments

In this section, we validate the miADMM using a synthetic dataset and ten real-world datasets on several applications. Scalability, effectiveness, and convergence properties are compared with several existing state-of-the-art methods on these datasets. All the experiments were conducted on a 64-bit machine with an Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz and 16.0GB memory.

5.1 Experiment I: Synthetic Dataset

A very straightforward numerical application of our miADMM framework is to solve the following regularized linear regression problem with biconvex constraints:

(12)

where is the response of a sample and denotes its features; and represent the coefficients of the first group of features and the second group of features, respectively; is a penalty parameter; and and are the number of samples and features, respectively.

Data Generation and Parameter Settings. The true and were generated from a uniform distribution between and . The features were generated from two uniform distributions between and . was generated from the linear regression model, where the error term follows a Gaussian distribution. and were both set to , and and were set to and .
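A sketch of this data-generation procedure is shown below; the numerical values in the comments are hypothetical placeholders rather than the settings used in the experiment:

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 1000, 50                        # hypothetical numbers of samples / features per block
w1_true = rng.uniform(0.0, 1.0, n)     # true coefficients (hypothetical range)
w2_true = rng.uniform(0.0, 1.0, n)
X1 = rng.uniform(0.0, 1.0, (m, n))     # first feature block (hypothetical range)
X2 = rng.uniform(0.0, 1.0, (m, n))     # second feature block (hypothetical range)
noise = rng.normal(0.0, 0.1, m)        # Gaussian error term (hypothetical scale)
y = X1 @ w1_true + X2 @ w2_true + noise
```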

Baselines. In order to test the scalability of miADMM, two baselines were utilized for comparison: 1) Block Coordinate Descent (BCD) [36]. BCD is an intuitive method to solve multi-convex problems, which optimizes each variable alternately. 2) Interior Point Method (IPM) [24]. IPM is a classic barrier method to solve nonlinear optimization problems.

Figure 1: Convergence and scalability on synthetic dataset.

Performance on Convergence and Scalability. The problem in Equation (12) clearly satisfies the convergence conditions and is thus guaranteed to converge under our miADMM. This is further demonstrated by Figure 1(a), which illustrates the change of the residual along the iterations and shows its convergence. Additionally, Figure 1(b) shows that the objective value also converges. Moreover, Figures 1(c) and (d) show the scalability of our miADMM and the comparison methods in (i.e., the number of samples) and (i.e., half the number of features). The results show that the time cost increases linearly in both. miADMM generally costs the least amount of time among all the methods, especially compared to IPM. This is because our miADMM splits the biconvex constraints into two subproblems that are much easier to solve.

5.2 Experiment II: Weakly-constrained Multi-task Learning

To evaluate the effectiveness of our method on the weakly-constrained multi-task learning application described in Equation (7), a real-world school dataset is used. It consists of the examination scores over three years of 15,362 students from 139 secondary schools, which are treated as tasks for examination score prediction based on 27 input features such as the year of the examination, school-specific features, and student-specific features. The dataset is publicly available and a detailed description can be found in the original paper [21]. was set to for miADMM.

Metrics. In this experiment, five metrics were utilized to evaluate model performance. Mean Squared Error (MSE) measures the average of the squared differences between observations and estimates. Different from MSE, Mean Squared Logarithmic Error (MSLE) measures the squared difference between the logarithms of observations and estimates. Mean Absolute Error (MAE) measures the average absolute difference between observations and estimates. The lower these three metrics are, the better the regression model. Explained Variance (EV) measures the proportion of the variance of the observations that is explained by the model. The coefficient of determination, or R2 score, is the proportion of the variance in the dependent variable that is predictable from the independent variables. The higher EV and R2 are, the better the regression model.
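All five metrics are available in scikit-learn; the snippet below shows one way to compute them for a vector of observations and predictions (illustrative only, not the experiment code):

```python
from sklearn.metrics import (mean_squared_error, mean_squared_log_error,
                             mean_absolute_error, explained_variance_score,
                             r2_score)

def regression_report(y_true, y_pred):
    return {
        "MSE":  mean_squared_error(y_true, y_pred),
        "MSLE": mean_squared_log_error(y_true, y_pred),  # requires nonnegative values
        "MAE":  mean_absolute_error(y_true, y_pred),
        "EV":   explained_variance_score(y_true, y_pred),
        "R2":   r2_score(y_true, y_pred),
    }
```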

Baselines. In order to validate the effectiveness of miADMM, five benchmark multi-task learning models serve as comparison methods. The loss functions were set to least-squares errors. All parameters were set based on 5-fold cross-validation on the training set.
1. Multi-task learning with Joint Feature Selection (JFS) [2, 41]. JFS is one of the most commonly used strategies in multi-task learning. It captures the relatedness of multiple tasks by constraining the weight matrix so that all tasks share a common set of features.
2. Clustered Multi-Task Learning (CMTL) [40, 41]. CMTL assumes that multiple tasks are clustered into several groups and that tasks in the same group are similar to each other.
3. Multi-task Lasso (mtLasso) [41]. mtLasso extends the classic Lasso model to the multi-task learning setting.
4. A convex relaxation of Alternating Structure Optimization (cASO) [41, 1]. cASO decomposes each task into two components: a task-specific feature mapping and a task-shared feature mapping.
5. Robust Multi-Task Learning (RMTL) [8, 41]. RMTL aims to detect irrelevant tasks (outliers) among multiple tasks. It decomposes the model into two parts: a low-rank structure to capture task relatedness and a group-sparse structure to detect outliers.

Performance. As discussed in Section 4.1, the convergence of our miADMM is guaranteed by our theoretical framework. To verify this, Figures 2(a) and 2(b) illustrate the dual residuals and objective values across iterations, which clearly demonstrates the convergence of the miADMM on this nonconvex problem. The performance of examination score prediction on this dataset is then reported in Table 1. It shows that the weakly-constrained multi-task learning model optimized by miADMM achieves the best performance on all the metrics compared with the other five methods. This is because our method only enforces that the signs of the feature weights are the same across different tasks, while the comparison methods typically impose overly aggressive assumptions on the similarity among tasks. For example, CMTL enforces that correlated tasks have similar feature weights through a squared regularization on the difference between their feature weights. JFS, mtLasso, and RMTL also tend to enforce similar feature weights across tasks through norm regularization. Because their enforcement is weaker than CMTL's, they obtain better performance. Finally, cASO achieves relatively weak performance because it optimizes a convex approximation of a nonconvex problem, and thus its solutions may be distant from the optima of the original problem.
Scalability. To investigate the scalability of the miADMM compared with all baselines in Experiment II, we measured their training time on the school dataset as the number of features varies. The training time was averaged over 20 runs.
Figure 3 shows the training time of all methods when the number of features ranges from 10 to 28. The training time of all methods increased linearly with the number of features. cASO was the most efficient of all methods, while the miADMM ranked second. mtLasso, JFS, and RMTL also trained a model within 5 seconds on average. CMTL was the most time-consuming, requiring more than 10 seconds for training.

Figure 2: Convergence curves on Experiments II and III.
Method MSE MSLE MAE EV R2
JFS 114.1583 0.4457 8.4560 0.2945 0.2945
CMTL 115.5530 0.4517 8.5067 0.2859 0.2859
mtLasso 115.2800 0.4522 8.4874 0.2876 0.2876
cASO 157.9920 0.5235 9.4062 0.1472 0.1472
RMTL 114.1846 0.4478 8.4513 0.2944 0.2943
miADMM 113.6600 0.4457 8.4168 0.2976 0.2976
Table 1: Performance in Experiment II: miADMM outperformed the other methods in all the metrics.
Figure 3: The training time of all methods in Experiment II: the training time of all methods increased linearly with the number of features.
BR CL CO EC EL MX PY UY VE
LogReg 0.686 0.677 0.644 0.599 0.618 0.661 0.616 0.628 0.667
LASSO 0.685 0.677 0.648 0.603 0.636 0.665 0.615 0.666 0.669
MTL 0.722 0.669 0.810 0.617 0.772 0.795 0.600 0.811 0.771
MREF 0.714 0.563 0.515 0.784 0.612 0.693 0.658 0.681 0.588
DHML 0.845 0.683 0.846 0.839 0.780 0.793 0.737 0.835 0.835
miADMM 0.847 0.691 0.851 0.838 0.774 0.800 0.736 0.836 0.859
Table 2: Event forecasting performance in AUC in each of the 9 datasets
BR CL CO EC EL MX PY UY VE
LogReg 30,193 2,981 8,060 312 551 17,712 7,297 748 5,563
LASSO 1,535 242 780 295 261 2,043 527 336 1,008
MTL 233 35 108 17 17 853 40 20 49
MREF 25,889 6,521 14,714 4,332 4,669 31,349 9,495 5,305 5,769
DHML 332 852 87 46 33 175 242 82 179
miADMM 20 12 17 7 3 30 6 4 22
Table 3: Comparison of running time (in seconds) on 9 datasets in Experiment III: the miADMM was the most efficient method.

5.3 Experiment III: Event Forecasting with Multi-lingual Indicators

Datasets. To evaluate the performance of our miADMM on the application in Section 4.2, extensive experiments on nine real-world datasets have been performed. The data were obtained by randomly sampling 10% (by volume) of the Twitter data from Jan 2013 to Dec 2014. The data from the first and second years are used as the training and test sets, respectively. For the topic of interest (i.e., social unrest), 1,806 keywords in the three major languages in Latin America, namely English, Spanish, and Portuguese, were provided by the paper [39]. Their translation relationships have also been labeled as semantic links among them, such as “protest” in English, “protesta” in Spanish, and “protesto” in Portuguese. The event forecasting results were validated against a labeled event set, known as the gold standard report (GSR), which is publicly available [26].

Metric and Baselines. The metric used to evaluate performance is the Area Under the Receiver Operating Characteristic Curve (AUC). Five comparison methods were used, including the state-of-the-art Multi-task Learning (MTL), Multi-resolution Event Forecasting (MREF), and Distant-supervision of Heterogeneous Multitask Learning (DHML), as well as the classic methods logistic regression (LogReg) and LASSO. was set to 1 for miADMM. All the hyper-parameters were tuned by 5-fold cross-validation.
Performance. As shown in Table 2, miADMM generally performs the best among all the methods, with DHML the second-best performer. Both of them typically outperform the others by at least 5%-10%. This is because both of them leverage the multilingual correlation among the features to boost the model's generalizability. Thanks to the multi-task learning framework, MTL and MREF obtained competitive performance with AUC typically over 0.7, outperforming simple methods like LogReg and LASSO by 5% on average.
Efficiency. In Experiment III, we also compared the training time of miADMM with that of all baselines on the 9 datasets. The training time was averaged over 5 runs.
The training times are shown in Table 3. Overall, miADMM was the most efficient of all the methods on every dataset, consuming no more than 30 seconds on any of them. MTL was ranked second, but it spent hundreds of seconds on some datasets, such as BR and MX. As the most time-consuming baselines, LogReg and MREF took thousands of seconds or more to train a model.

6 Conclusions

We propose a novel generic framework for multi-convex inequality-constrained optimization with multiple coupled variables, which is a new variant of ADMM named miADMM. miADMM not only inherits the merits of general ADMM but also enjoys theoretical guarantees on its convergence under mild conditions. In addition, several machine learning applications of recent interest are presented as special cases of our proposed miADMM. Extensive experiments have been conducted on a synthetic dataset and ten real-world datasets, demonstrating the effectiveness, scalability, and convergence properties of our proposed miADMM. In the future, we may explore milder conditions than Lipschitz subdifferentiability, because some common nonsmooth functions (for example, the ℓ1 norm) do not satisfy Lipschitz subdifferentiability.

References

  • [1] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
  • [2] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Advances in neural information processing systems, pages 41–48, 2007.
  • [3] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
  • [4] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
  • [5] Peter J Carrington, John Scott, and Stanley Wasserman. Models and methods in social network analysis, volume 28. Cambridge university press, 2005.
  • [6] Fabio Cavalletti and Tapio Rajala. Tangent lines and lipschitz differentiability spaces. Analysis and Geometry in Metric Spaces, 4(1), 2016.
  • [7] Caihua Chen, Min Li, Xin Liu, and Yinyu Ye. Extended ADMM and BCD for nonseparable convex minimization models with quadratic coupling terms: convergence analysis and insights. Mathematical Programming, pages 1–41, 2015.
  • [8] Jianhui Chen, Jiayu Zhou, and Jieping Ye. Integrating low-rank and group-sparse structures for robust multi-task learning. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 42–50. ACM, 2011.
  • [9] Wei-Yu Chiu. Method of reduction of variables for bilinear matrix inequality problems in system and control designs. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(7):1241–1256, 2017.
  • [10] Ying Cui, Xudong Li, Defeng Sun, and Kim-Chuan Toh. On the convergence properties of a majorized admm for linearly constrained convex optimization problems with coupled objective functions. arXiv preprint arXiv:1502.00098, 2015.
  • [11] Wei Deng, Ming-Jun Lai, Zhimin Peng, and Wotao Yin. Parallel multi-block ADMM with o(1/k) convergence. Journal of Scientific Computing, 71(2):712–736, 2017.
  • [12] Xiang Gao and Shu-Zhong Zhang. First-order algorithms for convex optimization with nonseparable objective and coupled constraints. Journal of the Operations Research Society of China, 5(2):131–159, 2017.
  • [13] Tom Goldstein, Brendan O’Donoghue, Simon Setzer, and Richard Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588–1623, 2014.
  • [14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [15] Arash Hassibi, Jonathan How, and Stephen Boyd. A path-following method for solving BMI problems in control. In Proceedings of the 1999 American Control Conference, volume 2, pages 1385–1389. IEEE, 1999.
  • [16] Bingsheng He and Xiaoming Yuan. On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
  • [17] Mingyi Hong, Tsung-Hui Chang, Xiangfeng Wang, Meisam Razaviyayn, Shiqian Ma, and Zhi-Quan Luo. A block successive upper bound minimization method of multipliers for linearly constrained convex optimization. arXiv preprint arXiv:1401.7079, 2014.
  • [18] Mojtaba Kadkhodaie, Konstantina Christakopoulou, Maziar Sanjabi, and Arindam Banerjee. Accelerated alternating direction method of multipliers. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 497–506. ACM, 2015.
  • [19] Gert R Lanckriet and Bharath K Sriperumbudur. On the convergence of the concave-convex procedure. In Advances in neural information processing systems, pages 1759–1767, 2009.
  • [20] Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pages 556–562, 2001.
  • [21] Ya Li, Xinmei Tian, Tongliang Liu, and Dacheng Tao. Multi-task model and feature joint learning. In IJCAI, pages 3643–3649, 2015.
  • [22] Tian-Yi Lin, Shi-Qian Ma, and Shu-Zhong Zhang. On the sublinear convergence rate of multi-block ADMM. Journal of the Operations Research Society of China, 3(3):251–274, 2015.
  • [23] Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, and Francis R Bach. Supervised dictionary learning. In Advances in neural information processing systems, pages 1033–1040, 2009.
  • [24] Sanjay Mehrotra. On the implementation of a primal-dual interior point method. SIAM Journal on optimization, 2(4):575–601, 1992.
  • [25] Hua Ouyang, Niao He, Long Tran, and Alexander G Gray. Stochastic alternating direction method of multipliers. ICML (1), 28:80–88, 2013.
  • [26] Terry Reed. Open source indicators project: https://doi.org/10.7910/DVN/EN8FUW, 2017.
  • [27] R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
  • [28] Xinyue Shen, Steven Diamond, Madeleine Udell, Yuantao Gu, and Stephen Boyd. Disciplined multi-convex programming. In Control And Decision Conference (CCDC), 2017 29th Chinese, pages 895–900. IEEE, 2017.
  • [29] Paul Tseng. Dual coordinate ascent methods for non-strictly convex minimization. Mathematical programming, 59(1-3):231–247, 1993.
  • [30] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications, 109(3):475–494, 2001.
  • [31] Jeremy G VanAntwerp and Richard D Braatz. A tutorial on linear and bilinear matrix inequalities. Journal of process control, 10(4):363–385, 2000.
  • [32] Huahua Wang and Arindam Banerjee. Bregman alternating direction method of multipliers. In Advances in Neural Information Processing Systems, pages 2816–2824, 2014.
  • [33] Lu Wang, Yan Li, Jiayu Zhou, Dongxiao Zhu, and Jieping Ye. Multi-task survival analysis. In 2017 IEEE International Conference on Data Mining (ICDM), pages 485–494. IEEE, 2017.
  • [34] Yu Wang, Wotao Yin, and Jinshan Zeng. Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing, pages 1–35, 2015.
  • [35] Jack Warga. Minimizing certain convex functions. Journal of the Society for Industrial and Applied Mathematics, 11(3):588–593, 1963.
  • [36] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on imaging sciences, 6(3):1758–1789, 2013.
  • [37] Zheng Xu, Soham De, Mario Figueiredo, Christoph Studer, and Tom Goldstein. An empirical study of ADMM for nonconvex problems. arXiv preprint arXiv:1612.03349, 2016.
  • [38] Yu Zhang and Qiang Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114, 2017.
  • [39] Liang Zhao, Junxiang Wang, and Xiaojie Guo. Distant-supervision of heterogeneous multitask learning for social event forecasting with multilingual indicators. 2018.
  • [40] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Clustered multi-task learning via alternating structure optimization. In Advances in neural information processing systems, pages 702–710, 2011.
  • [41] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Malsar: Multi-task learning via structural regularization. Arizona State University, 21, 2011.

Appendix

Appendix A Preliminary Lemmas for Proving Three Properties

In this section, we give preliminary lemmas which are useful for the proofs of the three properties. While Lemmas 2 and 3 depend on the optimality conditions of the subproblems, Lemmas 1 and 4 require Assumption 2.

Lemma 1.

It holds that ,

Proof.

Because is Lipschitz differentiable by Assumption 2, so is . Therefore, this lemma is proven in exactly the same way as Lemma 2.1 in [3]. ∎

Lemma 2.

It holds that for all .

Proof.

The optimality condition of gives rise to

Because , we have . ∎

Lemma 3.

It holds that for ,

(13)
Proof.

where the second equality follows from the cosine rule: with , and .
Because , we have the following result according to the definition of subgradient