Penalty Method for Inversion-Free Deep Bilevel Optimization

Penalty Method for Inversion-Free Deep Bilevel Optimization


Bilevel optimization problems are at the center of several important machine learning problems such as hyperparameter tuning, data denoising, meta- and few-shot learning, data poisoning. Different from simultaneous or multi-objective optimization, bilevel optimization requires computing the inverse of the Hessian of the lower-level cost function to obtain the exact descent direction for the upper-level cost. In this paper, we propose a new method for solving deep bilevel optimization problems using the penalty function which avoids computing the inverse. We prove convergence of our method under mild conditions and show that it computes the exact hypergradient asymptotically. Small space and time complexity of our method enables us to solve large-scale bilevel problems involving deep neural networks with several million parameters. We present results of our method for data denoising on MNIST/CIFAR10/SVHN datasets, for few-shot learning on Omniglot/Mini-Imagenet datasets and for training-data poisoning on MNIST/Imagenet datasets. In all experiments, our method outperforms or is comparable to previously proposed methods both in terms of accuracy and run-time.


1 Introduction

Solving a bilevel optimization problem is crucial for the fields of study which involve a competition between two parties or two objectives. Particularly, a bilevel problem arises, if one party makes its choice first affecting the optimal choice for the second party, also known as the Stackelberg model, dating back to 1930’s von Stackelberg (2010). The general form of a bilevel optimization problem is


A bilevel problem with constraints is of the form , where the lower-level constraint set can depend on . However, we focus on unconstrained problems in this paper. The ‘upper-level’ problem is a usual minimization problem except that is constrained to be the solution to the ‘lower-level’ problem which is dependent on (see Bard (2013) for a review of bilevel optimization). Bilevel optimizations also appears in many important machine learning problems. For example, gradient-based hyperparameter tuning Domke (2012); Maclaurin et al. (2015); Luketina et al. (2016); Pedregosa (2016); Franceschi et al. (2017, 2018), data denoising by importance learning Liu and Tao (2016); Yu et al. (2017); Ren et al. (2018), few-shot learning Ravi and Larochelle (2017); Santoro et al. (2016); Vinyals et al. (2016); Franceschi et al. (2017); Mishra et al. (2017); Snell et al. (2017); Franceschi et al. (2018); Rajeswaran et al. (2019), and training-data poisoning Mei and Zhu (2015); Muñoz-González et al. (2017); Koh and Liang (2017); Shafahi et al. (2018). We explain each of these problems and their bilevel formulations below.

Gradient-based hyperparameter tuning. Finding hyperparameters is an indispensable step in any machine learning problem. Grid search is a popular way of finding the optimal hyperparameters, if the domain of the hyperparameters is a predetermined discrete set or a range. However, when losses are differentiable functions of the hyperparameter(s), we can find optimal hyperparameter values by solving a continuous bilevel optimization. Let and denote hyperparameter(s) and parameter(s) for a class of learning algorithms and be the hypothesis. Then is the validation loss, and is the training loss, defined similarly. The best hyperparameter(s) is then the solution to the following problem


Thus, we find the best model parameters for each choice of the hyperparameter , and select that value for the hyperparameter which incurs the smallest validation loss.

Data denoising by importance learning. A common assumption of learning is that the training examples are i.i.d. samples from the same distribution as the test data. However, if training and testing distributions are not identical or if the training examples have been corrupted by noise or modified by adversaries, this assumption is violated. In such cases, re-weighting the importance of each training example, before training, can help reduce the discrepancy between the two distributions. For example, one can up-weight the importance of the examples from the same distribution and down-weight the importance of the rest. Finding the correct weight for each training example can be formulated as a bilevel optimization. Consider to be the vector of non-negative importance values for each training example where is the number of training examples, and are the parameter(s) of a classifier . Then the weighted training error is . Assuming we have a small number of examples, from the same distribution as the test examples (validation examples) the importance learning problem is:


Hence, the importance of each training example (vector ) is selected such that the minimizer of the weighted training loss in the lower level also minimizes the validation loss in the upper-level. The final importance values can help to identify good points from the noisy training set and the classifier obtained after solving this optimization will have superior performance compared to the model trained on the noisy data.

Meta-learning. A standard learning problem involves finding the best model from the class of hypotheses for a given task (i.e., data distribution). In contrast, meta-learning is a problem of learning a prior on the hypothesis classes (a.k.a. inductive bias) for a given set of tasks. Few-shot learning is an example of meta-learning, where a learner is trained on several related tasks, during the meta-training phase, so that it can generalize well to unseen (but related) tasks with just few examples, during the meta-testing phase. An effective approach to the few-shot learning problem is to learn a common representation for various tasks and train task specific classifiers on top of this representation. Let be the map that takes raw features to a common representation for all tasks and be the classifier for the -th task, where is the total number of tasks for training. The goal of few-shot learning is to learn both the representation map parameterized by and the set of classifiers parameterized by . Let be the validation loss of task and be the training loss defined similarly, then the bilevel problem for few-shot learning is


For evaluation of the learned representation, during the meta-test phase, the representation is kept fixed and only the classifiers for the new tasks are trained i.e. where is the total number of tasks for testing.

Training-data poisoning. Recently, machine learning models were shown to be vulnerable to train-time attacks. Different from the test time attacks, here adversary modifies the training data so that the model learned from altered training data performs poorly/differently as compared to the model learned from clean data. The most popular train-time attack method augments the original training data with one or more ‘poisoned’ examples i.e., , assigning them arbitrary labels, to create the poisoned dataset with being the loss on the poisoned training data. The problem of finding poisoning points, that when added to the clean training data hurt the performance of the model trained on it can be formulated as


where the minus sign in the upper-level is used to maximize the validation loss. This is the formulation for untargeted attacks. For targeted attacks, the upper-level must minimize the validation loss with respect to the intended target labels of the attacker. Another variant of the poisoning attack, only influences the prediction of a single predetermined example. The upper-level cost for such an attack is the loss over this particular example (see Eq. (10) in the Appendix E.4.1).

Challenges of deep bilevel optimization. General bilevel optimization problems cannot be solved using simultaneous optimization of the upper- and lower-level costs. Moreover, exact solution to a bilevel optimization is known to be NP-hard even for linear cost functions Bard (1991). To add to this, recent deep learning models, with millions of variables, make it infeasible to use sophisticated methods beyond the first-order methods. But, for bilevel problems, even the first-order methods are difficult to apply since they require computation of the inverse Hessian–gradient product to get the exact hypergradient (see Sec. 2.1). Since direct inversion of the Hessian is impractical even for moderate-sized problems, many approaches have been proposed to approximate the exact hypergradient including forward/reverse-mode differentiation Maclaurin et al. (2015); Franceschi et al. (2017); Shaban et al. (2018) and approximate inversion by solving a linear system of equations Domke (2012); Pedregosa (2016); Rajeswaran et al. (2019). But, still there is a big room for improvement in these existing approaches in terms of their space and time complexities and their practical performance.

Contributions. We propose a penalty function-based algorithm (Alg. 1) for solving large-scale unconstrained bilevel optimization problems. We prove convergence of the method under mild conditions (Theorem 2) and show that it computes the exact hypergradient asymptotically (Lemma 3). We present complexity analysis of the algorithm showing that it has linear time and constant space complexity (Table 1), making our method superior to forward-mode and reverse-mode differentiation and similar to the approximate inversion based method. Small space and time complexity of our approach enables us to solve large-scale bilevel problems involving deep neural networks (Table 7 in Appendix E). We apply our method (Penalty) to solve various machine learning problems like data denoising, few-shot learning, and training-data poisoning. Penalty performs competitively to the state-of-the-art methods on simpler problems (with convex lower-level cost) and significantly outperforms other methods on complex problems (with non-convex lower-level cost), both in terms of accuracy (Sec. 3) and run-time (Table 2 and Fig. 4 in Appendix D) showing that it is an effective solver for various bilevel problems.

The remainder of the paper is organized as follows. We present and analyze the main algorithm in Sec. 2, perform comprehensive experiments in Sec. 3, and conclude the paper in Sec. 4. Due to space limitation, proofs, experimental settings and additional results are presented in the appendix. All codes are uploaded to Github.

2 Inversion-Free Penalty Method

Throughout the paper we have assumed that upper- and lower-level costs and are twice continuously differentiable in both and . We use and to denote gradient vectors, for the matrix , and for the Hessian matrix . Additionally, following previous works we have assumed that the lower-level solution is unique for all and that is invertible everywhere. Later in this section we discuss relaxation for some of these assumptions.

2.1 Hypergradient for bilevel optimization

Assuming we can express the solution to the lower-level problem explicitly, we can write the bilevel problem as an equivalent single-level problem . We can use gradient-based approach on this single-level problem and compute the total derivative , called the hypergradient, in previous approaches. Using the chain rule, the total derivative is


In reality, can be written explicitly only for trivial problems, but we can still compute using the implicit function theorem. Since at , we get and consequently . Thus, the hypergradient from Eq. (6) is as follows


Existing approaches Domke (2012); Maclaurin et al. (2015); Pedregosa (2016); Franceschi et al. (2017); Shaban et al. (2018); Rajeswaran et al. (2019) can be viewed as implicit methods of approximating the hypergradient, with distinct trade-offs in efficiency and complexity.

2.2 Penalty function approach

A bilevel problem can be considered as a constrained optimization problem since the lower-level optimality is a constraint, in addition to any other constraint in the upper- and the lower-level problems. In this work, we focus on unconstrained bilevel problems i.e. those without any additional constraints on the upper- and lower-level problems. For solving such bilevel problems, the lower-level problem is often replaced by its necessary condition for optimality, resulting in the following problem:


For general bilevel problems, Eq. (8) and Eq. (1) are not the same Dempe and Dutta (2012). But, with lower-level cost being convex in for each and the assumption that the lower-level solution is unique for each , Eq. (8) is equivalent to Eq. (1).

We now describe the penalty function approach for solving bilevel optimization. The penalty function method is a well-known approach for solving constrained optimization problems (see Bertsekas (1997) for a review) and has been previously applied for solving bilevel problems. However, it was analyzed under strict assumptions and only high-level descriptions of the algorithm were presented before Aiyoshi and Shimizu (1984); Ishizuka and Aiyoshi (1992). The penalty function , optimizes the original cost plus a quadratic penalty term (which penalizes the violation of the necessary conditions for lower-level optimality) in Eq. (8). Let be the minimum of the penalty function for a given :


Then the following convergence result is known.

Theorem 1 (Theorem 8.3.1 of Bard (2013)).

Assume and are convex in for any fixed . Let be any positive () and divergent () sequence. If is the corresponding sequence of optimal solutions of the penalty function Eq. (9), then the sequence has limit points any one of which is a solution of Eq. (1).

Even though this is a strong result, its not very practical, since the minimum needs to be computed exactly for each , and moreover and need to be convex in for any . In our approach, we allow -optimal solutions of Eq. (9) and show convergence to a KKT point of Eq. (8) without requiring convexity.

Theorem 2.

Suppose is a positive () and convergent () sequence, and is a positive (), non-decreasing (), and divergent () sequence. Let be the sequence of approximate solutions to Eq. (9) with tolerance for all . Then any limit point of satisfies the KKT conditions of the problem in Eq. (8).

Alg. 1 describes our method in which we minimize the penalty function in Eq. (9), alternatively over and . It is essential to note that our method solves a single-level penalty function, Eq. (9), and does not need any intermediate step to compute the approximate hypergradient, unlike other methods which first approximate the solution to the lower-level problem of Eq. (1) and then use an intermediate step (solving a linear system or using reverse/forward mode differentiation) to compute the approximate hypergradient. Lemma 3 (below) shows when the approximate gradient direction , computed from Alg. 1 becomes the exact hypergradient Eq. (7) for bilevel problems.

Lemma 3.

Given , let be from Eq. (9). Then, .

Thus if we find the minimizer of the penalty function for given and , Alg. 1 computes the exact hypergradient Eq. (7) at . Furthermore, under the conditions of Theorem 1, as and we get the exact hypergradient asymptotically.

Comparison with other methods. Many methods have been proposed previously to solve bilevel optimization problems appearing in machine learning, including forward/reverse-mode differentiation (FMD/RMD) Maclaurin et al. (2015); Franceschi et al. (2017); Shaban et al. (2018) and approximate hypergradient computation by solving a linear system (ApproxGrad) Domke (2012); Pedregosa (2016); Rajeswaran et al. (2019). For completeness, we have described these methods briefly in Appendix B. We show the trade-offs of these methods for computing the hypergradient in Table 1. We see that as (total number of -updates per one hypergradient computation) increases, FMD and RMD become impractical due to time complexity and space complexity, respectively, whereas ApproxGrad and Penalty, have the same linear time complexity and constant space complexity, which is a big advantage over FMD and RMD. However, complexity analysis does not show the quality of hypergradient approximation of each method. In Sec. 3.1 we show empirically that the proposed penalty method has better convergence properties than all the other methods with synthetic examples and since ApproxGrad and Penalty have the same complexities we compare them on real data and show that Penalty is superior or comparable to ApproxGrad in performance and time.

Initialize randomly

  for  do
     while  do
        for  do
           -update:  (from Eq. (9))
        end for
        -update:  (from Eq. (9))
     end while
  end for
Algorithm 1 Penalty method for bilevel optimization
Method -update Intermediate updates Time Space
Penalty Not required
Table 1: Complexity analysis of various bilevel methods (Appendix B). is the size of , is the size of , and is the number of -updates per one hypergradient computation. , and are variables of size , , and used to compute the hypergradient. We use gradient descent as the process for FMD and RMD. Hessian-vector product has complexity Pearlmutter (1994).

Improvements. A caveat to these theoretical guarantees is that some of the assumptions made for analysis may not be satisfied in practice. Here we discuss simple techniques to address these problems and improve Alg. 1 further. The first problem is related to non-convexity of the lower-level cost , creating the problem that the local minimum of can be either a minimum or a maximum of . To address this we modify the -update for Eq. (9) by adding a ‘regularization’ term to the cost Thus, the -update becomes . This only affects the optimization in the beginning; as the final solution remains unaffected with or without regularization. The second problem is that the tolerance may not be satisfied in a limited time and the optimization may terminate before becomes large enough. A cure to this is the method of multipliers and augmented Lagrangian Bertsekas (1976) which allows the penalty method to find a solution with a finite . Thus we add the term to the penalty function (Eq. (9)) to get and use the method of multiplier to update as . In summary, we use the following update rules in the paper.

These changes are helpful in theory but were only moderately better than original versions empirically (Appendix C).

3 Experiments

In this section, we evaluate the performance of the proposed penalty method (Penalty) on various machine learning problems discussed in the introduction. We compare Penalty against both bilevel and non-bilevel solutions to these problems previously reported in the literature.

3.1 Synthetic examples

Here we compare Penalty, gradient descent (GD), reverse-mode differentiation (RMD), and approximate hypergradient method (ApproxGrad) on synthetic examples. We omit the comparison with forward-mode differentiation (FMD) because of its impractical time complexity for larger problems. GD refers to the alternating minimization: , . For RMD, we used the version with gradient descent as the lower-level process. For all ApproxGrad experiments, we used Adam for the updates. We found that Adam solves the linear system just as effectively as conjugate gradient descent (CG) given enough number of iterations and can take advantage of a GPU making it feasible to run more iterations in the same time as taken by CG. Using simple quadratic surfaces for and , we compare all the algorithms by observing their convergence as a function of the number of upper-level iterations by varying the number of lower-level updates (), for computing the hypergradient update. We measure the convergence of these methods using the Euclidean distance of the current iterate from the closest optimal solution . Since the synthetic examples are not learning problems, we can only measure the distance of the iterates to an optimal solution (). Fig. 1 shows the performance of two 10-dimensional examples described in the caption (see Appendix E.1). As one would expect, increasing the number of -updates makes all the algorithms better since doing more lower-level iterations makes the hypergradient estimation more accurate (Eq. (7)) but it also increases the run time of the methods. However, even for these examples, only Penalty and ApproxGrad converge to the optimal solution and GD and RMD converge to non-solution points (regardless of ). Moreover, from Fig. 1(b), we see that Penalty converges even with =1 while ApproxGrad requires at least =10 to converge, which shows that our method approximates the hypergradient accurately with smaller . This directly translates to smaller run-time for our method as compared to ApproxGrad since the run-time is directly proportional to (see Table. 1).

Figure 1: Convergence of GD, RMD, ApproxGrad, and Penalty for two example bilevel problems after 40000 epochs. The mean curve (blue) is superimposed on 20 independent trials (yellow).
Figure 2: Convergence of GD, RMD, ApproxGrad, and Penalty for two example bilevel problems where is a rank-deficient random matrix after 40000 epochs. The mean curve (blue) is superimposed on 20 independent trials (yellow).

In Fig. 2 we show examples similar to Fig. 1 but with ill-conditioned or singular Hessian for the lower-level problem. The ill-conditioning poses difficulty for the methods since the implicit function theorem requires the invertibility of the Hessian at the solution point. Compared to Fig. 1, Fig. 2 shows that only Penalty converges to the true solution despite the fact that we add regularization in ApproxGrad to improve the ill-conditioning when solving the linear systems by minimization. We ascribe the robustness of Penalty to its simplicity and to the fact that it naturally handles non-uniqueness of the lower-level solution (see Appendix C.3). Additionally, we report the wall clock times for different methods on the four examples tested here in Table 2. We can see that as we increase the number of lower-level iterations all methods get slower but Penalty is faster than both RMD and ApproxGrad. Although, Penalty is slower than GD but as shown in Fig. 1 and Fig. 2, GD does not converge to optima for most of the synthetic examples.

Example 1 GD RMD ApproxGrad Penalty
T=1 7.40.3 15.00.1 17.40.2 17.20.1
T=5 14.30.1 51.40.3 39.32.3 34.30.3
T=10 23.20.1 95.40.2 60.90.3 57.01.0
Example 2 GD RMD ApproxGrad Penalty
T=1 7.70.1 18.50.1 17.20.3 17.40.2
T=5 17.30.1 62.70.1 37.90.1 35.00.2
T=10 22.42.6 115.00.4 64.20.3 52.71.4
Example 3 GD RMD ApproxGrad Penalty
T=1 8.20.2 18.80.1 19.80.1 19.10.1
T=5 17.40.1 72.40.1 47.10.4 38.60.4
T=10 28.70.6 125.09.3 80.60.3 62.70.1
Example 4 GD RMD ApproxGrad Penalty
T=1 7.90.1 19.50.1 20.40.0 19.60.1
T=5 16.90.2 72.80.5 48.40.6 40.20.1
T=10 28.30.2 138.00.2 81.21.6 58.04.3
Table 2: Mean wall-clock time (sec) for 10,000 upper-level iterations for synthetic experiments. Boldface is the smallest among RMD, ApproxGrad, and Penalty. (Mean s.d. of 10 runs)

3.2 Data denoising by importance learning

Now, we evaluate the performance of Penalty for learning a classifier from a dataset with corrupted labels (training data). We pose the problem as an importance learning problem presented in Eq. (3). We evaluate the performance of the classifier learned by Penalty, with 20 lower-level updates, against the following classifiers: Oracle: classifier trained on the portion of training data with clean labels and the validation data, Val-only: classifier trained only on the validation data, Train+Val: classifier trained on the entire training and validation data, ApproxGrad: classifier trained with our implementation of ApproxGrad, with 20 lower-level and 20 linear system updates. We test the performance on MNIST, CIFAR10 and SVHN datasets with validation set sizes of 1000, 10000 and 1000 points respectively. We used convolutional neural networks (architectures described in Appendix E.2) at the lower-level for this task. Table 3 summarizes our results for this problem and shows that Penalty outperforms Val-only, Train+Val and ApproxGrad by significant margins and in fact performs very close to the Oracle classifier (which is the ideal classifier), even for high noise levels. This demonstrates that Penalty is extremely effective in solving bilevel problems involving several million variables (see Table 7 in Appendix) and shows its effectiveness at handling non-convex problems. Along with improvement in terms of accuracy over other bilevel methods like ApproxGrad, Penalty also gives better run-time per upper-level iteration, leading to a decrease in the overall wall-clock time of the experiments (Fig. 4(a) in Appendix D).

We compared Penalty against the recent method of Ren et al. which uses a meta-learning based approach and assigns weights to examples based on their gradient directions. We used the same setting as their uniform flip experiment with 36% label noise on CIFAR dataset and Wide ResNet 28-10 (WRN-28-10) model (see Appendix  E.2). Using =1 for Penalty and 1000 validation points, we get an accuracy of 87.41 0.26 (mean s.d. of 5 trials) higher than 86.92 0.19 reported by them. Due to the enormous size of WRN-28-10, we do not use larger values of , but expect it to only improve the results based on the Fig. 4(a) in Appendix D. We also compared Penalty against the RMD-based method of Franceschi et al., using the same setting as their Sec. 5.1, on a subset of MNIST data corrupted with 50% label noise and softmax regression as the model (see Appendix E.2). The accuracy of the classifier trained on a subset of the data with points having importance values greater than 0.9 (as computed by Penalty with ) along with the validation set is 90.77% better than 90.09% reported by the RMD-based method.

Dataset Bilevel Approaches
(Noise%) Oracle Val-Only Train+Val ApproxGrad Penalty
MNIST (25) 99.30.1 90.50.3 83.91.3 98.110.08 98.890.04
MNIST (50) 99.30.1 90.50.3 60.82.5 97.270.15 97.510.07
CIFAR10 (25) 82.91.1 70.31.8 79.10.8 71.590.87 79.671.01
CIFAR10 (50) 80.71.2 70.31.8 72.21.8 68.080.83 79.031.19
SVHN (25) 91.10.5 70.61.5 71.61.4 80.051.37 88.120.16
SVHN (50) 89.80.6 70.61.5 47.91.3 74.181.05 85.210.34
Table 3: Test accuracy (%) of the classifier learnt from datasets with noisy labels using importance learning. (Mean s.d. of 5 runs)
Learning a common representation
MAML iMAML Proto Net Rel Net SNAIL RMD ApproxGrad Penalty
5-way 1-shot 98.7 99.500.26 98.8 99.60.2 99.1 98.6 97.750.06 97.830.35
5-way 5-shot 99.9 99.74 0.11 99.7 99.80.1 99.8 99.5 99.510.05 99.450.05
20-way 1-shot 95.8 96.18 0.36 96.0 97.60.2 97.6 95.5 94.690.22 94.060.17
20-way 5-shot 98.9 99.14 0.10 98.9 99.10.1 99.4 98.4 98.460.08 98.470.08
5-way 1-shot 48.701.75 49.30 1.88 49.420.78 50.440.82 55.710.99 50.540.85 43.741.75 53.170.96
5-way 5-shot 63.110.92 - 68.200.66 65.320.82 68.880.92 64.530.68 65.560.67 67.740.71
Table 4: Few-shot classification accuracy (%) on Omniglot and Mini-ImageNet. We report means.d. for Omniglot and 95% confidence intervals for Mini-Imagenet over five trials. Results for learning a common representation using Penalty, ApproxGrad and RMDFranceschi et al. (2018) are averaged over 600 randomly-sampled tasks from the meta-test set. Results for previous methods using similar models are also reported (MAMLFinn et al. (2017), iMAMLRajeswaran et al. (2019), Prototypical Networks (Proto Net)Snell et al. (2017), Relation Networks (Rel Net)Sung et al. (2018), SNAILMishra et al. (2017)).

3.3 Few-shot learning

Next, we evaluate the performance of Penalty on the task of learning a common representation for the few-shot learning problem. We use the formulation presented in Eq. (4) and use Omniglot Lake et al. (2015) and Mini-ImageNet Vinyals et al. (2016) datasets for our experiments. Following the protocol proposed by Vinyals et al. (2016) for -way -shot classification, we generate meta-training and meta-testing datasets. Each meta-set is built using images from disjoint classes. For Omniglot, our meta-training set comprises of images from the first 1200 classes and the remaining 423 classes are used in the meta-testing dataset. We also augment the meta-datasets with three different rotations (90, 180 and 270 degrees) of the images as used by Santoro et al. (2016). For the experiments with Mini-Imagenet, we used the split of 64 classes in meta-training, 16 classes in meta-validation and 20 classes in meta-testing as used by Ravi and Larochelle (2017).

Each meta-batch of the meta-training and meta-testing dataset comprises of a number of tasks which is called the meta-batch-size. Each task in the meta-batch consists of a training set with images and a testing set consists of 15 images from classes. We train Penalty using a meta-batch-size of 30 for 5 way and 15 for 20 way classification for Omniglot and with a meta-batch-size of 2 for Mini-ImageNet experiments. The training sets of the meta-train-batch are used to train the lower-level problem and the test sets are used as validation sets for the upper-level problem in Eq. (4). The final accuracy is reported using the meta-test-set, for which we fix the common representation learnt during meta-training. We then train the classifiers at the lower-level for 100 steps using the training sets from the meta-test-batch and evaluate the performance of each task on the associated test set from the meta-test-batch. Average performance of Penalty and ApproxGrad over 600 tasks is reported in Table 4. Penalty outperforms other bilevel methods namely the ApproxGrad (trained with 20 lower-level iterations and 20 updates for the linear system) and the RMD-based method Franceschi et al. (2018) on Mini-Imagenet and is comparable to them on the Omniglot. We also show the trade-off between using higher and time for ApproxGrad and Penalty (Fig. 4(b) in Appendix D)) showing that Penalty achieves the same accuracy as ApproxGrad in almost half the run-time. In comparison to non-bilevel approaches which used models of the similar size as ours, Penalty is comparable to most approaches and is only slightly worse than Mishra et al. (2017) which makes use of temporal convolutions and soft attention. We used four-layer convolutional neural networks with 64 filters per layer and a residual network with four residual blocks consisting of 64, 96, 128 and 256 filters followed by two convolutional layers for learning the common task representation (upper-level) for Omniglot and Mini-ImageNet, respectively and use logistic regression to learn task specific classifiers (lower-level).

Untargeted Attacks (lower accuracy is better)
Poisoned points Label flipping RMD ApproxGrad Penalty
1% 86.710.32 85 82.090.84 83.290.43
2% 86.23 0.98 83 77.540.57 78.140.53
3% 85.170.96 82 74.411.14 75.141.09
4% 84.930.55 81 71.880.40 72.700.46
5% 84.391.06 80 68.690.86 69.481.93
6% 84.640.69 79 66.910.89 67.591.17
Targeted Attacks (higher accuracy is better)
Poisoned points Label flipping RMD ApproxGrad Penalty
1% 7.761.07 10 18.841.90 17.403.00
2% 12.082.13 15 39.643.72 41.644.43
3% 18.361.23 25 52.762.69 51.402.72
4% 24.412.05 35 60.011.61 61.161.34
5% 30.414.24 - 65.614.01 65.522.85
6% 32.883.47 - 71.484.24 70.012.95
Table 5: Test accuracy (%) of untargeted poisoning attack (TOP) and success rate (%) of targeted attack (BOTTOM), using MNIST and logistic regression (Mean s.d. of 5 runs). Results for RMD are from Muñoz-González et al. (2017).

3.4 Training-data poisoning

Finally, we evaluate Penalty on the task of generating poisoned training data, such that models trained on this data, perform poorly/differently as compared to the models trained on the clean data. We use the same setting as Sec. 4.2 of Muñoz-González et al. (2017) and test both untargeted and targeted data poisoning on MNIST using data augmentation technique. We assume regularized logistic regression will be the classifier used for training. The poisoned points obtained after solving Eq. (5) by various methods are added to the clean training set and the performance of a new classifier trained on this data is used to report the results in Table 5. For untargeted attack, our aim is to generally lower the performance of the classifier on the clean test set. For this experiment, we select a random subset of 1000 training, 1000 validation and 8000 testing points from MNIST and initialize the poisoning points with random instances from the training set but assign them incorrect random labels. We use these poisoned points along with clean training data to train logistic regression, in the lower-level problem of Eq. (5). For targeted attacks, we aim to misclassify images of eights as threes. For this, we selected a balanced subset (each of the 10 classes are represented equally in the subset) of 1000 training, 4000 validation and 5000 testing points from the MNIST dataset. Then we select images of class 8 from the validation set and label them as 3 and use only these images for the upper-level problem in Eq. 5 with a difference that now we want to minimize the error in the upper level instead of maximizing (we don’t have a negative sign in the upper level of Eq. 5). To evaluate the performance we selected images of 8 from the test set and labeled them as 3 and report the performance on this modified subset of the original test set in targeted attack section of Table 5. For this experiment the poisoned points are initialized with images of classes 3 and 8 from the training set, with flipped labels. This is because images of threes and eights are the only ones involved in the poisoning. We compare the performance of Penalty against the performance reported using RMD in Muñoz-González et al. (2017) and ApproxGrad. For ApproxGrad, we used 20 lower-level and 20 linear system updates to report the results in the Table 5. We see that Penalty significantly outperforms the RMD based method and performs similar to ApproxGrad. However, in terms of wall clock time Penalty has a advantage over ApproxGrad (see Fig. 4(c) in Appendix D). We also compared the methods against a label flipping baseline where we select poisoned points from the validation sets and change their labels (randomly for untargeted attacks and mislabel threes as 8 and eights as 3 for targeted attacks). All bilevel methods are able to beat this baseline showing that solving the bilevel problem generates better poisoning points. Examples of the poisoned points for untargeted and targeted attacks generated by Penalty are shown in Figs. 5 and 6 in Appendix E.4. Additionally, we tested Penalty on the task of generating clean label poisoning attack Koh and Liang (2017); Shafahi et al. (2018) where goal is to learn poisoned points, such that they will be assigned correct labels when visually inspected by an expert, but can cause misclassification of specific target images when the classifier is trained on these poisoned points along with clean data (Appendix E.4.1). We used the dog vs. fish dataset and followed the setting in Sec. 5.2 of Koh and Liang (2017), to achieve 100% attack success with just a single poisoned point per target image, compared to 57% attack success in the Koh and Liang (2017). Shafahi et al. (2018) also report 100% attack success on this task.

4 Conclusion

A wide range of interesting machine learning problems can be expressed as bilevel optimization problems, and new applications are still being discovered. So far, the difficulty of solving bilevel optimization has limited its wide-spread use for solving large-scale problems, specially, involving deep models. In this paper we presented an efficient algorithm based on penalty function to solve bilevel optimization, which is both simple and has theoretical and practical advantages over existing methods. As compared to previous methods we demonstrated competitive performance on problems with convex lower-level costs and significant improvement on problems with non-convex lower-level costs both in terms of accuracy and time, highlighting the practical effectiveness of our penalty-based method. In future works, we plan to tackle other challenges in bilevel optimization such as handling additional constraints in both upper- and lower-levels.


We provide missing proofs in Appendix A, a review of other methods of hypergradient computation in Appendix B, discuss modifications to improve the Alg. 1 in Appendix C, and the experiment details and additional results in Appendix E.

Appendix A Proofs

Theorem 2.

Suppose is a positive () and convergent () sequence, and is a positive (), non-decreasing (), and a divergent () sequence. Let be the sequence of approximate solutions to Eq. (9) with tolerance for all . Then any limit point of satisfies the KKT conditions of the problem Eq. (8).


The proof follows the standard proof for penalty function methods, e.g., Nocedal and Wright (2006). Let refer to the pair, and let be any limit point of the sequence , then there is a subsequence such that . From the tolerance condition

we have

Take the limit with respect to the subsequence on both sides to get

Since is a tall matrix and is invertible by assumption, is full-rank and therefore , which is the primary feasibility condition in Eq. (8). Furthermore, let , then by definition,

We can write

The corresponding limit can be found by taking the limit of the subsequence

Since from the condition , we get

at the limit , which is the stationarity condition of Eq. (8). Together with the feasibility condition , the two KKT conditions of Eq. (8) are satisfied at the limit point. ∎

Lemma 3.

Given , let be of Eq. (9). Then, .


At the minimum the gradient vanishes, that is .
Equivalently, . Then,

where disappears, which is the hypergradient . ∎

That is, if we find the minimum of the penalty function for given and , we get the hypergradient Eq. (7) at . Furthermore, under the conditions of Theorem 1, as (see Lemma 8.3.1 of Bard (2013)), and we get the exact hypergradient asymptotically.

Appendix B Review of other bilevel optimization methods

Several methods have been proposed to solve bilevel optimization problems appearing in machine learning, including forward/reverse-mode differentiation Maclaurin et al. (2015); Franceschi et al. (2017) and approximate gradient Domke (2012); Pedregosa (2016) described briefly here.

Forward-mode (FMD) and Reverse-mode differentiation (RMD). Domke Domke (2012), Maclaurin et al.Maclaurin et al. (2015), Franceschi et al. Franceschi et al. (2017), and Shaban et al. Shaban et al. (2018) studied forward and reverse-mode differentiation to solve the minimization problem where the lower-level variable follows a dynamical system . This setting is more general than that of a bilevel problem. However, a stable dynamical system is one that converges to a steady state and thus, the process can be considered as minimizing an energy or a potential function.

Define and , then the hypergradient Eq. (7) can be computed by

When the lower-level process is one step of gradient descent on a cost function , that is,

we get

is of dimension and is of dimension . The sequences and can be computed in forward or reverse mode. For reverse-mode differentiation, first compute

then compute

Time and space Complexity for computing is since the Jacobian vector product can be computed in time and space. The final hypergradient for RMD is .
Hence the final time complexity for RMD is and space complexity is .

For forward-mode differentiation, simultaneously compute , , and

Time complexity for computing is since can be computed using Hessian vector products each needing and also needs using unit vectors for . The space complexity for each is . The final hypergradient for FMD is

Hence the final time complexity for FMD is and space complexity is .

Approximate hypergradient (ApproxGrad). Since computing the inverse of the Hessian directly is difficult even for moderately-sized neural networks, Domke Domke (2012) proposed to find an approximate solution to by solving the linear system of equations . This can be done by solving

using gradient descent, conjugate gradient descent or any other iterative solver. Note that the minimization requires evaluation of the Hessian-vector product, which can be done in linear time Pearlmutter (1994). Hence the time complexity of the method is and space complexity is since we only need to store single copy of and same as Penalty. The asymptotic convergence with approximate solutions was shown by Pedregosa Pedregosa (2016).

Appendix C Improvements to Algorithm  1

Here we discuss the details of the modifications to Alg. 1 presented in the main text which can be added to improve the performance of the algorithm in practice.

c.1 Improving local convexity by regularization

One of the common assumptions of this and previous works is that is invertible and locally positive definite. Neither invertibility nor positive definiteness hold in general for bilevel problems, involving deep neural networks, and this causes difficulties in the optimization. Note that if is non-convex in , minimizing the penalty term does not neccesarily lower the cost but instead just moves the variable towards a stationary point – which is a known problem even for the Newton’s method. Thus we propose the following modification to the -update:

keeping the -update intact. To see how this affects the optimization, note that -update becomes

After converges to a stationary point, we get
, and after plugging this into -update, we get

that is, the Hessian inverse is replaced by a regularized version to improve the positive definiteness of the Hessian. With a decreasing or constant sequence such that the regularization does not change to solution.

c.2 Convergence with finite

The penalty function method is intuitive and easy to implement, but the sequence is guaranteed to converge to an optimal solution only in the limit with , which may not be achieved in practice in a limited time. It is known that the penalty method can be improved by introducing an additional term into the function, which is called the augmented Lagrangian (penalty) method Bertsekas (1976):

This new term allows convergence to the optimal solution even when is finite. Furthermore, using the update rule , called the method of multipliers, it is known that converges to the true Lagrange multiplier of this problem corresponding to the equality constraints .

c.3 Non-unique lower-level solution

Most existing methods have assumed that the lower-level solution is unique for all . Regularization from the previous section, can improve the ill-conditioning of the Hessian but it does not address the case of multiple disconnected global minima of . With multiple lower-level solutions , there is an ambiguity in defining the upper-level problem. If we assume that is chosen adversarially (or pessimistically), then the upper-level problem should be defined as

If is chosen co-operatively (or optimistically), then the upper-level problem should be defined as

and the results can be quite different between these two cases. Note that the proposed penalty function method is naturally solving the optimistic case, as Alg. 1 is solving the problem of by alternating gradient descent. However, with a gradient-based method, we cannot hope to find all disconnected multiple solutions. In a related problem of min-max optimization, which is a special case of bilevel optimization, an algorithm for handling non-unique solutions was proposed recently Hamm and Noh (2018). This idea of keeping multiple candidate solution may be applicable to bilevel problems too and further analysis of the non-unique lower-level problem is left as future work.

c.4 Modified algorithm

Here we present the modified algorithm which incorporates regularization (Sec. C.1) and augmented Lagrangian (Sec. C.2) as discussed previously. The augmented Lagrangian term applies to both - and -update, but the regularization term applies to only the -update as its purpose is to improve the ill-conditioning of during -update. The modified penalized functions for -update and for -update are