Penalty Method for Inversion-Free Deep Bilevel Optimization

# Penalty Method for Inversion-Free Deep Bilevel Optimization

## Abstract

Bilevel optimization problems are at the center of several important machine learning problems such as hyperparameter tuning, data denoising, meta- and few-shot learning, data poisoning. Different from simultaneous or multi-objective optimization, bilevel optimization requires computing the inverse of the Hessian of the lower-level cost function to obtain the exact descent direction for the upper-level cost. In this paper, we propose a new method for solving deep bilevel optimization problems using the penalty function which avoids computing the inverse. We prove convergence of our method under mild conditions and show that it computes the exact hypergradient asymptotically. Small space and time complexity of our method enables us to solve large-scale bilevel problems involving deep neural networks with several million parameters. We present results of our method for data denoising on MNIST/CIFAR10/SVHN datasets, for few-shot learning on Omniglot/Mini-Imagenet datasets and for training-data poisoning on MNIST/Imagenet datasets. In all experiments, our method outperforms or is comparable to previously proposed methods both in terms of accuracy and run-time.

\printAffiliationsAndNotice

## 1 Introduction

Solving a bilevel optimization problem is crucial for the fields of study which involve a competition between two parties or two objectives. Particularly, a bilevel problem arises, if one party makes its choice first affecting the optimal choice for the second party, also known as the Stackelberg model, dating back to 1930’s von Stackelberg (2010). The general form of a bilevel optimization problem is

 minuf(u,v),s.t.v=argminvg(u,v). (1)

A bilevel problem with constraints is of the form , where the lower-level constraint set can depend on . However, we focus on unconstrained problems in this paper. The ‘upper-level’ problem is a usual minimization problem except that is constrained to be the solution to the ‘lower-level’ problem which is dependent on (see Bard (2013) for a review of bilevel optimization). Bilevel optimizations also appears in many important machine learning problems. For example, gradient-based hyperparameter tuning Domke (2012); Maclaurin et al. (2015); Luketina et al. (2016); Pedregosa (2016); Franceschi et al. (2017, 2018), data denoising by importance learning Liu and Tao (2016); Yu et al. (2017); Ren et al. (2018), few-shot learning Ravi and Larochelle (2017); Santoro et al. (2016); Vinyals et al. (2016); Franceschi et al. (2017); Mishra et al. (2017); Snell et al. (2017); Franceschi et al. (2018); Rajeswaran et al. (2019), and training-data poisoning Mei and Zhu (2015); Muñoz-González et al. (2017); Koh and Liang (2017); Shafahi et al. (2018). We explain each of these problems and their bilevel formulations below.

Gradient-based hyperparameter tuning. Finding hyperparameters is an indispensable step in any machine learning problem. Grid search is a popular way of finding the optimal hyperparameters, if the domain of the hyperparameters is a predetermined discrete set or a range. However, when losses are differentiable functions of the hyperparameter(s), we can find optimal hyperparameter values by solving a continuous bilevel optimization. Let and denote hyperparameter(s) and parameter(s) for a class of learning algorithms and be the hypothesis. Then is the validation loss, and is the training loss, defined similarly. The best hyperparameter(s) is then the solution to the following problem

 minuLval(u,w)s.t.w=argminwLtrain(u,w). (2)

Thus, we find the best model parameters for each choice of the hyperparameter , and select that value for the hyperparameter which incurs the smallest validation loss.

Data denoising by importance learning. A common assumption of learning is that the training examples are i.i.d. samples from the same distribution as the test data. However, if training and testing distributions are not identical or if the training examples have been corrupted by noise or modified by adversaries, this assumption is violated. In such cases, re-weighting the importance of each training example, before training, can help reduce the discrepancy between the two distributions. For example, one can up-weight the importance of the examples from the same distribution and down-weight the importance of the rest. Finding the correct weight for each training example can be formulated as a bilevel optimization. Consider to be the vector of non-negative importance values for each training example where is the number of training examples, and are the parameter(s) of a classifier . Then the weighted training error is . Assuming we have a small number of examples, from the same distribution as the test examples (validation examples) the importance learning problem is:

 (3)

Hence, the importance of each training example (vector ) is selected such that the minimizer of the weighted training loss in the lower level also minimizes the validation loss in the upper-level. The final importance values can help to identify good points from the noisy training set and the classifier obtained after solving this optimization will have superior performance compared to the model trained on the noisy data.

Meta-learning. A standard learning problem involves finding the best model from the class of hypotheses for a given task (i.e., data distribution). In contrast, meta-learning is a problem of learning a prior on the hypothesis classes (a.k.a. inductive bias) for a given set of tasks. Few-shot learning is an example of meta-learning, where a learner is trained on several related tasks, during the meta-training phase, so that it can generalize well to unseen (but related) tasks with just few examples, during the meta-testing phase. An effective approach to the few-shot learning problem is to learn a common representation for various tasks and train task specific classifiers on top of this representation. Let be the map that takes raw features to a common representation for all tasks and be the classifier for the -th task, where is the total number of tasks for training. The goal of few-shot learning is to learn both the representation map parameterized by and the set of classifiers parameterized by . Let be the validation loss of task and be the training loss defined similarly, then the bilevel problem for few-shot learning is

 minu∑iLval(u,wi)s.t.wi=argminwiLtrain(u,wi),i=1,⋯,M. (4)

For evaluation of the learned representation, during the meta-test phase, the representation is kept fixed and only the classifiers for the new tasks are trained i.e. where is the total number of tasks for testing.

Training-data poisoning. Recently, machine learning models were shown to be vulnerable to train-time attacks. Different from the test time attacks, here adversary modifies the training data so that the model learned from altered training data performs poorly/differently as compared to the model learned from clean data. The most popular train-time attack method augments the original training data with one or more ‘poisoned’ examples i.e., , assigning them arbitrary labels, to create the poisoned dataset with being the loss on the poisoned training data. The problem of finding poisoning points, that when added to the clean training data hurt the performance of the model trained on it can be formulated as

 minu−Lval(u,w)s.t.w=argminwLpoison(u,w), (5)

where the minus sign in the upper-level is used to maximize the validation loss. This is the formulation for untargeted attacks. For targeted attacks, the upper-level must minimize the validation loss with respect to the intended target labels of the attacker. Another variant of the poisoning attack, only influences the prediction of a single predetermined example. The upper-level cost for such an attack is the loss over this particular example (see Eq. (10) in the Appendix E.4.1).

Challenges of deep bilevel optimization. General bilevel optimization problems cannot be solved using simultaneous optimization of the upper- and lower-level costs. Moreover, exact solution to a bilevel optimization is known to be NP-hard even for linear cost functions Bard (1991). To add to this, recent deep learning models, with millions of variables, make it infeasible to use sophisticated methods beyond the first-order methods. But, for bilevel problems, even the first-order methods are difficult to apply since they require computation of the inverse Hessian–gradient product to get the exact hypergradient (see Sec. 2.1). Since direct inversion of the Hessian is impractical even for moderate-sized problems, many approaches have been proposed to approximate the exact hypergradient including forward/reverse-mode differentiation Maclaurin et al. (2015); Franceschi et al. (2017); Shaban et al. (2018) and approximate inversion by solving a linear system of equations Domke (2012); Pedregosa (2016); Rajeswaran et al. (2019). But, still there is a big room for improvement in these existing approaches in terms of their space and time complexities and their practical performance.

Contributions. We propose a penalty function-based algorithm (Alg. 1) for solving large-scale unconstrained bilevel optimization problems. We prove convergence of the method under mild conditions (Theorem 2) and show that it computes the exact hypergradient asymptotically (Lemma 3). We present complexity analysis of the algorithm showing that it has linear time and constant space complexity (Table 1), making our method superior to forward-mode and reverse-mode differentiation and similar to the approximate inversion based method. Small space and time complexity of our approach enables us to solve large-scale bilevel problems involving deep neural networks (Table 7 in Appendix E). We apply our method (Penalty) to solve various machine learning problems like data denoising, few-shot learning, and training-data poisoning. Penalty performs competitively to the state-of-the-art methods on simpler problems (with convex lower-level cost) and significantly outperforms other methods on complex problems (with non-convex lower-level cost), both in terms of accuracy (Sec. 3) and run-time (Table 2 and Fig. 4 in Appendix D) showing that it is an effective solver for various bilevel problems.

The remainder of the paper is organized as follows. We present and analyze the main algorithm in Sec. 2, perform comprehensive experiments in Sec. 3, and conclude the paper in Sec. 4. Due to space limitation, proofs, experimental settings and additional results are presented in the appendix. All codes are uploaded to Github.

## 2 Inversion-Free Penalty Method

Throughout the paper we have assumed that upper- and lower-level costs and are twice continuously differentiable in both and . We use and to denote gradient vectors, for the matrix , and for the Hessian matrix . Additionally, following previous works we have assumed that the lower-level solution is unique for all and that is invertible everywhere. Later in this section we discuss relaxation for some of these assumptions.

### 2.1 Hypergradient for bilevel optimization

Assuming we can express the solution to the lower-level problem explicitly, we can write the bilevel problem as an equivalent single-level problem . We can use gradient-based approach on this single-level problem and compute the total derivative , called the hypergradient, in previous approaches. Using the chain rule, the total derivative is

 dfdu=∇uf+dvdu⋅∇vf,at(u,v∗(u)) (6)

In reality, can be written explicitly only for trivial problems, but we can still compute using the implicit function theorem. Since at , we get and consequently . Thus, the hypergradient from Eq. (6) is as follows

 dfdu=∇uf+dvdu∇vf=∇uf−∇2uvg(∇2vvg)−1∇vf (7)

Existing approaches Domke (2012); Maclaurin et al. (2015); Pedregosa (2016); Franceschi et al. (2017); Shaban et al. (2018); Rajeswaran et al. (2019) can be viewed as implicit methods of approximating the hypergradient, with distinct trade-offs in efficiency and complexity.

### 2.2 Penalty function approach

A bilevel problem can be considered as a constrained optimization problem since the lower-level optimality is a constraint, in addition to any other constraint in the upper- and the lower-level problems. In this work, we focus on unconstrained bilevel problems i.e. those without any additional constraints on the upper- and lower-level problems. For solving such bilevel problems, the lower-level problem is often replaced by its necessary condition for optimality, resulting in the following problem:

 minu,vf(u,v),s.t.∇vg(u,v)=0 (8)

For general bilevel problems, Eq. (8) and Eq. (1) are not the same Dempe and Dutta (2012). But, with lower-level cost being convex in for each and the assumption that the lower-level solution is unique for each , Eq. (8) is equivalent to Eq. (1).

We now describe the penalty function approach for solving bilevel optimization. The penalty function method is a well-known approach for solving constrained optimization problems (see Bertsekas (1997) for a review) and has been previously applied for solving bilevel problems. However, it was analyzed under strict assumptions and only high-level descriptions of the algorithm were presented before Aiyoshi and Shimizu (1984); Ishizuka and Aiyoshi (1992). The penalty function , optimizes the original cost plus a quadratic penalty term (which penalizes the violation of the necessary conditions for lower-level optimality) in Eq. (8). Let be the minimum of the penalty function for a given :

 (^uk,^vk):=argminu,vf(u,v)+γk2∥∇vg(u,v)∥2 (9)

Then the following convergence result is known.

###### Theorem 1 (Theorem 8.3.1 of Bard (2013)).

Assume and are convex in for any fixed . Let be any positive () and divergent () sequence. If is the corresponding sequence of optimal solutions of the penalty function Eq. (9), then the sequence has limit points any one of which is a solution of Eq. (1).

Even though this is a strong result, its not very practical, since the minimum needs to be computed exactly for each , and moreover and need to be convex in for any . In our approach, we allow -optimal solutions of Eq. (9) and show convergence to a KKT point of Eq. (8) without requiring convexity.

###### Theorem 2.

Suppose is a positive () and convergent () sequence, and is a positive (), non-decreasing (), and divergent () sequence. Let be the sequence of approximate solutions to Eq. (9) with tolerance for all . Then any limit point of satisfies the KKT conditions of the problem in Eq. (8).

Alg. 1 describes our method in which we minimize the penalty function in Eq. (9), alternatively over and . It is essential to note that our method solves a single-level penalty function, Eq. (9), and does not need any intermediate step to compute the approximate hypergradient, unlike other methods which first approximate the solution to the lower-level problem of Eq. (1) and then use an intermediate step (solving a linear system or using reverse/forward mode differentiation) to compute the approximate hypergradient. Lemma 3 (below) shows when the approximate gradient direction , computed from Alg. 1 becomes the exact hypergradient Eq. (7) for bilevel problems.

###### Lemma 3.

Given , let be from Eq. (9). Then, .

Thus if we find the minimizer of the penalty function for given and , Alg. 1 computes the exact hypergradient Eq. (7) at . Furthermore, under the conditions of Theorem 1, as and we get the exact hypergradient asymptotically.

Improvements. A caveat to these theoretical guarantees is that some of the assumptions made for analysis may not be satisfied in practice. Here we discuss simple techniques to address these problems and improve Alg. 1 further. The first problem is related to non-convexity of the lower-level cost , creating the problem that the local minimum of can be either a minimum or a maximum of . To address this we modify the -update for Eq. (9) by adding a ‘regularization’ term to the cost Thus, the -update becomes . This only affects the optimization in the beginning; as the final solution remains unaffected with or without regularization. The second problem is that the tolerance may not be satisfied in a limited time and the optimization may terminate before becomes large enough. A cure to this is the method of multipliers and augmented Lagrangian Bertsekas (1976) which allows the penalty method to find a solution with a finite . Thus we add the term to the penalty function (Eq. (9)) to get and use the method of multiplier to update as . In summary, we use the following update rules in the paper.

 uk+1 ← uk−ρ∇u[f+γk2∥∇vg∥2+∇vgTνk] vk+1 ← vk−σ∇v[f+γk2∥∇vg∥2+∇vgTνk+λkg] νk+1 ← νk+γk∇vg.

These changes are helpful in theory but were only moderately better than original versions empirically (Appendix C).

## 3 Experiments

In this section, we evaluate the performance of the proposed penalty method (Penalty) on various machine learning problems discussed in the introduction. We compare Penalty against both bilevel and non-bilevel solutions to these problems previously reported in the literature.

### 3.1 Synthetic examples

In Fig. 2 we show examples similar to Fig. 1 but with ill-conditioned or singular Hessian for the lower-level problem. The ill-conditioning poses difficulty for the methods since the implicit function theorem requires the invertibility of the Hessian at the solution point. Compared to Fig. 1, Fig. 2 shows that only Penalty converges to the true solution despite the fact that we add regularization in ApproxGrad to improve the ill-conditioning when solving the linear systems by minimization. We ascribe the robustness of Penalty to its simplicity and to the fact that it naturally handles non-uniqueness of the lower-level solution (see Appendix C.3). Additionally, we report the wall clock times for different methods on the four examples tested here in Table 2. We can see that as we increase the number of lower-level iterations all methods get slower but Penalty is faster than both RMD and ApproxGrad. Although, Penalty is slower than GD but as shown in Fig. 1 and Fig. 2, GD does not converge to optima for most of the synthetic examples.

### 3.2 Data denoising by importance learning

Now, we evaluate the performance of Penalty for learning a classifier from a dataset with corrupted labels (training data). We pose the problem as an importance learning problem presented in Eq. (3). We evaluate the performance of the classifier learned by Penalty, with 20 lower-level updates, against the following classifiers: Oracle: classifier trained on the portion of training data with clean labels and the validation data, Val-only: classifier trained only on the validation data, Train+Val: classifier trained on the entire training and validation data, ApproxGrad: classifier trained with our implementation of ApproxGrad, with 20 lower-level and 20 linear system updates. We test the performance on MNIST, CIFAR10 and SVHN datasets with validation set sizes of 1000, 10000 and 1000 points respectively. We used convolutional neural networks (architectures described in Appendix E.2) at the lower-level for this task. Table 3 summarizes our results for this problem and shows that Penalty outperforms Val-only, Train+Val and ApproxGrad by significant margins and in fact performs very close to the Oracle classifier (which is the ideal classifier), even for high noise levels. This demonstrates that Penalty is extremely effective in solving bilevel problems involving several million variables (see Table 7 in Appendix) and shows its effectiveness at handling non-convex problems. Along with improvement in terms of accuracy over other bilevel methods like ApproxGrad, Penalty also gives better run-time per upper-level iteration, leading to a decrease in the overall wall-clock time of the experiments (Fig. 4(a) in Appendix D).

We compared Penalty against the recent method of Ren et al. which uses a meta-learning based approach and assigns weights to examples based on their gradient directions. We used the same setting as their uniform flip experiment with 36% label noise on CIFAR dataset and Wide ResNet 28-10 (WRN-28-10) model (see Appendix  E.2). Using =1 for Penalty and 1000 validation points, we get an accuracy of 87.41 0.26 (mean s.d. of 5 trials) higher than 86.92 0.19 reported by them. Due to the enormous size of WRN-28-10, we do not use larger values of , but expect it to only improve the results based on the Fig. 4(a) in Appendix D. We also compared Penalty against the RMD-based method of Franceschi et al., using the same setting as their Sec. 5.1, on a subset of MNIST data corrupted with 50% label noise and softmax regression as the model (see Appendix E.2). The accuracy of the classifier trained on a subset of the data with points having importance values greater than 0.9 (as computed by Penalty with ) along with the validation set is 90.77% better than 90.09% reported by the RMD-based method.

### 3.3 Few-shot learning

Next, we evaluate the performance of Penalty on the task of learning a common representation for the few-shot learning problem. We use the formulation presented in Eq. (4) and use Omniglot Lake et al. (2015) and Mini-ImageNet Vinyals et al. (2016) datasets for our experiments. Following the protocol proposed by Vinyals et al. (2016) for -way -shot classification, we generate meta-training and meta-testing datasets. Each meta-set is built using images from disjoint classes. For Omniglot, our meta-training set comprises of images from the first 1200 classes and the remaining 423 classes are used in the meta-testing dataset. We also augment the meta-datasets with three different rotations (90, 180 and 270 degrees) of the images as used by Santoro et al. (2016). For the experiments with Mini-Imagenet, we used the split of 64 classes in meta-training, 16 classes in meta-validation and 20 classes in meta-testing as used by Ravi and Larochelle (2017).

## 4 Conclusion

A wide range of interesting machine learning problems can be expressed as bilevel optimization problems, and new applications are still being discovered. So far, the difficulty of solving bilevel optimization has limited its wide-spread use for solving large-scale problems, specially, involving deep models. In this paper we presented an efficient algorithm based on penalty function to solve bilevel optimization, which is both simple and has theoretical and practical advantages over existing methods. As compared to previous methods we demonstrated competitive performance on problems with convex lower-level costs and significant improvement on problems with non-convex lower-level costs both in terms of accuracy and time, highlighting the practical effectiveness of our penalty-based method. In future works, we plan to tackle other challenges in bilevel optimization such as handling additional constraints in both upper- and lower-levels.

Appendix

We provide missing proofs in Appendix A, a review of other methods of hypergradient computation in Appendix B, discuss modifications to improve the Alg. 1 in Appendix C, and the experiment details and additional results in Appendix E.

## Appendix A Proofs

###### Theorem 2.

Suppose is a positive () and convergent () sequence, and is a positive (), non-decreasing (), and a divergent () sequence. Let be the sequence of approximate solutions to Eq. (9) with tolerance for all . Then any limit point of satisfies the KKT conditions of the problem Eq. (8).

###### Proof.

The proof follows the standard proof for penalty function methods, e.g., Nocedal and Wright (2006). Let refer to the pair, and let be any limit point of the sequence , then there is a subsequence such that . From the tolerance condition

 ∥∇w~f(wk;γk)∥= ∥∇wf(wk)+γk∇2wvg(wk)∇vg(wk)∥≤ϵk

we have

 ∥∇2wvg(wk)∇vg(wk)∥≤1γk[∥∇wf(wk)∥+ϵk]

Take the limit with respect to the subsequence on both sides to get

 ∇2wvg(¯¯¯¯w)∇vg(¯¯¯¯w)=0

Since is a tall matrix and is invertible by assumption, is full-rank and therefore , which is the primary feasibility condition in Eq. (8). Furthermore, let , then by definition,

 ∇w~f(wk;γk)=∇wf(wk)−∇2wvg(wk)μk

We can write

 [∇2wvg(wk)T∇2wvg(wk)]μk= ∇2wvg(wk)T[∇wf(wk)−∇w~f(wk;γk)]

The corresponding limit can be found by taking the limit of the subsequence

 ¯¯¯μ:=limk∈Kμk= [∇2wvg(¯¯¯¯w)T∇2wvg(¯¯¯¯w)]−1∇2wvg(¯¯¯¯w)T∇wf(¯¯¯¯w)

Since from the condition , we get

 ∇wf(¯¯¯¯w)−∇2wvg(¯¯¯¯w)¯μ=0

at the limit , which is the stationarity condition of Eq. (8). Together with the feasibility condition , the two KKT conditions of Eq. (8) are satisfied at the limit point. ∎

###### Lemma 3.

Given , let be of Eq. (9). Then, .

###### Proof.

At the minimum the gradient vanishes, that is .
Equivalently, . Then,

 ∇u~f(^v)=∇uf(^v)+γ∇2uvg(^v)∇vg(^v)= ∇uf(^v)−∇2uvg(^v)∇2vvg−1(^v)∇vf(^v),

where disappears, which is the hypergradient . ∎

That is, if we find the minimum of the penalty function for given and , we get the hypergradient Eq. (7) at . Furthermore, under the conditions of Theorem 1, as (see Lemma 8.3.1 of Bard (2013)), and we get the exact hypergradient asymptotically.

## Appendix B Review of other bilevel optimization methods

Several methods have been proposed to solve bilevel optimization problems appearing in machine learning, including forward/reverse-mode differentiation Maclaurin et al. (2015); Franceschi et al. (2017) and approximate gradient Domke (2012); Pedregosa (2016) described briefly here.

Forward-mode (FMD) and Reverse-mode differentiation (RMD). Domke Domke (2012), Maclaurin et al.Maclaurin et al. (2015), Franceschi et al. Franceschi et al. (2017), and Shaban et al. Shaban et al. (2018) studied forward and reverse-mode differentiation to solve the minimization problem where the lower-level variable follows a dynamical system . This setting is more general than that of a bilevel problem. However, a stable dynamical system is one that converges to a steady state and thus, the process can be considered as minimizing an energy or a potential function.

Define and , then the hypergradient Eq. (7) can be computed by

 dfdu=∇uf(u,vT)+T∑t=0BtAt+1×⋯×AT∇vf(u,vT)

When the lower-level process is one step of gradient descent on a cost function , that is,

 Φt+1(vt;u)=vt−ρ∇vg(u,vt)

we get

 At+1=I−ρ∇2vvg(u,vt),Bt+1=−ρ∇2uvg(u,vt).

is of dimension and is of dimension . The sequences and can be computed in forward or reverse mode. For reverse-mode differentiation, first compute

 vt+1=Φt+1(vt),t=0,1,⋯,T-1,

then compute

 qT←∇vf(u,vT),pT←∇uf(u,vT) pt−1←pt+Btqt,qt−1←Atqt,t=T,T-1,⋯,1.

Time and space Complexity for computing is since the Jacobian vector product can be computed in time and space. The final hypergradient for RMD is .
Hence the final time complexity for RMD is and space complexity is .

For forward-mode differentiation, simultaneously compute , , and

 P0←0,Pt+1←PtAt+1+Bt+1,t=0,1,⋯,T-1.

Time complexity for computing is since can be computed using Hessian vector products each needing and also needs using unit vectors for . The space complexity for each is . The final hypergradient for FMD is

 dfdu=∇uf(u,vT)+PT∇vf(vT).

Hence the final time complexity for FMD is and space complexity is .

Approximate hypergradient (ApproxGrad). Since computing the inverse of the Hessian directly is difficult even for moderately-sized neural networks, Domke Domke (2012) proposed to find an approximate solution to by solving the linear system of equations . This can be done by solving

 minq∥∇2vvg⋅q−∇vf∥

using gradient descent, conjugate gradient descent or any other iterative solver. Note that the minimization requires evaluation of the Hessian-vector product, which can be done in linear time Pearlmutter (1994). Hence the time complexity of the method is and space complexity is since we only need to store single copy of and same as Penalty. The asymptotic convergence with approximate solutions was shown by Pedregosa Pedregosa (2016).

## Appendix C Improvements to Algorithm  1

Here we discuss the details of the modifications to Alg. 1 presented in the main text which can be added to improve the performance of the algorithm in practice.

### c.1 Improving local convexity by regularization

One of the common assumptions of this and previous works is that is invertible and locally positive definite. Neither invertibility nor positive definiteness hold in general for bilevel problems, involving deep neural networks, and this causes difficulties in the optimization. Note that if is non-convex in , minimizing the penalty term does not neccesarily lower the cost but instead just moves the variable towards a stationary point – which is a known problem even for the Newton’s method. Thus we propose the following modification to the -update:

 minv[f+γk2∥∇vg∥2+λkg]

keeping the -update intact. To see how this affects the optimization, note that -update becomes

 v←v−ρ[fv+γk∇2vvg∇vg+λk∇vg]

After converges to a stationary point, we get
, and after plugging this into -update, we get

 u←u−σ[∇uf−∇2uvg(∇2vvg+λkγkI)−1∇vf]

that is, the Hessian inverse is replaced by a regularized version to improve the positive definiteness of the Hessian. With a decreasing or constant sequence such that the regularization does not change to solution.

### c.2 Convergence with finite γk

The penalty function method is intuitive and easy to implement, but the sequence is guaranteed to converge to an optimal solution only in the limit with , which may not be achieved in practice in a limited time. It is known that the penalty method can be improved by introducing an additional term into the function, which is called the augmented Lagrangian (penalty) method Bertsekas (1976):

 minu,v[f+γk2∥∇vg∥2+∇vgTν].

This new term allows convergence to the optimal solution even when is finite. Furthermore, using the update rule , called the method of multipliers, it is known that converges to the true Lagrange multiplier of this problem corresponding to the equality constraints .

### c.3 Non-unique lower-level solution

Most existing methods have assumed that the lower-level solution is unique for all . Regularization from the previous section, can improve the ill-conditioning of the Hessian but it does not address the case of multiple disconnected global minima of . With multiple lower-level solutions , there is an ambiguity in defining the upper-level problem. If we assume that is chosen adversarially (or pessimistically), then the upper-level problem should be defined as

 minumaxv∈Z(u)f(u,v).

If is chosen co-operatively (or optimistically), then the upper-level problem should be defined as

 minuminv∈Z(u)f(u,v),

and the results can be quite different between these two cases. Note that the proposed penalty function method is naturally solving the optimistic case, as Alg. 1 is solving the problem of by alternating gradient descent. However, with a gradient-based method, we cannot hope to find all disconnected multiple solutions. In a related problem of min-max optimization, which is a special case of bilevel optimization, an algorithm for handling non-unique solutions was proposed recently Hamm and Noh (2018). This idea of keeping multiple candidate solution may be applicable to bilevel problems too and further analysis of the non-unique lower-level problem is left as future work.

### c.4 Modified algorithm

Here we present the modified algorithm which incorporates regularization (Sec. C.1) and augmented Lagrangian (Sec. C.2) as discussed previously. The augmented Lagrangian term applies to both - and -update, but the regularization term applies to only the -update as its purpose is to improve the ill-conditioning of during -update. The modified penalized functions for -update and for -update are

 ~f1(u,v;γ,ν):=f+γ2∥∇vg∥2+∇vgTν ~f2(