Learned Fine-Tuner for Incongruous Few-Shot Learning
Abstract
Model-agnostic meta-learning (MAML) effectively meta-learns an initialization of model parameters for few-shot learning where all learning problems share the same format of model parameters – congruous meta-learning. We extend MAML to incongruous meta-learning where different yet related few-shot learning problems may not share any model parameters. In this setup, we propose the use of a Learned Fine-Tuner (LFT) to replace hand-designed optimizers (such as SGD) for the task-specific fine-tuning. The meta-learned initialization in MAML is replaced by learned optimizers based on the learning-to-optimize (L2O) framework to meta-learn across incongruous tasks such that models fine-tuned with LFT (even from random initializations) adapt quickly to new tasks. The introduction of LFT within MAML (i) offers the capability to tackle few-shot learning tasks by meta-learning across incongruous yet related problems (e.g., classification over images of different sizes and model architectures), and (ii) can efficiently work with first-order and derivative-free few-shot learning problems. Theoretically, we quantify the difference between LFT (for MAML) and L2O. Empirically, we demonstrate the effectiveness of LFT through both synthetic and real problems and a novel application of generating universal adversarial attacks across different image sources in the few-shot learning regime.
1 Introduction
Many machine learning methods are inherently data hungry, requiring large amounts of training examples for improved generalization. This limits the applicability of such methods to problems where only a few examples are available. Meta-learning [1] focuses on leveraging past experiences with similar tasks to “warm-start” the learning on new tasks. In the context of learning neural networks, model-agnostic meta-learning (MAML) [2, 3] focuses on gradient-based learning and meta-learns an initialization for a neural network (for supervised and reinforcement learning) with an explicit goal of fast adaptation – the ability to learn a good model with just a few examples (few-shot learning) and a few fine-tuning steps (with gradient descent). While the idea of explicitly optimizing for fast adaptation is very general, the practical interpretation of MAML as “parameter initialization” or “reusable parameters” for learning models [4] limits the scope of this general idea to the situation where the meta-learning and task-specific learning (fine-tuning) occur on the same set of parameters and these parameters are explicitly shared between different learning tasks. Meta-learning is thus restricted to tasks that share the same parameter format (for example, parameters of neural networks with the same architecture). We term these congruous tasks.
However, similar tasks with different sets of parameters – incongruous tasks – such as tasks involving learning networks with different architectures cannot be meta-learned across with MAML. For example, focusing on image classification tasks for digits, we might wish to use a network with just 3 fully connected layers for one image set (say MNIST [5]) and a network with three 5x5 convolutional layers and 5 fully connected layers for a different image set (say SVHN [6]). Even if the tasks are similar (digit classification), it is not clear how these networks would share network parameters that can be meta-learned with MAML. This is because MAML takes advantage of the congruity of the tasks to meta-learn “where to start learning from with only a few examples” – this translates to meta-learning an initialization for model parameters. We remark that our incongruous setting is different from the ‘heterogeneous’ multi-task setting studied in [7], where the heterogeneity refers to the involvement of different data distributions but the same format of parameters to be optimized across tasks (heterogeneous yet congruous in our context).
A different application of meta-learning is learning-to-optimize (L2O) or learning-to-learn [8, 9], where the optimization trajectories from different optimization tasks with objectives of different optimizee parameters serve as examples for meta-learning the optimizer parameters. These learned optimizers can generalize well to unseen optimization tasks [10, 11], and can train deep learning models better than hand-crafted optimizers such as SGD, RMSProp [12] or Adam [13]. Most research has focused on differentiable objectives, but non-differentiable ones can also be handled [14, 15, 16].
When using learned optimizers with gradients or zeroth-order gradient estimates [17, 15], the optimizers can seamlessly operate on objectives with different sets of optimizee variables. This allows us to meta-learn across incongruous tasks. However, this meta-learning is distinct from MAML-based schemes on two counts. First, MAML is designed for few-shot learning problems while L2O focuses on solving general optimization problems. Second, and more importantly, there is a difference in the meta-learning philosophy of these two schemes – while MAML focuses on meta-learning “where to start learning from”, L2O meta-learns “how to learn”. While L2O can be used to meta-learn an optimizer for few-shot tasks, it is not explicitly designed for that. Moreover, we are not aware of any (empirical or theoretical) comparison of MAML and L2O for few-shot learning – it is not clear which philosophy is more consequential.
Contributions.
In this paper, we interpret MAML as a general framework for explicit optimization of the fast adaptation objective, and leverage the L2O framework to meta-learn “how to learn with only a few examples” across incongruous few-shot learning tasks (the approach is also applicable to congruous tasks). Specifically, we demonstrate the following:
2 Problem formulation
In this section, we first review model-agnostic meta-learning (MAML) and present its inapplicability to incongruous meta-learning. We then motivate the setup to generalize MAML to meta-learn a fine-tuner instead of an initialization.
Model-Agnostic Meta-Learning.
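MAML meta-learns an initialization of optimizee variables (e.g., model parameters) that enables fast adaptation to new tasks when fine-tuning the optimizee from this learned initialization with only a few new examples. Formally, with $N$ few-shot learning tasks $\{\mathcal{T}_i\}_{i=1}^{N}$, for meta-learning with task $\mathcal{T}_i$, (a) a fine-tuning set $\mathcal{D}_i^{\mathrm{tr}}$ is used for the task-specific inner loop in MAML to fine-tune the initial optimizee $\theta$, and (b) a validation set $\mathcal{D}_i^{\mathrm{val}}$ is used in the outer loop for evaluating the fine-tuned optimizee to meta-update the initialization $\theta$. Thus, MAML solves the following bilevel optimization problem
$$\min_{\theta} \; \frac{1}{N}\sum_{i=1}^{N} f_i\big(\theta_i^{*}(\theta); \mathcal{D}_i^{\mathrm{val}}\big) \quad \text{subject to} \quad \theta_i^{*}(\theta) = \arg\min_{\theta_i} f_i\big(\theta_i(\theta); \mathcal{D}_i^{\mathrm{tr}}\big), \tag{1}$$
where $\theta_i$ is the task-specific optimizee and $f_i(\theta_i; \mathcal{D})$ is the task-specific loss evaluated on data $\mathcal{D}$ using the variable $\theta_i$ obtained from fine-tuning the meta-learned initialization $\theta$. Problem (1) provides a generalized formulation of MAML.
Solving the bilevel program (1) is challenging. In MAML and variants [2, 18, 19], the inner loop is a $K$-step gradient descent (GD) with the initial $\theta_i^{(0)} = \theta$, the final $\theta_i^{*} = \theta_i^{(K)}$, and
$$\theta_i^{(k)} = \theta_i^{(k-1)} - \alpha \nabla_{\theta_i} f_i\big(\theta_i^{(k-1)}; \mathcal{D}_i^{\mathrm{tr}}\big), \tag{2}$$
where $\theta_i^{(k)}$ is the $k$th-step optimizee fine-tuned with $\mathcal{D}_i^{\mathrm{tr}}$ from initialization $\theta$, $\alpha > 0$ is a learning rate, and $k \in \{1, \dots, K\}$. Although GD (2) and variants solve the inner minimization in (1) efficiently, the outer loop requires the second-order derivative of $f_i$ with respect to (w.r.t.) $\theta$. With large $K$, MAML faces the issue of vanishing gradients.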
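The inner loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a least-squares task loss, and the names `task_loss`, `task_grad`, `inner_loop_gd`, `alpha`, and `num_steps`, as well as the toy data, are hypothetical.

```python
import numpy as np

def task_loss(theta, X, y):
    """Mean squared error of a linear model; stands in for the task-specific loss."""
    residual = X @ theta - y
    return 0.5 * np.mean(residual ** 2)

def task_grad(theta, X, y):
    """Gradient of task_loss with respect to theta."""
    return X.T @ (X @ theta - y) / len(y)

def inner_loop_gd(theta_init, X_tr, y_tr, alpha=0.1, num_steps=5):
    """Fine-tune from theta_init with num_steps plain gradient-descent steps."""
    theta = theta_init.copy()
    trajectory = [theta.copy()]
    for _ in range(num_steps):
        theta = theta - alpha * task_grad(theta, X_tr, y_tr)
        trajectory.append(theta.copy())
    return trajectory

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(5, 3))            # a 5-example fine-tuning set (assumed)
y_tr = X_tr @ np.array([1.0, -2.0, 0.5])
theta0 = np.zeros(3)                      # plays the role of the shared initialization
traj = inner_loop_gd(theta0, X_tr, y_tr)
losses = [task_loss(t, X_tr, y_tr) for t in traj]
```

In MAML the outer loop would differentiate the validation loss at the last element of `traj` with respect to `theta0`, which is where the second-order terms arise.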
In MAML, both levels of the optimization operate on the same optimizee (for example, the same network parameters), and accordingly, learning tasks are restricted to problems which share the same optimizee. However, in the general meta-learning setting, similar tasks could be from related yet incongruous domains corresponding to different objectives with optimizee variables of different dimensions that cannot be shared between tasks. For example, adversarial perturbation parameters cannot be shared between images from different data sources; network parameters from different architectures cannot be shared even if they are solving related learning tasks. In such cases, meta-learning the initialization is not applicable. Instead, we propose to meta-learn an optimizer – the fine-tuner – for fast adaptation of the task-specific optimizee in a few-shot setting even when meta-learning across incongruous tasks.
Learning to optimize.
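The L2O framework allows us to replace the hand-designed GD (2) with a learnable recurrent neural network (RNN) parameterized by $\phi$. For any task $\mathcal{T}_i$, the model $\mathrm{RNN}_\phi$ mimics a hand-crafted gradient-based optimizer to output a descent direction to update the task-specific optimizee variable $\theta_i$ given the function gradients as input. Thus, we replace (2) with
$$\theta_i^{(k)} = \theta_i^{(k-1)} - \mathrm{RNN}_\phi\big(g_i(\theta_i^{(k-1)}; \mathcal{D}_i^{\mathrm{tr}}),\, h_i^{(k-1)}\big), \tag{3}$$
where $h_i^{(k-1)}$ denotes the state of $\mathrm{RNN}_\phi$ entering the $k$th RNN unrolling step, and $g_i$ represents the gradients or gradient estimates [20, 15]. Each task-specific $\theta_i^{(0)}$ gets randomly initialized.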
2.1 Learned fine-tuners for MAML
Based on (3), we ask: Is it possible to meta-learn the optimizer (fine-tuner) that enables fast adaptation to new tasks in the MAML inner loop? We term this the learned fine-tuner (LFT) for incongruous few-shot learning. Combining (3) with (1), we can cast the meta-learning of an LFT as
$$\min_{\phi} \; \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} w_k\, f_i\big(\theta_i^{(k)}(\phi); \mathcal{D}_i^{\mathrm{val}}\big) \quad \text{subject to } \theta_i^{(k)}(\phi) \text{ given by (3)}, \tag{4}$$
where $w_k$ is an importance weight for the $k$th unrolled RNN step in (3). We can set (i) $w_k = 1$ [9], (ii) the weighting of [15], or (iii) $w_k = 1$ if $k = K$ and $0$ otherwise [11]. Choice (iii) matches the MAML objective (1), which focuses on the final fine-tuned solution. However, unlike MAML, problem (4) meta-learns the fine-tuner $\phi$ instead of an initialization $\theta$, as depicted in Figure 1.
Comparison to L2O.
Problem (4) is a general version of the meta-learning in L2O, where we also meta-learn the optimizer parameters $\phi$. The key difference is that L2O has no separate validation set – the fine-tuning set $\mathcal{D}_i^{\mathrm{tr}}$ is also used for the outer-loop update of the RNN parameters $\phi$. While this difference appears minor, we theoretically quantify it (see Theorem 1), and demonstrate that, in few-shot learning, this change leads to significant performance gains. L2O learns $\phi$ by minimizing the fine-tuning loss over the unrolled trajectory. We meta-learn $\phi$ by directly minimizing the generalization loss (estimated with $\mathcal{D}_i^{\mathrm{val}}$) over the unrolled trajectory. Our empirical results show that the RNN is able to leverage this difference for improved generalization in both congruous and incongruous few-shot tasks.
(5) 
3 Algorithmic Framework for LFT
The meta-learning problem (4) is a bilevel optimization, similar to MAML (1). However, both inner and outer levels are distinct from MAML. In the inner level, we update a task-specific optimizee $\theta_i$ by unrolling $\mathrm{RNN}_\phi$ for $K$ steps from a random initial state $\theta_i^{(0)}$; by contrast, MAML uses GD to update $\theta_i$ from the meta-learned initialization $\theta$ (that is, $\theta_i^{(0)} = \theta$). In the outer level, we minimize the objective (4) w.r.t. the optimizer $\phi$ instead of the optimizee initialization $\theta$. We present our proposed scheme in Algorithm 1.
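To make the two levels concrete, the following is a heavily simplified sketch rather than Algorithm 1 itself. It rests on several assumptions: the RNN is replaced by a two-parameter momentum rule (`phi` holds a log step size and a momentum logit), the meta-gradient is approximated by central finite differences instead of backpropagation through the unroll, all weight is placed on the final fine-tuning step, and the tasks are toy least-squares problems of different dimensions. All function names are hypothetical.

```python
import numpy as np

def mse(theta, X, y):
    return 0.5 * np.mean((X @ theta - y) ** 2)

def mse_grad(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

def unroll(phi, task, num_steps=10):
    """Inner level: fine-tune a task-specific optimizee with the learned rule."""
    step_size = np.exp(phi[0])
    momentum = 1.0 / (1.0 + np.exp(-phi[1]))
    theta = task["theta0"].copy()          # random, task-specific initialization
    velocity = np.zeros_like(theta)
    for _ in range(num_steps):
        grad = mse_grad(theta, task["X_tr"], task["y_tr"])
        velocity = momentum * velocity + grad
        theta = theta - step_size * velocity
    return theta

def meta_loss(phi, tasks):
    """Outer level: validation loss of the fine-tuned optimizees."""
    return np.mean([mse(unroll(phi, t), t["X_val"], t["y_val"]) for t in tasks])

def fd_meta_gradient(phi, tasks, eps=1e-4):
    """Central finite-difference stand-in for the meta-learning gradient."""
    grad = np.zeros_like(phi)
    for j in range(len(phi)):
        e = np.zeros_like(phi)
        e[j] = eps
        grad[j] = (meta_loss(phi + e, tasks) - meta_loss(phi - e, tasks)) / (2 * eps)
    return grad

def make_task(dim, rng, shots=5):
    w = rng.normal(size=dim)
    X_tr, X_val = rng.normal(size=(shots, dim)), rng.normal(size=(shots, dim))
    return {"X_tr": X_tr, "y_tr": X_tr @ w, "X_val": X_val, "y_val": X_val @ w,
            "theta0": rng.normal(size=dim)}

rng = np.random.default_rng(0)
tasks = [make_task(dim, rng) for dim in (2, 5, 2, 5)]   # optimizees of different sizes
phi = np.array([np.log(0.01), 0.0])
initial = meta_loss(phi, tasks)
for _ in range(30):
    current = meta_loss(phi, tasks)
    g = fd_meta_gradient(phi, tasks)
    lr = 0.5
    # Backtrack so that only meta-updates that lower the validation loss are accepted.
    while lr > 1e-6 and not meta_loss(phi - lr * g, tasks) < current:
        lr *= 0.5
    if lr > 1e-6:
        phi = phi - lr * g
final = meta_loss(phi, tasks)
```

Evaluating `meta_loss` on `X_tr`, `y_tr` instead of `X_val`, `y_val` would give an L2O-style objective in this toy setting, which is the distinction drawn in the comparison to L2O.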
In what follows, we discuss our proposed meta-learning (Alg. 1), showcasing its (i) general ability to meta-learn across incongruous tasks, (ii) applicability to zeroth-order (ZO) optimization, (iii) closed-form, recursively computable meta-learning gradient, and (iv) theoretical difference from L2O.
Incongruous meta-learning.
When fine-tuning the task-specific optimizee variable $\theta_i$ by $\mathrm{RNN}_\phi$ (Algorithm 1, Step 6), we can use an invariant RNN architecture to tolerate the task-specific variations in the dimensions of the optimizee variables $\{\theta_i\}$. Recall from (3) that $\mathrm{RNN}_\phi$ uses the gradient or gradient estimate as an input, which has the same dimension as $\theta_i$. At first glance, a single $\mathrm{RNN}_\phi$ seems incapable of handling incongruous $\{f_i\}$ defined over optimizee variables of different dimensionalities. However, if $\mathrm{RNN}_\phi$ is configured as a coordinate-wise Long Short-Term Memory (LSTM) network (proposed by [9]), it is invariant to the dimensionality of optimizee variables since $\mathrm{RNN}_\phi$ is independently applied to each coordinate of $\theta_i$ regardless of its dimensionality. In contrast to MAML, the invariant $\mathrm{RNN}_\phi$ expands the application domain of LFT beyond model weights/parameters over congruous tasks to incongruous ones such as designing universal adversarial perturbations across incongruous attack tasks.
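The dimension invariance of a coordinate-wise optimizer can be illustrated with a toy recurrent cell. This sketch is an assumption-laden simplification: a three-parameter tanh recurrence replaces the LSTM, and the names `coordinatewise_step`, `finetune`, and the quadratic toy objectives are hypothetical.

```python
import numpy as np

def coordinatewise_step(phi, grad, state):
    """One update of a toy coordinate-wise recurrent optimizer.

    phi = (w_g, w_h, w_out) is shared by every coordinate, so the same
    parameters apply to a gradient of any shape.
    """
    w_g, w_h, w_out = phi
    new_state = np.tanh(w_g * grad + w_h * state)   # elementwise recurrence
    update = w_out * new_state                      # elementwise output
    return update, new_state

def finetune(phi, grad_fn, theta, num_steps):
    state = np.zeros_like(theta)                    # one hidden value per coordinate
    for _ in range(num_steps):
        update, state = coordinatewise_step(phi, grad_fn(theta), state)
        theta = theta - update
    return theta

phi = (1.0, 0.5, 0.1)

# Task A has a 3-dimensional optimizee, task B a 4x2 optimizee; phi is reused as is.
target_a = np.array([1.0, -1.0, 2.0])
target_b = np.arange(8, dtype=float).reshape(4, 2)
theta_a = finetune(phi, lambda t: t - target_a, np.zeros(3), num_steps=50)
theta_b = finetune(phi, lambda t: t - target_b, np.zeros((4, 2)), num_steps=50)
```

Because every operation is elementwise, nothing in `phi` depends on the shape of the optimizee, which is the property that lets a single fine-tuner serve incongruous tasks.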
Derivative-free meta-learning.
The use of L2O in (3) also allows us to update the task-specific optimizee variable $\theta_i$ using not only first-order (FO) information (gradients) but also zeroth-order (ZO) information (function values) if the loss function $f_i$ is a black-box objective function. We can estimate the gradient with finite differences of function values [17, 15]:
$$\hat{\nabla}_{\theta} f_i(\theta) = \frac{1}{\mu q}\sum_{j=1}^{q} \big[f_i(\theta + \mu u_j) - f_i(\theta)\big]\, u_j, \tag{6}$$
where $\mu > 0$ is a small step size (the smoothing parameter), and $\{u_j\}_{j=1}^{q}$ are $q$ random directions with entries from $\mathcal{N}(0, 1)$. This gives us ZO-LFT alongside our original FO-LFT. The RNN input $g_i$ in (3) can also be more sophisticated quantities derived from gradients as proposed in [10, 11, 16].
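A random-direction finite-difference estimator of the kind described above can be sketched as follows. The function name `zo_gradient`, its arguments, and the quadratic test objective are assumptions made for illustration; the normalization follows the common Gaussian-smoothing form and may differ from the paper's exact estimator.

```python
import numpy as np

def zo_gradient(f, theta, mu=1e-3, num_directions=20, rng=None):
    """Estimate the gradient of f at theta from function values only."""
    rng = np.random.default_rng() if rng is None else rng
    base = f(theta)
    estimate = np.zeros_like(theta)
    for _ in range(num_directions):
        u = rng.standard_normal(theta.shape)           # random Gaussian direction
        estimate += (f(theta + mu * u) - base) / mu * u
    return estimate / num_directions

f = lambda t: 0.5 * np.sum(t ** 2)                      # true gradient is t itself
theta = np.array([1.0, -2.0, 0.5])
g_hat = zo_gradient(f, theta, num_directions=20000, rng=np.random.default_rng(0))
```

The estimate is noisy for a small number of directions, so `g_hat` approaches `theta` only as `num_directions` grows; in a ZO fine-tuner this estimate would be fed to the learned optimizer in place of the exact gradient.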
Metalearning gradient.
Since Algorithm 1 metalearns the optimizer variable rather than the initialization of optimizee variable , it requires a different metalearning gradient . Focusing only on in (5) and dropping the task index :
(7) 
where denotes a matrix product that the chain rule obeys [21]. Statement 1 details the recursive computation of , which calls for the secondorder (or firstorder) derivative of w.r.t. if denotes the gradient (or gradient estimate) of in (3). We refer readers to Supplement A for the details of derivation.
Statement 1
It is clear from (8) and (9) that the secondorder derivatives are involved without additional assumptions due to the presence of if is specified by the firstorder derivative w.r.t. . If it is specified by the ZO gradient estimate, then there will only be firstorder derivatives involved in (9) and (8). With the use of the coordinatewise RNN, the terms , , , correspond to diagonal matrices. Note that and . Unlike MAML, this recursively defined metalearning gradient w.r.t. is not as prone to the issue of vanishing gradients for large values of .
Theoretical Analysis.
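L2O is empirically very capable of meta-learning from optimization trajectories, with better convergence than hand-crafted optimizers. Using notation from Sec. 2, the L2O objective can be written as:
$$\min_{\phi} \; \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} w_k\, f_i\big(\theta_i^{(k)}(\phi); \mathcal{D}_i^{\mathrm{tr}}\big). \tag{10}$$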
Under standard assumptions, we show the following result, quantifying the difference between L2O and our proposed metalearning with respect to the size of the metalearning gradient w.r.t. in Algorithm 1 (Supplement B):
Theorem 1
Remark 1
When and are both large enough the difference between L2O and our proposed metalearning is small. In this case, we reduce to L2O. From previous works, we know that L2O can metalearn optimizers with good convergence properties, implying that our LFTs would also converge to similar results by leveraging the RNN structure.
Remark 2
When the data size is small – the fewshot learning regime – there could be a gap between the two frameworks, resulting in a significant difference in the solutions generated by L2O and our scheme, especially for the case where or is large. This potentially explains the significant difference between the empirical performance of L2O and our LFTs in the evaluation over fewshot learning problems.
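Table 1: Averaged attack success rate (ASR), perturbation norm (distortion), and number of fine-tuning steps needed to first reach the target ASR for UAPs generated by each meta-learner. Rows give the meta-training source; column groups give the test source.

| Training | Method | MNIST: ASR | MNIST: distortion | MNIST: steps | CIFAR10: ASR | CIFAR10: distortion | CIFAR10: steps | MNIST + CIFAR10: ASR | MNIST + CIFAR10: distortion | MNIST + CIFAR10: steps |
|---|---|---|---|---|---|---|---|---|---|---|
| MNIST | MAML | 52% | 0.14 | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| MNIST | L2O | 85% | 0.116 | 122 | 0% | 0.05 | N/A | 25% | 0.072 | N/A |
| MNIST | LFT | 100% | 0.104 | 55 | 25% | 0.055 | N/A | 50% | 0.079 | N/A |
| MNIST + CIFAR10 | L2O | 77% | 0.112 | 125 | 95% | 0.069 | 72 | 92% | 0.096 | 93 |
| MNIST + CIFAR10 | LFT | 93% | 0.101 | 92 | 100% | 0.063 | 55 | 100% | 0.89 | 68 |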
4 Experiment: Generating Universal Attack against Hybrid Image Sources
Recent research demonstrates the lack of robustness of deep neural network (DNN) models against adversarial perturbations/attacks [23, 24, 25, 26] – imperceptible perturbations to input examples (e.g., images) – crafted to manipulate the DNN prediction. The problem of universal adversarial perturbation (UAP) seeks a single perturbation pattern to manipulate the outputs of the DNN on multiple examples simultaneously [27]. Often, this universal perturbation is learned with a set of “training” examples and then applied to unseen “test” examples. However, learning perturbations in a few-shot setting with just a few examples, while still successfully attacking unseen examples, is very challenging.
Specifically, the attacker aims to fool a well-trained DNN by perturbing input images with the UAP. Let denote the probability predicted by the DNN for input and class . Given a task-specific data set (corresponding to a task ), the design of UAP is cast as
(12) 
where is the true label of , is a regularization parameter, and is the C&W attack loss [25], which is (indicating a successful attack) when the incorrect class is predicted as the top-1 class. The second term of (12) is an regularizer, which penalizes the perturbation strength of , measured by its norm.
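A UAP objective of this shape, an attack loss summed over a task's examples plus a norm penalty on the perturbation, can be sketched as follows. Everything specific here is an assumption for illustration: the hinge form of the C&W-style loss on log-probabilities, the use of the $\ell_1$ norm, the sum over examples, and the names `uap_objective`, `logits_fn`, and `lam`.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def uap_objective(delta, X, y, logits_fn, lam):
    """Attack loss of one shared perturbation delta plus a norm penalty."""
    log_p = log_softmax(logits_fn(X + delta))      # delta is broadcast to every example
    rows = np.arange(len(y))
    true_class = log_p[rows, y]
    others = log_p.copy()
    others[rows, y] = -np.inf
    best_other = others.max(axis=1)
    # Hinge: zero once some incorrect class is the top-1 prediction.
    attack_loss = np.maximum(true_class - best_other, 0.0)
    return attack_loss.sum() + lam * np.abs(delta).sum()

W = np.eye(2)                                       # toy linear "victim" classifier
logits_fn = lambda inputs: inputs @ W
X = np.array([[2.0, 0.0], [3.0, 1.0]])
y = np.array([0, 0])
clean = uap_objective(np.zeros(2), X, y, logits_fn, lam=0.1)
attacked = uap_objective(np.array([-3.0, 3.0]), X, y, logits_fn, lam=0.1)
```

With the zero perturbation both toy examples are classified correctly and only the attack term contributes; with the second perturbation both are misclassified and only the penalty term remains.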
Meta-learning for UAP generation.
A direct solution to problem (12) only ensures the attack power of the UAP against the given data set at the specific task. Like any learning problem, the data set needs to be large for the UAP to successfully attack unseen examples. In the few-shot setting (small data set), we want to leverage meta-learning to facilitate better generalization for the learned UAP. The attack loss (12) is treated as the task-specific loss in (4), with the UAP as the optimizee. With multiple few-shot tasks and their corresponding data sets, we meta-learn the fine-tuner to generate UAPs for new few-shot UAP tasks. For comparison, we also consider (i) MAML to meta-learn an initial UAP for each task, and (ii) L2O to meta-learn an optimizer for just solving (12). Experiments are conducted over two types of tasks:
(a) Congruous tasks: Tasks are drawn from the same dataset (MNIST). In this setting, the applicable methods include LFT, MAML, and L2O. We use MAML here since we can have a UAP parameter that is shared across all tasks.
(b) Incongruous tasks: Tasks are drawn from a union of different image sources (in this case MNIST and CIFAR10). Unlike (a), it is not possible to share UAP parameters across all tasks from different image sets. Hence MAML is not applicable.
Experimental setting.
We metalearn LFT, MAML and L2O with 1000 fewshot UAP tasks . In LFT and MAML, the finetuning and metaupdate sets are drawn from the training dataset and the test dataset of an image source, respectively. However, both MNIST and CIFAR10 are used across tasks. In both & , image classes with samples per class are randomly selected. In L2O, is combined with ; there is no metavalidation involved in the metalearning. We evaluate the performance of the metalearning schemes over random unseen fewshot UAP tasks (data for task is generated in a manner described above). Moreover, both LFT and L2O are finetuned from random initialization over test tasks; MAML, when applicable, starts finetuning with the metalearned initialization. We refer readers to Supplement C for more details.
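The task construction described above, with fine-tuning examples drawn from a training pool and meta-update examples drawn from a test pool, can be sketched as below. The function name `sample_task`, the toy label arrays, the choice of 2 classes and 5 samples per class, and the assumption that both sets use the same classes are illustrative and not taken from the paper.

```python
import numpy as np

def sample_task(train_labels, test_labels, num_classes, shots, rng):
    """Pick classes at random, then `shots` example indices per class from each pool."""
    classes = rng.choice(np.unique(train_labels), size=num_classes, replace=False)

    def pick(labels):
        return np.concatenate([
            rng.choice(np.flatnonzero(labels == c), size=shots, replace=False)
            for c in classes
        ])

    return {
        "classes": classes,
        "finetune_idx": pick(train_labels),   # indices into the training pool
        "val_idx": pick(test_labels),         # indices into the test pool
    }

rng = np.random.default_rng(0)
train_labels = np.repeat(np.arange(10), 20)   # toy pool: 10 classes, 20 examples each
test_labels = np.repeat(np.arange(10), 8)     # toy pool: 10 classes, 8 examples each
task = sample_task(train_labels, test_labels, num_classes=2, shots=5, rng=rng)
```

For an L2O-style baseline in this sketch, the two index sets would simply be pooled, since no separate meta-validation set is used there.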
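[Figure 2: Averaged ASR of UAP over test tasks versus the number of fine-tuning steps. Columns correspond to the test setting (MNIST, CIFAR10, MNIST + CIFAR10); rows correspond to the meta-training setting: MNIST (congruous tasks) and MNIST + CIFAR10 (incongruous tasks).]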
Overall performance.
In Table 1, we present the superior performance of LFT to tackle attack tasks involving different image sets (such as MNIST & CIFAR10). Specifically, we present the averaged attack success rate (ASR), norm distortion, and number of finetuning steps required to first reach ASR (within steps) of generated UAP using different metalearners (LFT, MAML, L2O), where the victim DNN is given by the LeNet5 model [22]. Compared to MAML, LFT metalearns the highlevel “how” to generate universal attacks in a fewshot setting without being restricted to a single image set (namely, allowing the mismatch of image source between metatraining and testing). Compared to L2O, LFT is impressively effective in generating fewshot attacks on unseen image sets.
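The two summary quantities used in this evaluation, ASR and the number of fine-tuning steps needed to first reach a given ASR, can be computed as in the sketch below. It assumes an untargeted attack counts as successful when the predicted label differs from the true label; the function names and the example threshold are hypothetical.

```python
import numpy as np

def attack_success_rate(predicted, true_labels):
    """Fraction of perturbed inputs whose predicted label differs from the true label."""
    predicted = np.asarray(predicted)
    true_labels = np.asarray(true_labels)
    return float(np.mean(predicted != true_labels))

def steps_to_reach(asr_per_step, threshold):
    """First fine-tuning step (1-based) at which ASR reaches `threshold`, else None."""
    for step, asr in enumerate(asr_per_step, start=1):
        if asr >= threshold:
            return step
    return None

asr = attack_success_rate([3, 3, 1, 0], [3, 2, 1, 1])
first_step = steps_to_reach([0.1, 0.4, 0.8, 1.0], threshold=0.75)
never = steps_to_reach([0.1, 0.2], threshold=0.5)
```

Returning `None` when the threshold is never reached within the step budget gives a natural way to mark such runs separately from runs that do reach it.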
Detailed results.
In Figure 2, we present the averaged ASR of UAP over test tasks versus the number of fine-tuning steps. We report ASR at every combination of training and evaluation settings, denoted by the pair of datasets. For example, (MNIST, CIFAR10) implies that meta-learning is performed with MNIST and then used to generate UAP against CIFAR10 at (meta-)testing. We use MNIST + CIFAR10 to represent the union of MNIST and CIFAR10 tasks for meta-learning (or meta-testing). The results on distortion strength of UAP are shown in Figure A1 in the supplement. Briefly, we find that LFT yields a UAP generator with faster adaptation, higher ASR, and lower attack distortion than MAML and L2O; details follow.
LFT significantly outperforms MAML and L2O when meta-learning and meta-testing with congruous UAP tasks (MNIST, MNIST). As shown in Figures 2 and A1 (and Table 1), the gains show up in three aspects: (i) the fewest fine-tuning steps are required to reach the target ASR on new tasks; (ii) the highest ASR is achieved at a given number of fine-tuning steps; and (iii) the lowest perturbation strength is needed to achieve the strongest attack. We also note that L2O yields better performance (higher ASR and lower distortion) than MAML. This indicates that meta-learning the optimizer could offer better generalization than meta-learning the optimizee. In these few-shot UAP tasks, the “how to learn” seems more useful than “where to learn from”.
On the other hand, LFT outperforms L2O in the standard transfer attack setting, corresponding to the scenario (MNIST, CIFAR10), where the generator of UAP is learnt over MNIST but tested over CIFAR10. Note that MAML cannot be applied to this scenario, since the meta-learned UAP initialization does not have the same dimension as the test data. Compared to the congruous setting (MNIST, MNIST), the ASR decreases for both LFT and L2O (see Table 1). However, LFT adapts better to unseen tasks. LFT also outperforms L2O in cases that involve incongruous tasks drawn from MNIST + CIFAR10. One interesting observation is that the use of hybrid data sources (incongruous tasks) during meta-training enables the learned fine-tuner to generate UAP with faster adaptation on unseen images; compare rows 1 and 2 of Figure 2. In Supplement E, we present additional results, a comparison to white-box attacks, and visualization of UAP patterns.
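Table 2: Classification accuracy on 2-way 5-shot test tasks for models fine-tuned with MAML, L2O, and LFT. Rows give the meta-training setting; columns give the test setting.

| Training | Method | Test: (MNIST, MLP) | Test: (CIFAR10, CNN) | Test: (MNIST, MLP) + (CIFAR10, CNN) |
|---|---|---|---|---|
| (CIFAR10, CNN) | MAML | N/A | 66% ± 0.8% | N/A |
| (CIFAR10, CNN) | L2O | 27% ± 2.1% | 63% ± 1.4% | 44% ± 1.4% |
| (CIFAR10, CNN) | LFT | 35% ± 1.9% | 66% ± 1.3% | 47% ± 1.5% |
| (MNIST, MLP) + (CIFAR10, CNN) | MAML | N/A | N/A | N/A |
| (MNIST, MLP) + (CIFAR10, CNN) | L2O | 81% ± 1.0% | 53% ± 1.2% | 68% ± 1.1% |
| (MNIST, MLP) + (CIFAR10, CNN) | LFT | 83% ± 0.9% | 57% ± 1.3% | 72% ± 1.2% |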
5 Experiments in Few-Shot Classification and Regression
Application to Image Classification Using Hybrid DNN Models.
In this experiment, we learn DNN-based image classifiers over 2-way 5-shot learning tasks. These tasks are drawn from two image sources, MNIST and CIFAR10. We specify the classifier to be trained as a 3-layer multilayer perceptron (MLP) for MNIST data and a convolutional neural network (CNN) with four CONV layers for CIFAR10 data. Thus, the task-specific optimizee in (4) corresponds to the DNN parameters for a given task. The incongruous tasks are then given by learning DNNs with different architectures (MLP and CNN). We also consider the congruous setting where only one model architecture is adopted across tasks.
In Table 2, we evaluate the classification accuracy of the model parameters acquired from LFT, MAML and L2O over randomly selected 2way 5shot test tasks. Here tasks acquired from different data sources (MNIST or CIFAR10) correspond to different model architectures (MLP or CNN). Our proposed LFT is able to match MAML for congruous tasks (metalearning with tasks from (CIFAR10, CNN) and metatesting with tasks from same set). MAML is not applicable for the incongruous tasks (at metalearning or metatesting). L2O and LFT are applicable, and LFT achieves higher accuracy () throughout. Figure 3 shows that LFT can outperform L2O by up to or more when considering a smaller number of finetuning steps. More results on other fewshot tasks are presented in Supplement F.
Application to supervised regression.
In the next experiment, we revisit the sine wave regression problem presented in MAML [2, Sec. 5.1]. Here multiple few-shot regression tasks are generated by randomly varying the amplitude and the phase of a sinusoid. The goal is to learn a regression model that adapts quickly from observed few-shot regression tasks to unobserved regression tasks. In our experiments, we choose an MLP model with 2 hidden layers as the regressor to be learnt. In LFT, the regressor’s parameters are regarded as the optimizee variable, and the fine-tuner is learnt to obtain a regressor with small prediction error, in terms of mean squared error (MSE), even from a random initialization after a few fine-tuning steps during the testing phase. Please refer to Supplement G for more details on the setup.
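Few-shot sine regression tasks of this kind can be generated as in the sketch below. The ranges used (amplitude in [0.1, 5.0], phase in [0, π], inputs in [-5, 5]) are those of the original MAML sine benchmark and are assumed here; the setup in Supplement G may differ. The function name, the dictionary layout, and the 5-shot default are hypothetical.

```python
import numpy as np

def sample_sine_task(rng, shots=5):
    """Draw one sinusoid and a fine-tuning/validation pair of few-shot samples from it."""
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, np.pi)

    def draw(n):
        x = rng.uniform(-5.0, 5.0, size=(n, 1))
        return x, amplitude * np.sin(x - phase)

    return {
        "amplitude": amplitude,
        "phase": phase,
        "finetune": draw(shots),   # used by the inner fine-tuning loop
        "val": draw(shots),        # used to evaluate the fine-tuned regressor
    }

task = sample_sine_task(np.random.default_rng(0))
x_tr, y_tr = task["finetune"]
```

Each call yields a new regression task; the regressor's parameters for that task are the optimizee that the fine-tuner updates.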
Figure 4 presents the few-shot test MSE versus the fine-tuning iterations. In this example, we consider both FO and ZO variants of meta-learners based on gradients and gradient estimates, respectively. As we can see, LFT outperforms L2O and MAML, and ZO-LFT can be as effective as FO-LFT. We highlight that MAML uses the learnt meta-initialization, whereas LFT fine-tunes from a random initialization. This suggests that learning the optimizer (the meta-fine-tuner) provides better generalization ability than learning the meta-initialization of the optimizee variable.
6 Related work
MAML has been extremely useful in supervised and reinforcement learning (RL), and has been “reframed as a graphical model inference problem” that allows for modeling uncertainty [28]. Grant et al. [29] provide a closely related but distinct interpretation of MAML as “inference for the parameters for a prior distribution in a hierarchical Bayesian model”. The higher-order derivatives through the fine-tuning trajectory and the consequent vanishing gradients during the meta-learning are addressed with “implicit gradients” that only depend on the final fine-tuned result [30]. Specific to RL, various enhancements obviate the second-order derivatives of the RL reward function, such as variance-reduced policy gradients [31] and Monte Carlo zeroth-order Evolution Strategies gradients [32].
MAML requires access to the computationally expensive second-order information of the loss. Fallah et al. [33] study FO-MAML, which ignores the second-order term, and show that if the fine-tuning learning rate is small or the tasks are statistically “close” to each other, the first-order approximation induces negligible error. They also propose HF-MAML, which recovers the guarantees for MAML while avoiding the Hessian computation. Yao et al. [7] propose a hierarchically structured meta-learning (HSML) algorithm that explicitly tailors the transferable knowledge to different task clusters. The core idea is to perform cluster-specific meta-learning, resulting in tighter generalization bounds. Ji et al. [34] present a Hessian-free MAML with multiple fine-tuning steps.
Learned optimizers have long been considered in the context of training neural networks [35, 36, 37]. More recent work has posed optimization with gradients as a reinforcement learning problem [8] or as learning a recurrent neural network (RNN) [9] instead of leveraging the usual hand-crafted optimizers (such as SGD, RMSProp [12], Adam [13]). The RNN-based optimizers have been improved [10, 11] by (i) utilizing hierarchical RNNs that capture the parameter structure in the optimization of DL models, (ii) using RNN inputs inspired by hand-crafted optimizers (such as momentum), and (iii) using a diverse set of optimization objectives (with different hardness levels) to train the RNN. The learned optimizers have also been successful with particle swarm optimization [16] and zeroth-order gradient estimates [15]. However, at this point, there are no theoretical guarantees for learned optimizers.
7 Conclusion
MAML meta-learns an effective initialization of model parameters in few-shot learning. In this paper, we generalize MAML to incongruous few-shot learning by replacing the hand-designed optimizer in the inner fine-tuning loop of MAML with a learned fine-tuner (LFT) in the form of a recurrent neural network (RNN). We show that LFTs can be meta-learned across incongruous tasks and then applied to few-shot problems. We also theoretically quantify the difference between our proposed meta-learning scheme and L2O, highlighting why our scheme would outperform L2O for incongruous few-shot learning. Empirically, we consider a novel application of meta-learning to generate universal adversarial perturbations and show the superior performance of our LFTs over state-of-the-art meta-learners. We also show that our LFTs outperform existing meta-learning schemes for incongruous few-shot classification and regression (when applicable).
Supplementary Material
Appendix A Gradients of MAML loss with respect to RNN parameters
Based on and (3), we obtain
(S1) 
For ease of presentation, we use to represent , and the RNN output of is omitted when its meaning can clearly be inferred from the context. We then have
(S2)  
(S3)  
(S4)  
(S5) 
where the equality holds by chain rule [21], denotes a matrix product that the chain rule obeys, and the term (S5) denotes the derivative w.r.t. by fixing and as constants.
(S6) 
Next, we simplify the term (S4). Let . Note that depends on . So we write
(S7) 
Substituting (S6) and (S7) into (S2), we can then express (S1) as:
(S8) 
where is determined by the recursion (S7).
It is clear from (S7) and (S8) that the second order derivative would at most be involved due to the presence of if is specified by the firstorder derivative w.r.t. . By contrast, if it is specified by the ZO gradient estimate, then there will only be firstorder derivatives involved in (S7) and (S8). Lastly, we remark that the recursive forms of (S7) and (S8) facilitate our computation, and and .
Appendix B Proof of Theorem 1
Before showing the theoretical results, we first state a blanket set of assumptions.
B.1 Assumptions
In practice, the sizes of the data and the variables are limited and the function is also bounded. To proceed, we make the following standard assumptions for quantifying the gradient difference between L2O and LFT.
A1. We assume that gradient estimate is unbiased, i.e.,
(S9) 
where denotes the training/validation data sample of the th task, and stands for .
A2. We assume that the gradient estimate has bounded variance for both , i.e.,
(S10) 
The same assumption is also applied for :
(S11)  
(S12) 
A3. We assume that the size of gradient is uniformly upper bounded, i.e., , .
Proof. Assume that A1–A3 hold. Let . From the definitions of and , we have
(S13)  
(S14)  
(S15)  
(S16)  
(S17)  
(S18)  
(S19) 
where in we use Jensen’s inequality, in we apply the triangle inequality, in we use the chain rule, is true because
(S20)  
(S21)  
(S22) 
where in we use CauchySchwarz inequality, in we use Jensen’s inequality; and similarly we have