Learned Fine-Tuner for Incongruous Few-Shot Learning
Abstract
Model-agnostic meta-learning (MAML) effectively meta-learns an initialization of model parameters for few-shot learning where all learning problems share the same format of model parameters – congruous meta-learning. We extend MAML to incongruous meta-learning where different yet related few-shot learning problems may not share any model parameters. In this setup, we propose the use of a Learned Fine Tuner (LFT) to replace hand-designed optimizers (such as SGD) for the task-specific fine-tuning. The meta-learned initialization in MAML is replaced by learned optimizers based on the learning-to-optimize (L2O) framework to meta-learn across incongruous tasks such that models fine-tuned with LFT (even from random initializations) adapt quickly to new tasks. The introduction of LFT within MAML (i) offers the capability to tackle few-shot learning tasks by meta-learning across incongruous yet related problems (e.g., classification over images of different sizes and model architectures), and (ii) can efficiently work with first-order and derivative-free few-shot learning problems. Theoretically, we quantify the difference between LFT (for MAML) and L2O. Empirically, we demonstrate the effectiveness of LFT through both synthetic and real problems and a novel application of generating universal adversarial attacks across different image sources in the few-shot learning regime.
1 Introduction
Many machine learning methods are inherently data hungry, requiring large amounts of training examples for improved generalization. This limits the applicability of such methods to problems where only a few examples are available. Meta-learning [1] focuses on leveraging past experience with similar tasks to “warm-start” the learning on new tasks. In the context of learning neural networks, model-agnostic meta-learning (MAML) [2, 3] focuses on gradient-based learning and meta-learns an initialization for a neural network (for supervised and reinforcement learning) with an explicit goal of fast adaptation – the ability to learn a good model with just a few examples (few-shot learning) and a few fine-tuning steps (with gradient descent). While the idea of explicitly optimizing for fast adaptation is very general, the practical interpretation of MAML as “parameter initialization” or “reusable parameters” for learning models [4] limits the scope of this general idea to situations where the meta-learning and task-specific learning (fine-tuning) operate on the same set of parameters, and these parameters are explicitly shared between different learning tasks. Meta-learning is thus restricted to tasks that share the same parameter format (for example, parameters of neural networks with the same architecture). We term these congruous tasks.
However, similar tasks with different sets of parameters – incongruous tasks – such as tasks that involve learning networks with different architectures, cannot be meta-learned across with MAML. For example, focusing on image classification tasks for digits, we might wish to use a network with just 3 fully connected layers for one image set (say MNIST [5]) and a network with 3 convolutional layers (5x5 kernels) and 5 fully connected layers for a different image set (say SVHN [6]). Even if the tasks are similar (digit classification), it is not clear how these networks would share parameters that can be meta-learned with MAML. This is because MAML takes advantage of the congruity of the tasks to meta-learn “where to start learning from with only a few examples” – this translates to meta-learning an initialization for the model parameters. We remark that our incongruous setting is different from the ‘heterogeneous’ multi-task setting studied in [7], where the heterogeneity refers to the involvement of different data distributions but the same format of parameters to be optimized across tasks (heterogeneous yet congruous in our context).
A different application of meta-learning is learning-to-optimize (L2O) or learning-to-learn [8, 9], where optimization trajectories from different optimization tasks – with objectives over different optimizee parameters – serve as examples for meta-learning the optimizer parameters. These learned optimizers can generalize well to unseen optimization tasks [10, 11], and can train deep learning models better than hand-crafted optimizers such as SGD, RMSProp [12] or Adam [13]. Most research has focused on differentiable objectives, but non-differentiable ones can also be handled [14, 15, 16].
When using learned optimizers with gradients or zeroth-order gradient estimates [17, 15], the optimizers can seamlessly operate on objectives with different sets of optimizee variables. This allows us to meta-learn across incongruous tasks. However, this form of meta-learning is distinct from MAML-based schemes on two counts. First, MAML is designed for few-shot learning problems, while L2O focuses on solving general optimization problems. More importantly, there is a difference in the meta-learning philosophy of the two schemes – while MAML focuses on meta-learning “where to start learning from”, L2O meta-learns “how to learn”. While L2O can be used to meta-learn an optimizer for few-shot tasks, it is not explicitly designed for that. Moreover, we are not aware of any (empirical or theoretical) comparison of MAML and L2O for few-shot learning – it is not clear which philosophy is more consequential.

Contributions.
In this paper, we interpret MAML as a general framework for explicit optimization of the fast adaptation objective, and leverage the L2O framework to meta-learn “how to learn with only a few examples” across incongruous few-shot learning tasks (it is obviously applicable to congruous tasks as well). Specifically, we demonstrate the following: (i) LFTs can be meta-learned across incongruous yet related few-shot learning problems (e.g., classification over images of different sizes and with different model architectures); (ii) LFTs can efficiently handle both first-order and derivative-free (zeroth-order) few-shot learning problems; (iii) we theoretically quantify the difference between LFT (for MAML) and L2O (Theorem 1); and (iv) we empirically demonstrate the effectiveness of LFT on synthetic and real problems, including a novel application of generating universal adversarial attacks across different image sources in the few-shot learning regime.
2 Problem formulation
In this section, we first review model-agnostic meta learning (MAML) and present its inapplicability to incongruous meta-learning. We then motivate the setup to generalize MAML to meta-learn a fine-tuner instead of an initialization.
Model Agnostic Meta-Learning.
MAML meta-learns an initialization of the optimizee variables (e.g., model parameters) that enables fast adaptation to new tasks when fine-tuning the optimizee from this learned initialization with only a few new examples. Formally, given few-shot learning tasks $\{\mathcal{T}_i\}_{i=1}^{N}$, for meta-learning with task $\mathcal{T}_i$, (a) a fine-tuning set $\mathcal{D}_i^{\mathrm{tr}}$ is used in the task-specific inner loop of MAML to fine-tune the initial optimizee $\theta$, and (b) a validation set $\mathcal{D}_i^{\mathrm{val}}$ is used in the outer loop to evaluate the fine-tuned optimizee $\theta_i$ and meta-update the initialization $\theta$. Thus, MAML solves the following bi-level optimization problem
$$\min_{\theta}\ \sum_{i=1}^{N} f_i\big(\mathcal{D}_i^{\mathrm{val}};\, \theta_i(\theta)\big)\qquad \text{s.t.}\qquad \theta_i(\theta) = \operatorname*{arg\,min}_{\theta_i}\ f_i\big(\mathcal{D}_i^{\mathrm{tr}};\, \theta_i\big)\ \ \text{(fine-tuned from the initialization } \theta\text{)} \qquad\qquad (1)$$
where $\theta_i$ is the task-specific optimizee and $f_i(\mathcal{D};\, \theta_i)$ is the task-specific loss evaluated on data $\mathcal{D}$ using the variable $\theta_i$ obtained from fine-tuning the meta-learned initialization $\theta$. Problem (1) provides a generalized formulation of MAML.
Solving the bi-level program (1) is challenging. In MAML and its variants [2, 18, 19], the inner loop is a $K$-step gradient descent (GD) with the initial iterate $\theta_i^{(0)} = \theta$, the final iterate taken as the fine-tuned optimizee, and
$$\theta_i^{(k)} = \theta_i^{(k-1)} - \alpha\, \nabla_{\theta} f_i\big(\mathcal{D}_i^{\mathrm{tr}};\, \theta_i^{(k-1)}\big),\qquad k = 1, \dots, K, \qquad\qquad (2)$$
where $\theta_i^{(k)}$ is the $k$-step optimizee fine-tuned with (2) from the initialization $\theta_i^{(0)} = \theta$, $\alpha$ is a learning rate, and $\theta_i = \theta_i^{(K)}$. Although GD (2) and its variants solve the inner minimization in (1) efficiently, the outer loop requires the second-order derivative of $f_i$ with respect to (w.r.t.) $\theta$. With large $K$, MAML faces the issue of vanishing gradients.
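To make the bi-level structure of (1)–(2) concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of one MAML outer iteration; `loss_fn`, `D_tr`, and `D_val` are hypothetical task-supplied objects, and the optimizee is flattened into a single tensor for brevity.

```python
import torch

def maml_meta_step(theta, tasks, alpha=0.01, beta=0.001, K=5):
    """One MAML outer step: K-step GD fine-tuning per task (eq. 2), then a
    meta-update of the shared initialization via the validation loss (eq. 1).
    `tasks` is a list of (loss_fn, D_tr, D_val); loss_fn(params, data) -> scalar."""
    meta_loss = 0.0
    for loss_fn, D_tr, D_val in tasks:
        theta_i = theta                                   # start from the shared initialization
        for _ in range(K):                                # inner loop: K-step gradient descent
            g = torch.autograd.grad(loss_fn(theta_i, D_tr), theta_i, create_graph=True)[0]
            theta_i = theta_i - alpha * g                 # keep the graph: outer grad flows through it
        meta_loss = meta_loss + loss_fn(theta_i, D_val)   # outer objective evaluated on D_val
    # second-order derivatives of the task losses appear in this meta-gradient
    meta_grad = torch.autograd.grad(meta_loss, theta)[0]
    return (theta - beta * meta_grad).detach().requires_grad_(True)
```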
In MAML, both levels of the optimization operate on the same optimizee (for example, same network parameters), and accordingly, learning tasks are restricted to problems which share the same optimizee. However, in the general meta-learning setting, similar tasks could be from related yet incongruous domains corresponding to different objectives with optimizee variables of different dimensions that cannot be shared between tasks. For example, adversarial perturbation parameters cannot be shared between images from different data sources; network parameters from different architectures cannot be shared even if they are solving related learning tasks. In such cases, meta-learning the initialization is not applicable. Instead, we propose to meta-learn an optimizer – the fine-tuner – for fast adaptation of the task-specific optimizee in a few-shot setting even when meta-learning across incongruous tasks.
Learning to optimize.
The L2O framework allows us to replace the hand-designed GD update (2) with a learnable recurrent neural network (RNN) parameterized by $\phi$, denoted $\mathrm{RNN}_{\phi}$. For any task $\mathcal{T}_i$, the model $\mathrm{RNN}_{\phi}$ mimics a hand-crafted gradient-based optimizer: given the function gradients as input, it outputs a descent direction to update the task-specific optimizee variable $\theta_i$. Thus, we replace (2) with
$$\big[\Delta\theta_i^{(k)},\, h_i^{(k)}\big] = \mathrm{RNN}_{\phi}\big(\widehat{\nabla} f_i\big(\theta_i^{(k-1)}\big),\, h_i^{(k-1)}\big),\qquad \theta_i^{(k)} = \theta_i^{(k-1)} + \Delta\theta_i^{(k)},\qquad k = 1, \dots, K, \qquad\qquad (3)$$
where $h_i^{(k)}$ denotes the state of $\mathrm{RNN}_{\phi}$ at the $k$-th unrolling step and $\widehat{\nabla} f_i$ represents the gradients or gradient estimates [20, 15]. Each task-specific initialization $\theta_i^{(0)}$ is chosen randomly.
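As one possible instantiation of (3) (a sketch assuming the coordinate-wise LSTM of [9]; hidden size and scaling are illustrative), the optimizer below treats every coordinate of the input gradient as an independent batch element, which is what later makes it applicable to optimizees of arbitrary dimensionality:

```python
import torch
import torch.nn as nn

class CoordinatewiseRNNOptimizer(nn.Module):
    """RNN_phi of eq. (3): maps per-coordinate (estimated) gradients to updates."""
    def __init__(self, hidden_size=20):
        super().__init__()
        self.lstm = nn.LSTMCell(1, hidden_size)   # shared across all coordinates
        self.head = nn.Linear(hidden_size, 1)     # hidden state -> proposed update

    def init_state(self, num_coords, device=None):
        h = torch.zeros(num_coords, self.lstm.hidden_size, device=device)
        return (h, torch.zeros_like(h))

    def forward(self, grad, state):
        flat = grad.reshape(-1, 1)                # one "batch" element per coordinate
        h, c = self.lstm(flat, state)
        update = self.head(h).reshape(grad.shape)
        return update, (h, c)

# one unrolling step of eq. (3):
#   update, state = rnn_opt(grad_estimate, state);  theta = theta + update
```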
2.1 Learned fine-tuners for MAML
Based on (3), we ask: Is it possible to meta-learn an optimizer (fine-tuner) that, when used in the MAML inner loop, enables fast adaptation to new tasks? We term this the learned fine-tuner (LFT) for incongruous few-shot learning. Combining (3) with (1), we can cast the meta-learning of an LFT as
$$\min_{\phi}\ \sum_{i=1}^{N}\sum_{k=1}^{K} w_k\, f_i\big(\mathcal{D}_i^{\mathrm{val}};\, \theta_i^{(k)}(\phi)\big),\qquad \text{with } \theta_i^{(k)}(\phi) \text{ given by the unrolled fine-tuner (3)}, \qquad\qquad (4)$$
where $w_k$ is an importance weight for the $k$-th unrolled RNN step in (3). We can set the weights (i) uniformly [9], (ii) non-uniformly over the trajectory [15], or (iii) concentrated entirely on the final step $K$ [11]. Choice (iii) matches the MAML objective (1), which focuses on the final fine-tuned solution. However, unlike MAML, problem (4) meta-learns the fine-tuner $\mathrm{RNN}_{\phi}$ instead of an initialization $\theta$, as depicted in Figure 1.
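A hedged sketch of how the weighted objective in (4) can be evaluated for a single task by unrolling (3); it reuses the `CoordinatewiseRNNOptimizer` sketch above, and the default `weights=None` corresponds to choice (iii), i.e., only the final fine-tuned iterate is scored on the validation set:

```python
import torch

def lft_meta_loss(rnn_opt, loss_fn, D_tr, D_val, theta0, K=10, weights=None):
    """Inner unrolling of eq. (3) plus the weighted outer objective of eq. (4)
    for one task; theta0 is the (random) task-specific initialization."""
    theta = theta0
    state = rnn_opt.init_state(theta0.numel(), device=theta0.device)
    losses = []
    for _ in range(K):
        g = torch.autograd.grad(loss_fn(theta, D_tr), theta, create_graph=True)[0]
        update, state = rnn_opt(g, state)         # learned fine-tuner proposes the step
        theta = theta + update
        losses.append(loss_fn(theta, D_val))      # generalization loss along the trajectory
    if weights is None:                           # choice (iii): weight only the final step
        return losses[-1]
    return sum(w * l for w, l in zip(weights, losses))
```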
Comparison to L2O.
Problem (4) is a generalized version of the meta-learning in L2O, where the optimizer parameters $\phi$ are also meta-learned. The key difference is the absence of a separate validation set in L2O – the fine-tuning set $\mathcal{D}_i^{\mathrm{tr}}$ is also used for the outer-loop update of the RNN parameters $\phi$. While this difference appears minor, we theoretically quantify it (see Theorem 1), and demonstrate that, in few-shot learning, this change leads to significant performance gains. L2O learns $\phi$ by minimizing the fine-tuning loss over the unrolled trajectory; we meta-learn $\phi$ by directly minimizing the generalization loss (estimated with $\mathcal{D}_i^{\mathrm{val}}$) over the unrolled trajectory. Our empirical results show that the RNN is able to leverage this difference for improved generalization in both congruous and incongruous few-shot tasks. For later reference, we write the meta-objective of (4) compactly as
$$\ell(\phi)\ :=\ \sum_{i=1}^{N}\sum_{k=1}^{K} w_k\, f_i\big(\mathcal{D}_i^{\mathrm{val}};\, \theta_i^{(k)}(\phi)\big). \qquad\qquad (5)$$
3 Algorithmic Framework for LFT
The meta-learning problem (4) is a bi-level optimization, similar to MAML (1). However, both the inner and outer levels are distinct from MAML: in the inner level, we update a task-specific optimizee $\theta_i$ by unrolling $\mathrm{RNN}_{\phi}$ for $K$ steps from a random initialization $\theta_i^{(0)}$; by contrast, MAML uses GD to update $\theta_i$ from the meta-learned initialization (that is, $\theta_i^{(0)} = \theta$). In the outer level, we minimize the objective (4) w.r.t. the optimizer parameters $\phi$ instead of the optimizee initialization $\theta$. We present our proposed scheme in Algorithm 1.
In what follows, we discuss our proposed meta-learning scheme (Alg. 1), showcasing (i) its general ability to meta-learn across incongruous tasks, (ii) its applicability to zeroth-order (ZO) optimization, (iii) its closed-form, recursively computable meta-learning gradient, and (iv) its theoretical difference from L2O.
Incongruous meta-learning.
When fine-tuning the task-specific optimizee variable $\theta_i$ with $\mathrm{RNN}_{\phi}$ (Algorithm 1, Step 6), we can use a dimension-invariant RNN architecture to tolerate task-specific variations in the dimensions of the optimizee variables. Recall from (3) that $\mathrm{RNN}_{\phi}$ takes the gradient or gradient estimate $\widehat{\nabla} f_i$ as an input, which has the same dimension as $\theta_i$. At first glance, a single $\mathrm{RNN}_{\phi}$ seems incapable of handling incongruous tasks defined over optimizee variables of different dimensionalities. However, if $\mathrm{RNN}_{\phi}$ is configured as a coordinate-wise Long Short-Term Memory (LSTM) network (proposed by [9]), it is invariant to the dimensionality of the optimizee variables, since it is applied independently to each coordinate of $\theta_i$ regardless of the overall dimensionality. In contrast to MAML, this invariant $\mathrm{RNN}_{\phi}$ expands the application domain of LFT beyond model weights/parameters over congruous tasks to incongruous ones, such as designing universal adversarial perturbations across incongruous attack tasks.
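The following toy sketch illustrates the invariance claim (shapes and the quadratic loss are placeholders, not the paper's models): a single shared coordinate-wise optimizer, as defined in the sketch of Sec. 2, fine-tunes two optimizees of different dimensionalities standing in for incongruous tasks.

```python
import torch

rnn_opt = CoordinatewiseRNNOptimizer()                 # single shared fine-tuner (phi)

# Two incongruous optimizees: different dimensionalities, no shared parameters.
theta_mlp = torch.randn(784 * 64, requires_grad=True)        # stand-in for MLP weights
theta_cnn = torch.randn(16 * 3 * 5 * 5, requires_grad=True)  # stand-in for CNN filters

for theta in (theta_mlp, theta_cnn):
    state = rnn_opt.init_state(theta.numel())
    for _ in range(5):                                 # a few fine-tuning steps of eq. (3)
        loss = (theta ** 2).sum()                      # placeholder task loss f_i
        g = torch.autograd.grad(loss, theta, create_graph=True)[0]
        update, state = rnn_opt(g, state)              # same RNN_phi handles both shapes
        theta = theta + update
```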
Derivative-free meta-learning.
The use of L2O in (3) also allows us to update the task-specific optimizee variable using not only first-order (FO) information (gradients) but also zeroth-order (ZO) information (function values) when the loss function is a black-box objective. We can estimate the gradient with finite differences of function values [17, 15]:
$$\widehat{\nabla} f_i\big(\theta_i\big)\ =\ \frac{1}{q}\sum_{j=1}^{q} \frac{f_i\big(\theta_i + \mu\, u_j\big) - f_i\big(\theta_i\big)}{\mu}\; u_j, \qquad\qquad (6)$$
where $\mu > 0$ is a small step size (the smoothing parameter) and $\{u_j\}_{j=1}^{q}$ are random directions with entries drawn from $\mathcal{N}(0, 1)$. This gives us ZO-LFT alongside our original FO-LFT. The input to $\mathrm{RNN}_{\phi}$ can also consist of more sophisticated quantities derived from the gradients, as proposed in [10, 11, 16].
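A minimal sketch of the estimator in (6), assuming standard-normal random directions; only function evaluations of the (possibly black-box) loss are required:

```python
import torch

def zo_gradient(loss_fn, theta, mu=1e-3, q=20):
    """Finite-difference gradient estimate of eq. (6); loss_fn returns a scalar
    value only, so no derivatives of the objective are needed."""
    f0 = loss_fn(theta)
    g = torch.zeros_like(theta)
    for _ in range(q):
        u = torch.randn_like(theta)                    # random direction with N(0, 1) entries
        g = g + (loss_fn(theta + mu * u) - f0) / mu * u
    return g / q
```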
Meta-learning gradient.
Since Algorithm 1 meta-learns the optimizer variable $\phi$ rather than the initialization of the optimizee variable $\theta$, it requires a different meta-learning gradient $\nabla_{\phi}\, \ell(\phi)$. Focusing only on a single term $f\big(\mathcal{D}^{\mathrm{val}};\, \theta^{(k)}\big)$ in (5) and dropping the task index $i$:
$$\nabla_{\phi}\, f\big(\mathcal{D}^{\mathrm{val}};\, \theta^{(k)}\big)\ =\ \frac{d\,\theta^{(k)}}{d\phi}\ \circ\ \nabla_{\theta} f\big(\mathcal{D}^{\mathrm{val}};\, \theta^{(k)}\big), \qquad\qquad (7)$$
where $\circ$ denotes a matrix product that the chain rule obeys [21]. Statement 1 details the recursive computation of $\frac{d\theta^{(k)}}{d\phi}$, which calls for the second-order (respectively, first-order) derivative of $f$ w.r.t. $\theta$ if $\widehat{\nabla} f$ in (3) denotes the gradient (respectively, the gradient estimate) of $f$. We refer readers to Supplement A for the details of the derivation.
Statement 1
It is clear from (8) and (9) that second-order derivatives are involved, without additional assumptions, due to the presence of $\nabla_{\theta}^{2} f$ when the RNN input is specified by the first-order derivative of $f$ w.r.t. $\theta$. If it is instead specified by the ZO gradient estimate (6), then only first-order derivatives are involved in (8) and (9). With the use of the coordinate-wise RNN, the partial derivatives of the RNN output w.r.t. its input and its state correspond to diagonal matrices. Note that $\frac{d\theta^{(0)}}{d\phi} = \mathbf{0}$ and $\frac{d h^{(0)}}{d\phi} = \mathbf{0}$. Unlike MAML, this recursively defined meta-learning gradient w.r.t. $\phi$ is not as prone to the issue of vanishing gradients for large values of $K$.
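In practice, the recursion of Statement 1 need not be coded by hand: unrolling (3) inside an automatic-differentiation framework yields the same meta-learning gradient. The hedged sketch below performs one outer update of the RNN parameters using the per-task routine sketched in Sec. 2.1 and a standard optimizer (e.g., Adam) on those parameters:

```python
import torch

def lft_meta_step(rnn_opt, meta_optimizer, tasks, K=10):
    """One outer update of phi (Algorithm 1): unroll the fine-tuner per task,
    then backpropagate the validation loss through the K unrolled steps."""
    meta_optimizer.zero_grad()
    meta_loss = 0.0
    for loss_fn, D_tr, D_val, theta0 in tasks:          # theta0: random task-specific init
        meta_loss = meta_loss + lft_meta_loss(rnn_opt, loss_fn, D_tr, D_val, theta0, K=K)
    meta_loss.backward()                                 # autodiff realizes the recursive gradient
    meta_optimizer.step()

# usage (illustrative): meta_optimizer = torch.optim.Adam(rnn_opt.parameters(), lr=1e-3)
```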
Theoretical Analysis.
L2O is empirically very capable of meta-learning from optimization trajectories, with better convergence than hand-crafted optimizers. Using the notation from Sec. 2, the L2O objective can be written as:
$$\min_{\phi}\ \sum_{i=1}^{N}\sum_{k=1}^{K} w_k\, f_i\big(\mathcal{D}_i^{\mathrm{tr}};\, \theta_i^{(k)}(\phi)\big),\qquad \text{with } \theta_i^{(k)}(\phi) \text{ given by (3)}, \qquad\qquad (10)$$

i.e., (4) with the fine-tuning set in place of the validation set.
Under standard assumptions, we show the following result, quantifying the difference between L2O and our proposed meta-learning in terms of the size of the meta-learning gradient w.r.t. $\phi$ in Algorithm 1 (Supplement B):
Theorem 1
Remark 1
When the sizes of the fine-tuning set $\mathcal{D}^{\mathrm{tr}}$ and the validation set $\mathcal{D}^{\mathrm{val}}$ are both large enough, the difference between L2O and our proposed meta-learning is small. In this case, our scheme essentially reduces to L2O. From previous work, we know that L2O can meta-learn optimizers with good convergence properties, implying that our LFTs would also converge to similar results by leveraging the RNN structure.
Remark 2
When the data size is small – the few-shot learning regime – there can be a gap between the two frameworks, resulting in a significant difference between the solutions generated by L2O and our scheme, especially when the number of unrolled steps $K$ or the variance of the gradient estimates is large. This potentially explains the significant difference between the empirical performance of L2O and our LFTs in the evaluation over few-shot learning problems.
Table 1: Average attack success rate (ASR), distortion, and number of fine-tuning steps to first reach the target ASR (N/A if not reached or not applicable) for UAPs generated with MAML, L2O, and LFT (victim DNN: LeNet-5 [22]), for each combination of meta-training (rows) and meta-testing (columns) image sources.

| Training | Method | MNIST: ASR | MNIST: dist. | MNIST: steps | CIFAR-10: ASR | CIFAR-10: dist. | CIFAR-10: steps | MNIST+CIFAR-10: ASR | MNIST+CIFAR-10: dist. | MNIST+CIFAR-10: steps |
|---|---|---|---|---|---|---|---|---|---|---|
| MNIST | MAML | 52% | 0.14 | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| MNIST | L2O | 85% | 0.116 | 122 | 0% | 0.05 | N/A | 25% | 0.072 | N/A |
| MNIST | LFT | 100% | 0.104 | 55 | 25% | 0.055 | N/A | 50% | 0.079 | N/A |
| MNIST + CIFAR-10 | L2O | 77% | 0.112 | 125 | 95% | 0.069 | 72 | 92% | 0.096 | 93 |
| MNIST + CIFAR-10 | LFT | 93% | 0.101 | 92 | 100% | 0.063 | 55 | 100% | 0.89 | 68 |
4 Experiment: Generating Universal Attack against Hybrid Image Sources
Recent research demonstrates the lack of robustness of deep neural network (DNN) models against adversarial perturbations/attacks [23, 24, 25, 26] – imperceptible perturbations to input examples (e.g., images) crafted to manipulate the DNN prediction. The problem of universal adversarial perturbation (UAP) seeks a single perturbation pattern that manipulates the outputs of the DNN on multiple examples simultaneously [27]. Often, this universal perturbation is learned with a set of “training” examples and then applied to unseen “test” examples. However, learning perturbations in a few-shot setting with just a few examples, while still being able to successfully attack unseen examples, is very challenging.
Specifically, the attacker aims to fool a well-trained DNN by perturbing input images with the UAP. Let $P(x, c)$ denote the probability predicted by the DNN for input $x$ and class $c$. Given a task-specific data set $\mathcal{D}_i$ (corresponding to a task $\mathcal{T}_i$), the design of the UAP $\delta$ is cast as
$$\min_{\delta}\ \frac{1}{|\mathcal{D}_i|}\sum_{(x,\, y)\, \in\, \mathcal{D}_i} \max\Big\{\log P\big(x + \delta,\ y\big) - \max_{c \neq y} \log P\big(x + \delta,\ c\big),\ 0\Big\}\ +\ \lambda\, \|\delta\|_2^2, \qquad\qquad (12)$$
where $y$ is the true label of $x$, $\lambda$ is a regularization parameter, and the first term of (12) is the C&W attack loss [25], which equals $0$ (indicating a successful attack) when an incorrect class is predicted as the top-1 class. The second term of (12) is an $\ell_2$ regularizer, which penalizes the perturbation strength of $\delta$, measured by its $\ell_2$ norm.
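A hedged sketch of the task-specific loss (12) for a batch of images perturbed by a single UAP; the log-probability form of the C&W loss and the $\ell_2$ penalty follow the reconstruction above, and the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def uap_loss(model, delta, images, labels, lam=0.1):
    """Eq. (12): C&W attack loss for a universal perturbation plus an l2 penalty.
    `model` is the victim DNN; `delta` is the single perturbation shared by all images."""
    log_p = F.log_softmax(model(images + delta), dim=1)
    true_lp = log_p.gather(1, labels.view(-1, 1)).squeeze(1)
    # largest log-probability among the incorrect classes
    wrong_lp = log_p.scatter(1, labels.view(-1, 1), float('-inf')).max(dim=1).values
    cw = torch.clamp(true_lp - wrong_lp, min=0.0)     # 0 once the top-1 class is wrong
    return cw.mean() + lam * delta.norm(p=2) ** 2
```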
Meta-learning for UAP generation.
A direct solution to problem (12) only ensures the attack power of the UAP $\delta$ against the given data set $\mathcal{D}_i$ of the specific task $\mathcal{T}_i$. Like any learning problem, the set $\mathcal{D}_i$ needs to be large for the UAP to successfully attack unseen examples. In the few-shot setting (small $|\mathcal{D}_i|$), we want to leverage meta-learning to facilitate better generalization of the learned UAP. The attack loss (12) is taken as the task-specific loss $f_i$ in (4), with $\delta$ as the optimizee. With multiple few-shot tasks $\mathcal{T}_i$ and corresponding data sets $\mathcal{D}_i$, we meta-learn the fine-tuner $\mathrm{RNN}_{\phi}$ to generate UAPs for new few-shot UAP tasks. For comparison, we also consider (i) MAML, to meta-learn an initial UAP for each task, and (ii) L2O, to meta-learn $\mathrm{RNN}_{\phi}$ for just solving (12). Experiments are conducted over two types of tasks:
(a) Congruous tasks: Tasks are drawn from the same dataset (MNIST). In this setting, the applicable methods include LFT, MAML, and L2O. We use MAML here since we can have a UAP parameter that is shared across all tasks.
(b) Incongruous tasks: Tasks are drawn from a union of different image sources (in this case MNIST & CIFAR-10). Unlike (a), it is not possible to share UAP parameters across all tasks from different image sets. Hence MAML is not applicable.
Experimental setting.
We meta-learn LFT, MAML and L2O with 1000 few-shot UAP tasks $\{\mathcal{T}_i\}$. In LFT and MAML, the fine-tuning set $\mathcal{D}_i^{\mathrm{tr}}$ and the meta-update set $\mathcal{D}_i^{\mathrm{val}}$ are drawn from the training dataset and the test dataset of an image source, respectively; depending on the setting, both MNIST and CIFAR-10 are used across tasks. In both $\mathcal{D}_i^{\mathrm{tr}}$ and $\mathcal{D}_i^{\mathrm{val}}$, image classes and samples per class are randomly selected. In L2O, $\mathcal{D}_i^{\mathrm{tr}}$ is combined with $\mathcal{D}_i^{\mathrm{val}}$; there is no meta-validation involved in the meta-learning. We evaluate the performance of the meta-learning schemes over random unseen few-shot UAP tasks (the data for each test task is generated in the manner described above). Moreover, both LFT and L2O are fine-tuned from random initializations over test tasks; MAML, when applicable, starts fine-tuning from the meta-learned initialization. We refer readers to Supplement C for more details.
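For illustration, a sketch (with assumed data structures, not the authors' code) of how one few-shot UAP task could be assembled: the same randomly chosen classes are sampled from the training split to form the fine-tuning set and from the test split to form the meta-update set.

```python
import random
import torch

def sample_uap_task(train_by_class, test_by_class, n_classes=5, n_shots=5):
    """Build one few-shot UAP task: D_tr from the training split, D_val from the
    test split, both restricted to the same randomly chosen classes.
    `train_by_class` / `test_by_class`: dict mapping class id -> tensor of images."""
    classes = random.sample(sorted(train_by_class), n_classes)

    def draw(pool):
        xs, ys = [], []
        for c in classes:
            idx = torch.randperm(pool[c].shape[0])[:n_shots]
            xs.append(pool[c][idx])
            ys.append(torch.full((n_shots,), c, dtype=torch.long))
        return torch.cat(xs), torch.cat(ys)

    return draw(train_by_class), draw(test_by_class)   # (D_tr, D_val)
```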
Figure 2: Average ASR of the generated UAP versus the number of fine-tuning steps, for each meta-training source (rows: MNIST, i.e., congruous tasks; MNIST + CIFAR-10, i.e., incongruous tasks) and each meta-testing source (columns: MNIST, CIFAR-10, MNIST + CIFAR-10).
Overall performance.
In Table 1, we present the superior performance of LFT in tackling attack tasks involving different image sets (such as MNIST & CIFAR-10). Specifically, we report the average attack success rate (ASR), the norm of the distortion, and the number of fine-tuning steps required for the generated UAP to first reach the target ASR (within the allotted number of steps), using the different meta-learners (LFT, MAML, L2O), where the victim DNN is given by the LeNet-5 model [22]. Compared to MAML, LFT meta-learns the high-level “how” to generate universal attacks in a few-shot setting without being restricted to a single image set (namely, allowing a mismatch of image source between meta-training and meta-testing). Compared to L2O, LFT is markedly more effective at generating few-shot attacks on unseen image sets.
Detailed results.
In Figure 2, we present the average ASR of the UAP over test tasks versus the number of fine-tuning steps. We report the ASR for every combination of training and evaluation settings, denoted by a pair of datasets. For example, (MNIST, CIFAR-10) means that meta-learning is performed with MNIST and the learned fine-tuner is then used to generate UAPs against CIFAR-10 at (meta-)testing. We use MNIST + CIFAR-10 to represent the union of MNIST and CIFAR-10 tasks for meta-learning (or meta-testing). The results on the distortion strength of the UAP are shown in Figure A1 in the supplement. Briefly, we find that LFT yields a UAP generator with faster adaptation, higher ASR, and lower attack distortion than MAML and L2O; the details are given below.
LFT significantly outperforms MAML and L2O when meta-learning and meta-testing with congruous UAP tasks (MNIST, MNIST). As shown in Figures 2 & A1 (and Table 1), the significance lies in three aspects: (i) the fewest fine-tuning steps are required to attack new tasks with high ASR; (ii) the highest ASR is achieved at any given number of fine-tuning steps; and (iii) the lowest perturbation strength is needed to achieve the strongest attacking power. We also note that L2O yields better performance (higher ASR and lower distortion) than MAML. This indicates that meta-learning the optimizer ($\phi$) can offer better generalization than meta-learning the optimizee ($\delta$). In these few-shot UAP tasks, the “how to learn” seems more useful than the “where to learn from”.
On the other hand, LFT outperforms L2O in the standard transfer attack setting, corresponding to the scenario (MNIST, CIFAR-10), where the UAP generator is learned over MNIST but tested over CIFAR-10. Note that MAML cannot be applied to this scenario, since the meta-learned UAP initialization does not have the same dimension as the test data. Compared to the congruous setting (MNIST, MNIST), the ASR decreases substantially for both LFT and L2O (see Table 1). However, LFT adapts better to the unseen tasks. LFT also outperforms L2O in cases that involve incongruous tasks drawn from MNIST + CIFAR-10. One interesting observation is that the use of hybrid data sources (incongruous tasks) during meta-training enables the learned fine-tuner to generate UAPs with faster adaptation on unseen images; compare rows 1 & 2 of Figure 2. In Supplement E, we present additional results, a comparison to white-box attacks, and visualizations of the UAP patterns.
Table 2: Classification accuracy over randomly selected 2-way 5-shot test tasks for models adapted with MAML, L2O, and LFT, for each combination of meta-training (rows) and meta-testing (columns) task sets; N/A marks settings where MAML is not applicable.

| Training | Method | (MNIST, MLP) | (CIFAR-10, CNN) | (MNIST, MLP) + (CIFAR-10, CNN) |
|---|---|---|---|---|
| (CIFAR-10, CNN) | MAML | N/A | 66% ± 0.8% | N/A |
| (CIFAR-10, CNN) | L2O | 27% ± 2.1% | 63% ± 1.4% | 44% ± 1.4% |
| (CIFAR-10, CNN) | LFT | 35% ± 1.9% | 66% ± 1.3% | 47% ± 1.5% |
| (MNIST, MLP) + (CIFAR-10, CNN) | MAML | N/A | N/A | N/A |
| (MNIST, MLP) + (CIFAR-10, CNN) | L2O | 81% ± 1.0% | 53% ± 1.2% | 68% ± 1.1% |
| (MNIST, MLP) + (CIFAR-10, CNN) | LFT | 83% ± 0.9% | 57% ± 1.3% | 72% ± 1.2% |
5 Experiments in Few-Shot Classification and Regression
Application to Image Classification Using Hybrid DNN Models.
In this experiment, we consider learning DNN-based image classifiers over 2-way 5-shot learning tasks. These tasks are drawn from two image sources, MNIST & CIFAR-10. We specify the classifier to be trained as a 3-layer multilayer perceptron (MLP) for MNIST data and a convolutional neural network (CNN) with four CONV layers for CIFAR-10 data. Thus, the task-specific optimizee $\theta_i$ in (4) corresponds to the DNN parameters for a given task. The incongruous tasks are then given by learning DNNs with different architectures (MLP and CNN). We also consider the congruous setting where only one model architecture is adopted across tasks.
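For concreteness, a sketch of two architectures of the kind described above (layer widths are assumptions, not the paper's exact configuration); the only point is that the two optimizees share no parameters, so a common initialization cannot be meta-learned, while a coordinate-wise fine-tuner can serve both.

```python
import torch.nn as nn

mlp_mnist = nn.Sequential(                     # 3-layer MLP for 28x28 grayscale MNIST
    nn.Flatten(),
    nn.Linear(28 * 28, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 2),                          # 2-way few-shot classification head
)

cnn_cifar = nn.Sequential(                     # CNN with four CONV layers for 32x32 CIFAR-10
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 2),                  # 2-way few-shot classification head
)
```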
In Table 2, we evaluate the classification accuracy of the model parameters acquired from LFT, MAML and L2O over randomly selected 2-way 5-shot test tasks. Here, tasks from different data sources (MNIST or CIFAR-10) correspond to different model architectures (MLP or CNN). Our proposed LFT is able to match MAML for congruous tasks (meta-learning with tasks from (CIFAR-10, CNN) and meta-testing with tasks from the same set). MAML is not applicable to the incongruous tasks (at meta-learning or meta-testing). L2O and LFT are applicable, and LFT achieves higher accuracy throughout. Figure 3 shows that LFT can outperform L2O by an even larger margin when considering a smaller number of fine-tuning steps. More results on other few-shot tasks are presented in Supplement F.
Application to Supervised Regression.
In the next experiment, we revisit the sine-wave regression problem presented in MAML [2, Sec. 5.1]. Here, multiple few-shot regression tasks are generated by randomly varying the amplitude and the phase of a sinusoid. The goal is to learn a regression model that adapts quickly from observed few-shot regression tasks to unobserved regression tasks. In our experiments, we choose an MLP model with 2 hidden layers as the regressor to be learned. In LFT, the regressor’s parameters are regarded as the optimizee variable, and $\mathrm{RNN}_{\phi}$ is learned so as to obtain a regressor with small prediction error, in terms of mean squared error (MSE), after only a few fine-tuning steps at test time, even starting from a random initialization. Please refer to Supplement G for more details on the setup.
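A minimal sketch of the task generator, following the ranges reported in [2, Sec. 5.1] (amplitude in [0.1, 5.0], phase in [0, π], inputs in [-5, 5]); the number of shots is illustrative.

```python
import numpy as np

def sample_sine_task(amp_range=(0.1, 5.0), phase_range=(0.0, np.pi), n_shots=10):
    """One few-shot sine regression task: a random amplitude and phase define the
    target function, and a few (x, y) pairs form the task's training data."""
    amp = np.random.uniform(*amp_range)
    phase = np.random.uniform(*phase_range)
    x = np.random.uniform(-5.0, 5.0, size=(n_shots, 1))
    y = amp * np.sin(x + phase)
    return x.astype(np.float32), y.astype(np.float32)
```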
Figure 4 presents the few-shot test MSE versus the number of fine-tuning iterations. In this example, we consider both FO and ZO variants of the meta-learners, based on gradients and gradient estimates, respectively. As we can see, LFT outperforms L2O and MAML, and ZO-LFT can be as effective as FO-LFT. We highlight that MAML uses the learned meta-initialization, whereas LFT fine-tunes from a random initialization. This suggests that learning the optimizer/meta-fine-tuner (namely, $\mathrm{RNN}_{\phi}$) provides better generalization than learning the meta-initialization of the optimizee variable.
6 Related work
MAML has been extremely useful in supervised and reinforcement learning (RL), and has been “reframed as a graphical model inference problem” that allows for modeling uncertainty [28]. Grant et al. [29] provide a closely related but distinct interpretation of MAML as “inference for the parameters for a prior distribution in a hierarchical Bayesian model”. The higher-order derivatives through the fine-tuning trajectory, and the consequent vanishing gradients during meta-learning, are addressed with “implicit gradients” that only depend on the final fine-tuned result [30]. Specific to RL, various enhancements obviate the second-order derivatives of the RL reward function, such as variance-reduced policy gradients [31] and Monte Carlo zeroth-order Evolution Strategies gradients [32].
MAML requires access to the computationally expensive second-order information of the loss. Fallah et al. [33] study FO-MAML, which ignores the second-order term, and show that if the fine-tuning learning rate is small or the tasks are statistically “close” to each other, the first-order approximation induces negligible error. They also propose HF-MAML, which recovers the guarantees for MAML while avoiding the Hessian computation. Yao et al. [7] propose a hierarchically structured meta-learning (HSML) algorithm that explicitly tailors the transferable knowledge to different task clusters. The core idea is to perform cluster-specific meta-learning, resulting in tighter generalization bounds. Ji et al. [34] present a Hessian-free MAML with multiple fine-tuning steps.
Learned optimizers have long been considered in the context of training neural networks [35, 36, 37]. More recent work has posed optimization with gradients as a reinforcement learning problem [8] or as learning a recurrent neural network (RNN) [9], instead of leveraging the usual hand-crafted optimizers (such as SGD, RMSProp [12], Adam [13]). The RNN-based optimizers have been improved [10, 11] by (i) utilizing hierarchical RNNs that capture the parameter structure in the optimization of DL models, (ii) using hand-crafted-optimizer-inspired inputs to the RNN (such as momentum), and (iii) using a diverse set of optimization objectives (with different hardness levels) to train the RNN. Learned optimizers have also been successful with particle swarm optimization [16] and zeroth-order gradient estimates [15]. However, at this point, there are no theoretical guarantees for learned optimizers.
7 Conclusion
MAML meta-learns an effective initialization of model parameters for few-shot learning. In this paper, we generalize MAML to incongruous few-shot learning by replacing the hand-designed optimizer in the inner fine-tuning loop of MAML with a learned fine-tuner (LFT) in the form of a recurrent neural network (RNN). We show that LFTs can be meta-learned across incongruous tasks and then applied to new few-shot problems. We also theoretically quantify the difference between our proposed meta-learning scheme and L2O, highlighting why our proposed scheme would outperform L2O for incongruous few-shot learning. Empirically, we consider a novel application of meta-learning to generate universal adversarial perturbations and show the superior performance of our LFTs over state-of-the-art meta-learners. We also show that our LFTs outperform existing meta-learning schemes (when applicable) for incongruous few-shot classification and regression.
Supplementary Material
Appendix A Gradients of the MAML loss with respect to the RNN parameters
Based on the meta-objective (5) and the update rule (3), we obtain
(S1) |
For ease of presentation, we use $\theta^{(k)}$ to represent $\theta^{(k)}(\phi)$, and the RNN state output $h^{(k)}$ of (3) is omitted when its meaning can clearly be inferred from the context. We then have
(S2) | ||||
(S3) | ||||
(S4) | ||||
(S5) |
where the equality holds by the chain rule [21], $\circ$ denotes a matrix product that the chain rule obeys, and the term (S5) denotes the derivative w.r.t. $\phi$ obtained by treating $\theta^{(k-1)}$ and $h^{(k-1)}$ as constants.
(S6) |
Next, we simplify the term (S4). Note that $\theta^{(k-1)}$ (and hence the RNN input) depends on $\phi$. So we write
(S7) |
Substituting (S6) and (S7) into (S2), we can then express (S1) as:
(S8) |
where $\frac{d\theta^{(k)}}{d\phi}$ is determined by the recursion (S7).
It is clear from (S7) and (S8) that a second-order derivative would at most be involved, due to the presence of $\nabla_{\theta}^{2} f$, if the RNN input is specified by the first-order derivative of $f$ w.r.t. $\theta$. By contrast, if it is specified by the ZO gradient estimate, then only first-order derivatives are involved in (S7) and (S8). Lastly, we remark that the recursive forms of (S7) and (S8) facilitate our computation, and that $\frac{d\theta^{(0)}}{d\phi} = \mathbf{0}$ and $\frac{d h^{(0)}}{d\phi} = \mathbf{0}$.
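As a sanity check on this derivation, the meta-gradient obtained by backpropagating through the unrolled trajectory (which is what (S7)–(S8) compute) can be compared against a finite-difference estimate on a toy problem. The sketch below reuses the `CoordinatewiseRNNOptimizer` and `lft_meta_loss` sketches from the main text and uses an assumed quadratic loss; it is illustrative only.

```python
import torch

torch.manual_seed(0)
rnn_opt = CoordinatewiseRNNOptimizer(hidden_size=8)
A = torch.randn(4, 4)
A = A @ A.t() + torch.eye(4)                             # toy strongly convex quadratic
loss_fn = lambda theta, _data=None: 0.5 * theta @ A @ theta

theta0 = torch.randn(4, requires_grad=True)
lft_meta_loss(rnn_opt, loss_fn, None, None, theta0, K=3).backward()

phi = next(rnn_opt.parameters())                         # check one RNN parameter tensor
analytic = phi.grad.view(-1)[0].item()                   # backprop through the unrolled steps

eps = 1e-3
with torch.no_grad():
    phi.view(-1)[0] += eps
plus = lft_meta_loss(rnn_opt, loss_fn, None, None,
                     theta0.detach().requires_grad_(True), K=3).item()
with torch.no_grad():
    phi.view(-1)[0] -= 2 * eps
minus = lft_meta_loss(rnn_opt, loss_fn, None, None,
                      theta0.detach().requires_grad_(True), K=3).item()
print(analytic, (plus - minus) / (2 * eps))              # should approximately agree
```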
Appendix B Proof of Theorem 1
Before showing the theoretical results, we first state the following blanket assumptions.
B.1 Assumptions
In practice, the sizes of the data and the variables are finite, and the loss function is bounded. To proceed, we make the following standard assumptions for quantifying the gradient difference between L2O and LFT.
A1. We assume that the stochastic gradient estimate is unbiased, i.e.,
$$\mathbb{E}_{\xi_i}\big[\widehat{\nabla} f_i(\theta;\, \xi_i)\big]\ =\ \nabla f_i(\theta), \qquad\qquad (S9)$$
where $\xi_i$ denotes a training/validation data sample of the $i$-th task, and $\mathbb{E}_{\xi_i}$ stands for the expectation over $\xi_i$.
A2. We assume that the gradient estimate has bounded variance over both the fine-tuning and validation data, i.e.,
$$\mathbb{E}_{\xi_i}\Big[\big\|\widehat{\nabla} f_i(\theta;\, \xi_i) - \nabla f_i(\theta)\big\|_2^2\Big]\ \le\ \sigma^2. \qquad\qquad (S10)$$
The same assumption also applies to the following:
(S11) | |||
(S12) |
A3. We assume that the size of the gradient is uniformly upper bounded, i.e., $\|\nabla f_i(\theta)\|_2 \le G$ for all $\theta$ and all $i$.
Proof. Assume that A1–A3 hold. From the definitions of the LFT objective (5) and the L2O objective (10), we have
(S13) | ||||
(S14) | ||||
(S15) | ||||
(S16) | ||||
(S17) | ||||
(S18) | ||||
(S19) |
where in (a) we use Jensen’s inequality, in (b) we apply the triangle inequality, in (c) we use the chain rule, and (d) holds because
(S20) | ||||
(S21) | ||||
(S22) |
where in (a) we use the Cauchy–Schwarz inequality and in (b) we use Jensen’s inequality; similarly, we have