No-regret Non-convex Online Meta-Learning
Abstract
The online meta-learning framework is designed for the continual lifelong learning setting. It bridges two fields: meta-learning, which tries to extract prior knowledge from past tasks for fast learning of future tasks, and online learning, which deals with the sequential setting where problems are revealed one by one. In this paper, we generalize the original framework from the convex to the non-convex setting, and introduce the local regret as the alternative performance measure. We then apply this framework to stochastic settings, and show theoretically that it enjoys a logarithmic local regret and is robust to any hyperparameter initialization. An empirical test on a real-world task demonstrates its superiority compared with traditional methods.
Zhenxun Zhuang, Yunlong Wang, Kezi Yu, Songtao Lu
Advanced Analytics, IQVIA, Plymouth Meeting, PA 19462
IBM Research AI, IBM Thomas J. Watson Research Center, NY 10598
Keywords
Meta-learning, online learning, non-convex optimization
1 Introduction
In recent years, high-capacity machine learning models, such as deep neural networks [1], have achieved remarkable successes in various domains [2, 3, 4]. However, domains where data is scarce remain a big challenge, as those models' ability to learn and generalize relies heavily on the abundance of training data. In contrast, humans can learn new skills and concepts very efficiently from just a few experiences. This is because, when encountering a new task, learning algorithms start completely from scratch, while humans are typically armed with plenty of prior knowledge accumulated from past experience, which may share overlapping structures with the current task and can thus enable efficient learning of the new task.
Meta-learning [5, 6, 7] was designed to mimic this human ability. A meta-learning algorithm is first given a set of meta-training tasks assumed to be drawn from some distribution, and attempts to extract prior knowledge applicable to all tasks in the form of a meta-learner. This meta-learner is then evaluated on an unseen task, usually assumed to be drawn from a distribution similar to the one used for training. Recent years have seen a surge of interest in this field, resulting in numerous achievements, among which a seminal work is the gradient-based algorithm MAML [8]. Due to its simplicity, efficiency, and generality, it has initiated a fruitful line of research [9, 10, 11]. However, like other meta-learning algorithms, it assumes all meta-training tasks are available together as a batch, which doesn't capture the sequential setting of continual lifelong learning in which new tasks are revealed one after another.
Meanwhile, online learning [12] specifically tackles the sequential setting. At each round $t$, one picks a predictor $\mathbf{x}_t$ and suffers a loss $f_t(\mathbf{x}_t)$ revealed by a potentially adversarial environment. The goal is to minimize the regret, the difference between the cumulative loss suffered by the algorithm and that of the best fixed predictor in hindsight, formally:
$$R_T = \sum_{t=1}^{T} f_t(\mathbf{x}_t) - \min_{\mathbf{x}} \sum_{t=1}^{T} f_t(\mathbf{x}). \tag{1}$$
Yet, online learning sees the whole process as a single task without adaptation for each single step.
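To make the regret concrete, the following is a minimal sketch rather than anything taken from the original work. It assumes hypothetical scalar quadratic losses $f_t(x) = \frac{1}{2}(x - c_t)^2$ and an online gradient descent learner with stepsize $1/t$; the names `regret`, `targets`, and `best_fixed` are illustrative only.

```python
import random

def regret(losses, predictions, comparator):
    """Cumulative loss of the algorithm minus that of one fixed comparator."""
    algo = sum(f(x) for f, x in zip(losses, predictions))
    fixed = sum(f(comparator) for f in losses)
    return algo - fixed

# Hypothetical losses f_t(x) = 0.5 * (x - c_t)^2 with random targets c_t.
rng = random.Random(0)
targets = [rng.gauss(0.0, 1.0) for _ in range(200)]
losses = [lambda x, c=c: 0.5 * (x - c) ** 2 for c in targets]

# Online gradient descent with stepsize 1/t (an illustrative choice).
x, predictions = 0.0, []
for t, c in enumerate(targets, start=1):
    predictions.append(x)        # commit to x_t before the loss is revealed
    x -= (x - c) / t             # the gradient of f_t at x is (x - c)

# For quadratic losses the best fixed predictor in hindsight is the mean target.
best_fixed = sum(targets) / len(targets)
R_T = regret(losses, predictions, best_fixed)
```

Because `best_fixed` minimizes the comparator's cumulative loss, `R_T` is the largest regret over all fixed comparators, which is the quantity a no-regret algorithm must keep sublinear in the number of rounds.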
Neither paradigm alone is ideal for the continual lifelong learning scenario; thus, Finn et al. [13] proposed to combine them to construct the Online Meta-Learning framework, which will be discussed in Section 2. However, this framework relies on a strong convexity assumption, while many problems of current interest are non-convex in nature. Thus, in Section 3, we generalize this framework to the non-convex setting. Section 4 presents an instantiation of our algorithm with rigorous theoretical proofs of its performance guarantee. Experimental results on real data are shown in Section 5. Finally, concluding remarks and takeaways are provided in Section 6. To the best of our knowledge, this is the first theoretical regret analysis for non-convex online meta-learning algorithms, shedding light on applying online meta-learning to more challenging learning problems in the paradigm of deep neural networks.
Notation. We use bold letters to denote vectors, e.g., $\mathbf{x}$. The $i$th coordinate of a vector $\mathbf{x}$ is $x_i$. Unless explicitly noted, we study the Euclidean space $\mathbb{R}^d$ with the inner product $\langle \cdot, \cdot \rangle$ and the Euclidean norm $\|\cdot\|$. We assume everywhere that our objective function $f$ is bounded from below and denote its infimum by $f^\star$. The gradient of a function $f$ at $\mathbf{x}$ is $\nabla f(\mathbf{x})$. $\mathbb{E}_X[\cdot]$ means the expectation w.r.t. the underlying probability distribution of a random variable $X$.
2 Background
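Algorithm 1 is the online meta-learning framework proposed in [13]. A meta-learner $\mathbf{w}_t$ is maintained to preserve the prior knowledge learned from past rounds. For each new task $t$, one is first given some training data $\mathcal{D}_t^{tr}$ for adapting $\mathbf{w}_t$ to the current task following some strategy $U_t$. Then the test data $\mathcal{D}_t^{ts}$ is revealed for evaluating the performance of the adapted learner $\tilde{\mathbf{w}}_t = U_t(\mathbf{w}_t)$. The loss $f_t(\tilde{\mathbf{w}}_t)$ suffered at this round can then be fed into an online learning algorithm to update $\mathbf{w}_t$. We use $U_t(\mathbf{w}) = \mathbf{w} - \alpha \nabla \hat{f}_t(\mathbf{w})$ following [13], where $\hat{f}_t$ is the loss on the training data and $\alpha$ is the stepsize.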
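The loop described above can be sketched as follows. This is an illustration under stated assumptions and is not the paper's Algorithm 1. Tasks are hypothetical scalar quadratics, the adaptation strategy is taken to be one gradient step on the training loss, and the online learner is replaced by a plain gradient step on the adapted test loss. All names and constants (`adapt`, `meta_gradient`, `shared_optimum`, the noise levels) are invented for the example.

```python
import random

def adapt(w, c_train, alpha):
    """One-step adaptation: w - alpha * gradient of the training loss 0.5 * (w - c_train)^2."""
    return w - alpha * (w - c_train)

def meta_gradient(w, c_train, c_test, alpha):
    """Derivative of w -> f_t(adapt(w)) with f_t(u) = 0.5 * (u - c_test)^2, by the chain rule."""
    return (1.0 - alpha) * (adapt(w, c_train, alpha) - c_test)

rng = random.Random(0)
alpha, outer_stepsize = 0.1, 0.05        # illustrative values
w, shared_optimum = 0.0, 3.0             # meta-learner; structure common to all toy tasks
test_losses = []
for t in range(300):
    c_test = shared_optimum + rng.gauss(0.0, 0.3)      # optimum of task t
    c_train = c_test + rng.gauss(0.0, 0.1)             # noisy view from few training samples
    adapted = adapt(w, c_train, alpha)                 # adapt the meta-learner to task t
    test_losses.append(0.5 * (adapted - c_test) ** 2)  # loss suffered on the test data
    w -= outer_stepsize * meta_gradient(w, c_train, c_test, alpha)  # update the meta-learner
```

In this toy setting the meta-learner drifts toward the optimum shared by the tasks, so the loss after adaptation shrinks as more tasks are seen.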
As tasks can be very different, the original regret in Equation (1) of competing with a fixed learner across all tasks becomes less meaningful. Thus, Finn et al. [13] changed it to:
$$R_T = \sum_{t=1}^{T} f_t\big(U_t(\mathbf{w}_t)\big) - \min_{\mathbf{w}} \sum_{t=1}^{T} f_t\big(U_t(\mathbf{w})\big),$$
which competes with any fixed meta-learner. Under this definition, they designed the Follow the Meta Leader algorithm, which enjoys a logarithmic regret when assuming strong convexity of the losses.
3 Problem Formulation
In this section, we generalize the online meta-learning algorithm to the non-convex setting by first demonstrating the infeasibility of a regret of the form (1) and then introducing an alternative performance measure.
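Finding the global minimum of a non-convex function is in general known to be NP-hard. Yet, if we could find an online learning algorithm with an $o(T)$ regret for some non-convex function class, we could optimize any function of that class efficiently: simply run the online learning algorithm with the objective $f$ as the loss at each round, and choose one of the iterates $\mathbf{x}_1, \dots, \mathbf{x}_T$ uniformly at random as the output $\bar{\mathbf{x}}$. This gives us:
$$\mathbb{E}[f(\bar{\mathbf{x}})] - \min_{\mathbf{x}} f(\mathbf{x}) = \frac{1}{T}\sum_{t=1}^{T} f(\mathbf{x}_t) - \min_{\mathbf{x}} f(\mathbf{x}) = \frac{R_T}{T} \to 0 \quad \text{as } T \to \infty,$$
which leads to a contradiction unless P=NP. Thus, we have to find another performance measure for the non-convex case. One potential candidate is the local regret proposed by Hazan et al. [14]: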
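$$\mathfrak{R}_w(T) = \sum_{t=1}^{T} \left\| \nabla F_{t,w}(\mathbf{x}_t) \right\|^2, \tag{2}$$
where $F_{t,w}(\mathbf{x}) = \frac{1}{w}\sum_{i=0}^{w-1} f_{t-i}(\mathbf{x})$, $1 \le w \le T$, and $f_t \equiv 0$ for $t \le 0$. The reason for using a sliding window in $F_{t,w}$, especially a large window, can be justified by Theorem 2.7 in [14].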
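As an illustration, the local regret can be computed from a sequence of iterates as sketched below. The sketch assumes the sliding-window form of Hazan et al., in which the gradient at round $t$ is averaged over the last $w$ losses and losses before the first round are identically zero. The function name `local_regret` and the toy losses are hypothetical.

```python
def local_regret(grad_fns, iterates, window):
    """Sum over rounds of the squared norm of the window-averaged gradient.

    grad_fns[t] maps a point (list of floats) to the gradient of the round-t loss.
    Losses before the first round count as zero, so the average always divides
    by the full window size.
    """
    total = 0.0
    for t, x in enumerate(iterates):
        recent = grad_fns[max(0, t - window + 1): t + 1]
        grads = [g(x) for g in recent]
        avg = [sum(column) / window for column in zip(*grads)]
        total += sum(v * v for v in avg)
    return total

# Toy losses f_t(x) = 0.5 * ||x - c_t||^2, whose gradient at x is x - c_t.
centers = [[1.0], [-1.0]]
grad_fns = [lambda x, c=c: [xi - ci for xi, ci in zip(x, c)] for c in centers]
iterates = [[0.0], [0.0]]

no_smoothing = local_regret(grad_fns, iterates, window=1)
smoothed = local_regret(grad_fns, iterates, window=2)
```

In this example the two losses pull in opposite directions, so averaging over a window of two cancels their gradients in the second round and the smoothed local regret is much smaller than the unsmoothed one.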
4 Algorithm & Theoretical Guarantees
4.1 Stochasticity of Online Metalearning Algorithms
In practice, the test data revealed at each round is typically just a random sample batch of the whole test set; the losses and gradients obtained at each round are thus (unbiased) estimates of the true ones. This is the stochastic setting, which we formalize by making the following assumptions.
Assumption 1.
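We assume that at each round $t$, each call to any stochastic gradient oracle yields an i.i.d. random vector with the following properties:
(a) Unbiasedness: its conditional expectation equals the true gradient it estimates;
(b) Bounded variance: its conditional variance is bounded by $\sigma^2$;
(c) Mutual independence: the random vectors returned by different oracles are mutually independent,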
where , and denotes the conditional expectation of with respect to . Also note that for .
Hazan et al. proposed a time-smoothed online gradient descent algorithm [14] for such a case. Yet, that algorithm's performance critically relies on the choice of the stepsize $\eta$, and it may even diverge if $\eta > 2/L$, where $L$ is the (often unknown) smoothness constant of the loss function. We thus propose to use the AdaGrad-Norm [15] algorithm (Algorithm 2) as the online learning algorithm in Algorithm 1 instead. Here, $b_0$ is the initialization of the accumulated squared norms and prevents division by 0, while $\eta$ is to ensure homogeneity and that the units match.
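The AdaGrad-Norm update can be sketched as below. Algorithm 2 is not reproduced in this text, so the sketch assumes the usual form of the method: accumulate the squared gradient norms into $b^2$, then step with the scalar stepsize $\eta / b$. The toy quadratic, its curvature of 50, and all function names are illustrative assumptions.

```python
import math

def adagrad_norm(grad_fn, w0, eta=1.0, b0=0.1, steps=200):
    """AdaGrad-Norm sketch: one scalar stepsize eta / b shared by all coordinates."""
    w, b_sq = list(w0), b0 ** 2
    for _ in range(steps):
        g = grad_fn(w)
        b_sq += sum(gi * gi for gi in g)      # accumulate squared gradient norms
        stepsize = eta / math.sqrt(b_sq)      # b0 > 0 keeps the denominator positive
        w = [wi - stepsize * gi for wi, gi in zip(w, g)]
    return w

def gradient_descent(grad_fn, w0, eta=1.0, steps=20):
    """Plain gradient descent with a fixed stepsize, for comparison."""
    w = list(w0)
    for _ in range(steps):
        w = [wi - eta * gi for wi, gi in zip(w, grad_fn(w))]
    return w

def quadratic_grad(w, curvature=50.0):
    """Gradient of the toy objective 0.5 * curvature * ||w||^2."""
    return [curvature * wi for wi in w]

w_adaptive = adagrad_norm(quadratic_grad, [1.0, -1.0])
w_fixed = gradient_descent(quadratic_grad, [1.0, -1.0])
```

On this toy problem the fixed stepsize of 1 exceeds 2 divided by the curvature, so plain gradient descent blows up, while the accumulated norm quickly shrinks the AdaGrad-Norm stepsize enough for the iterates to contract toward the minimizer.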
4.2 Regret Analysis
We present below an analysis of this algorithm assuming the loss function satisfies:
Assumption 2.
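The loss $f$ is twice differentiable and, $\forall \mathbf{x}, \mathbf{y}$:
(a) $G$-Lipschitz: $|f(\mathbf{x}) - f(\mathbf{y})| \le G\|\mathbf{x} - \mathbf{y}\|$.
(b) $L$-smooth: $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \le L\|\mathbf{x} - \mathbf{y}\|$. Note that this implies [16, Lemma 1.2.3]:
$$\left| f(\mathbf{y}) - f(\mathbf{x}) - \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle \right| \le \frac{L}{2}\|\mathbf{y} - \mathbf{x}\|^2. \tag{3}$$
(c) $\rho$-Hessian-Lipschitz: $\|\nabla^2 f(\mathbf{x}) - \nabla^2 f(\mathbf{y})\| \le \rho\|\mathbf{x} - \mathbf{y}\|$.
(d) $M$-Bounded: $|f(\mathbf{x})| \le M$.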
Under Assumption 2 of , we can derive the following properties of (the proof can be found in the Appendix):
Lemma 1.
Assuming Assumption 2 holds, is Bounded, Lipschitz, and smooth.
The following theorem shows that by selecting , a logarithmic regret of the algorithm is guaranteed w.r.t. any .
Theorem 1.
Before showing the proof of Theorem 1, we need the following technical lemmas whose proofs can be found in the Appendix. For simplicity, we denote as condition on and take expectation w.r.t. :
Lemma 2.
As , and , Assumption 1 gives us:
Lemma 3.
Given Assumption 2(d), we have: .
Lemma 4 ([17], Lemma 9).
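Let $f: [0, +\infty) \to [0, +\infty)$ be a nonincreasing function, and $a_i \ge 0$ for $i = 0, \dots, T$. Then
$$\sum_{t=1}^{T} a_t f\Big(a_0 + \sum_{i=1}^{t} a_i\Big) \le \int_{a_0}^{\sum_{t=0}^{T} a_t} f(x)\,dx.$$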
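A quick numerical sanity check of this kind of lemma is sketched below. It assumes the lemma takes the standard sum-versus-integral form, in which $\sum_{t=1}^{T} a_t f(a_0 + \sum_{i=1}^{t} a_i)$ is at most the integral of $f$ from $a_0$ to $\sum_{t=0}^{T} a_t$. The sequence and the choice $f(x) = 1/x$, whose antiderivative is the logarithm, are illustrative.

```python
import math

def lemma4_sides(a, f, antiderivative):
    """Return (sum side, integral side) for a = [a_0, a_1, ..., a_T]."""
    partial, lhs = a[0], 0.0
    for a_t in a[1:]:
        partial += a_t                 # a_0 + a_1 + ... + a_t
        lhs += a_t * f(partial)
    rhs = antiderivative(partial) - antiderivative(a[0])
    return lhs, rhs

# With f(x) = 1/x the integral side becomes a logarithm of the accumulated sum.
lhs, rhs = lemma4_sides([1.0, 0.5, 2.0, 0.0, 3.0], lambda x: 1.0 / x, math.log)
```

The choice $f(x) = 1/x$ is the one that turns a sum of ratios into a logarithm, which is the typical source of a logarithmic bound in adaptive-stepsize analyses.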
Proof of Theorem 1.
The proof follows that of Theorem 2.1 in [15].
First, as the average of smooth functions, is also smooth. Using the property in Assumption 2(b) and the update formula (Line 5) in Algorithm 2 we have:
Denote , and take expectation w.r.t. conditioned on (namely ) :
(4)  
(5)  
(6) 
Second, from the definition of and we have:
Using this, and Jensen’s inequality on which is a convex function, we can upperbound Equation (5) by its absolute value which in turn can be upperbounded by:
(7)  
(8) 
Third, by using inequality with , , Equation (7) can be upper bounded by:
where we used that holds for .
Applying again but with , , we can upper bound eq. (8) by:
Fourth, substituting the above two inequalities back, and in turn substituting the result into Equation (5), gives us:
Rearrange terms, then for both sides, take expectation w.r.t. and sum from to :
(9)  
(10) 
As , letting be in Lemma 4 gives us:
where we used Jensen’s inequality for which is a concave function in .
Since each is Lipschitz, so is , thus, using the Cauchy-Schwarz inequality:
(11)  
Putting the above inequality back into Equation (10) and Lemma 3 back into Equation (9), we have:
(12) 
Finally, using Markov’s inequality, with probability , Lemma 2(b) gives us:
Denote . Using similar derivation in Equation (11), with probability we have:
This means, with probability , we have:
Denote the righthand side of Equation (12) as , and use Markov’s inequality again we have, with probability :
Therefore, with probability , we have
By solving the above “quadratic” inequality of and letting , we arrive at the end. ∎
5 Experiment
We evaluated our algorithm on the few-shot image classification task of the Omniglot [18] dataset, which consists of 20 instances of 1623 characters from 50 different alphabets. The dataset is augmented with rotations by multiples of 90 degrees following [19].
We employed the $N$-way $K$-shot protocol [7]: at each round, pick $N$ unseen characters irrespective of alphabets. Provide the meta-learner with $K$ different drawings of each of the $N$ characters as the training set, then evaluate the adapted model's ability on new unseen instances within the $N$ classes (namely the test set). We chose the 5-way 5-shot scheme, and used 15 samples per character for testing following [20].
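The episode construction described above can be sketched as follows. This is an illustrative sketch only: the dictionary `toy` is a hypothetical stand-in for the dataset, the helper `sample_episode` is not from the original code, and the bookkeeping needed to keep characters unseen across rounds is omitted.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot task as (train, test) lists of (example, label) pairs."""
    classes = rng.sample(sorted(data_by_class), n_way)
    train, test = [], []
    for label, cls in enumerate(classes):
        items = rng.sample(data_by_class[cls], k_shot + n_query)
        train += [(x, label) for x in items[:k_shot]]    # K drawings per class for adaptation
        test += [(x, label) for x in items[k_shot:]]     # held-out drawings for evaluation
    return train, test

# Hypothetical stand-in data: 30 classes with 20 instances each.
toy = {f"char{c}": [f"char{c}_img{i}" for i in range(20)] for c in range(30)}
train, test = sample_episode(toy, n_way=5, k_shot=5, n_query=15, rng=random.Random(0))
```

With 5 classes, 5 shots, and 15 test samples per class, each round yields 25 training examples and 75 test examples, and the two sets share no drawings.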
The model we used is a CNN following [7]. It contains 4 modules, each of which is a 3×3 convolution with 64 filters followed by batch normalization [21], a ReLU nonlinearity, and 2×2 max-pooling. Images are downsampled to 28×28 so that the resulting feature map of the last hidden layer is 1×1×64. The output of the last hidden layer is fed into a fully connected layer, and the loss we used is the cross-entropy loss.
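The stated feature-map size can be checked with a little shape arithmetic. The sketch below assumes stride-1 convolutions with one pixel of zero padding and non-overlapping pooling that discards any remainder. These settings are not given in the text and are assumptions made for the example.

```python
def conv_block_output(size, kernel=3, padding=1, pool=2):
    """Spatial size after a stride-1 convolution followed by non-overlapping max-pooling."""
    after_conv = size + 2 * padding - kernel + 1
    return after_conv // pool          # floor division: pooling drops the remainder

size, trace = 28, [28]
for _ in range(4):                     # four convolutional modules
    size = conv_block_output(size)
    trace.append(size)
```

Under these assumptions the spatial size goes 28, 14, 7, 3, 1, so with 64 filters the last hidden layer produces a single 64-dimensional feature vector per image.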
To study whether our algorithm provides any empirical benefit over traditional methods, we compare it to two benchmark algorithms [13]: Train on Everything (TOE) and Train from Scratch (TFS). On each round $t$, both initialize a new model. The difference is that TOE trains over all available data, both training and testing, from all past tasks, plus the training data of the current round, while TFS only uses the training data of the current round.
The experiments are performed in PyTorch [22], and parameters are set to their default values unless otherwise specified. The stepsize of the local adaptation strategy in Algorithm 1 is set to 0.1 everywhere, and the gradient descent step is performed only once for each task. For the AdaGrad-Norm algorithm (Algorithm 2), we set the hyperparameters as suggested in the original paper [15]. TFS and TOE used Adam [23] with default parameters.
The result is shown in Figure 1, which suggests that our algorithm gradually accumulates prior knowledge, which enables fast learning of later tasks. TFS provides a good example of how a CNN performs when the training data is scarce. In contrast, TOE behaves nearly like random guessing. The inferiority of TOE to TFS is somewhat surprising, as TOE has much more training data than TFS. The reason is that TOE regards all training data as coming from a single distribution, and tries to learn a model that works for all tasks. Thus, when tasks are substantially different from each other, TOE might even incur negative transfer and fail to solve any single task, as has been observed in [24]. Meanwhile, by using training data of the current task only, TFS avoids negative transfer, but also rules out learning any connection between tasks. Our algorithm, in contrast, is designed to discover common structures across tasks, and to use this information to guide fast adaptation to new tasks.
6 Conclusion
The continual lifelong learning problem is common in real life, where an agent needs to accumulate knowledge from every task it encounters and utilize that knowledge for fast learning of new tasks. To solve this problem, we can combine the meta-learning and the online learning paradigms to form the online meta-learning framework. In this work, we generalized this framework to the non-convex setting, and introduced the local regret to replace the original regret definition. We applied it to the stochastic setting, and showed its superiority both in theory and in practice. In future work, we would like to evaluate our algorithm on harder learning problems over larger-scale datasets.
Appendix A Appendix
A.1 Proof of Lemma 1
Lemma 1. Assuming Assumption 2, is Bounded, Lipschitz, and smooth.
Proof.
We first write out the complete formula of :
The Boundedness is straightforward.
To show the Lipschitzness, we derive :
Note that and both share the properties of , thus, from Assumption 2(a,b), we have:
Next, denoting as , we have :
where the first inequality uses the triangle inequality of a norm; the second inequality uses the smoothness and Hessian-Lipschitzness assumptions; the third inequality uses the smoothness assumption.
We are left to prove the last inequality:
where the first inequality uses the triangle inequality of a norm, and the second inequality uses the smoothness assumption. ∎
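The chain-rule computation behind the Lipschitzness argument can be checked numerically. The sketch below is an illustration under assumptions and is not the paper's proof. It takes the composite loss to be $h(w) = f(w - \alpha \hat{f}'(w))$ for a scalar parameter, uses the hypothetical smooth loss $\log\cosh(w - c)$ for both $f$ and $\hat{f}$, and compares the analytic derivative with a central finite difference.

```python
import math

alpha, c_train, c_test = 0.1, 0.5, 0.3    # illustrative values

def loss(w, c):
    """Toy smooth loss log(cosh(w - c))."""
    return math.log(math.cosh(w - c))

def d_loss(w, c):
    """First derivative tanh(w - c), bounded by 1 in absolute value."""
    return math.tanh(w - c)

def d2_loss(w, c):
    """Second derivative 1 - tanh(w - c)^2, which lies between 0 and 1."""
    return 1.0 - math.tanh(w - c) ** 2

def h(w):
    """Test loss evaluated after one adaptation step on the training loss."""
    return loss(w - alpha * d_loss(w, c_train), c_test)

def d_h(w):
    """Chain rule: (1 - alpha * training curvature) * test gradient at the adapted point."""
    u = w - alpha * d_loss(w, c_train)
    return (1.0 - alpha * d2_loss(w, c_train)) * d_loss(u, c_test)

def finite_difference(fun, w, eps=1e-6):
    return (fun(w + eps) - fun(w - eps)) / (2 * eps)

gaps = [abs(d_h(w) - finite_difference(h, w)) for w in (-2.0, 0.0, 1.5)]
```

The analytic derivative is a product of a factor controlled by the smoothness of the training loss and a factor controlled by the Lipschitzness of the test loss, which is why the composite inherits a Lipschitz bound from those two assumptions.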
A.2 Proof of Lemma 2
Lemma 2. As , and , Assumption 1 gives us:
Proof.
Note that denotes conditioning on and take expectation w.r.t. .
In Assumption 1(a) we assume for , the linearity of expectation immediately gives us .
To see the second part, we only need to expand as:
Each item of the first part in the last equation can be bounded by according to Assumption 1(b), which leads to a overall upperbound.
For the second part, we need to use the Mutual Independence assumption (namely Assumption 1(c)):
Using Assumption 1(a) again, we know that the above expression equals 0. This proves part (b) of this lemma. ∎
A.3 Proof of Lemma 3
Lemma 3. Given Assumption 2(d), we have: .