# No-regret Non-convex Online Meta-Learning

## Abstract

The online meta-learning framework is designed for the continual lifelong learning setting. It bridges two fields: meta-learning, which tries to extract prior knowledge from past tasks for fast learning of future tasks, and online learning, which deals with the sequential setting where problems are revealed one by one. In this paper, we generalize the original framework from the convex to the non-convex setting, and introduce the local regret as an alternative performance measure. We then apply this framework to stochastic settings, and show theoretically that it enjoys a logarithmic local regret and is robust to any hyperparameter initialization. An empirical test on a real-world task demonstrates its superiority over traditional methods.

Zhenxun Zhuang, Yunlong Wang, Kezi Yu, Songtao Lu

Department of Computer Science, Boston University, Boston, MA 02215
Advanced Analytics, IQVIA, Plymouth Meeting, PA 19462
IBM Research AI, IBM Thomas J. Watson Research Center, NY 10598

Keywords: Meta learning, online learning, non-convex optimization

## 1 Introduction

In recent years, high-capacity machine learning models, such as deep neural networks [1], have achieved remarkable successes in various domains [2, 3, 4]. However, domains where data is scarce remain a big challenge as those models’ ability to learn and generalize relies heavily on the abundance of training data. In contrast, humans can learn new skills and concepts very efficiently from just a few experiences. This is because when encountering a new task, learning algorithms start completely from scratch; while humans are typically armed with plenty of prior knowledge accumulated from past experience which may share overlapping structures with the current task, and thus can enable efficient learning of the new task.

Meta-learning [5, 6, 7] was designed to mimic this human ability. A meta-learning algorithm is first given a set of meta-training tasks assumed to be drawn from some distribution, and attempts to extract prior knowledge applicable to all tasks in the form of a meta-learner. This meta-learner is then evaluated on an unseen task, usually assumed to be drawn from a distribution similar to the one used for training. Recent years have seen a surge of interest in this field, resulting in numerous achievements, among which a seminal work is the gradient-based algorithm MAML [8]. Due to its simplicity, efficiency, and generality, it has initiated a fruitful line of research [9, 10, 11]. However, like other meta-learning algorithms, it assumes all meta-training tasks are available together as a batch, which doesn’t capture the sequential setting of continual lifelong learning in which new tasks are revealed one after another.

Meanwhile, online learning [12] specifically tackles the sequential setting. At each round $t$, one picks an $\mathbf{x}_t$ and suffers a loss $f_t(\mathbf{x}_t)$ revealed by a potentially adversarial environment. The goal is to minimize the regret, the difference between the cumulative loss suffered by the algorithm and that of any fixed predictor $\mathbf{x}$, formally:

$$
\text{Regret}_T(\mathbf{x}) := \sum_{t=1}^{T} f_t(\mathbf{x}_t) - \sum_{t=1}^{T} f_t(\mathbf{x}) . \tag{1}
$$

Yet, online learning sees the whole process as a single task without adaptation for each single step.
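As a small illustration of the regret in Equation (1), the sketch below evaluates it on a toy one-dimensional problem. The quadratic losses, the sequence of plays, and the comparator are all hypothetical choices made for this example only.

```python
def regret(losses, plays, comparator):
    """Regret of `plays` against one fixed `comparator`, as in Equation (1)."""
    incurred = sum(f(x) for f, x in zip(losses, plays))
    baseline = sum(f(comparator) for f in losses)
    return incurred - baseline


# Hypothetical toy problem: f_t(x) = (x - c_t)^2 with centers c_t.
centers = [1.0, 2.0, 3.0]
losses = [lambda x, c=c: (x - c) ** 2 for c in centers]
plays = [0.0, 1.0, 2.0]  # the learner always lags one center behind

toy_regret = regret(losses, plays, comparator=2.0)
print(toy_regret)  # incurred loss 3.0 minus comparator loss 2.0
```

The comparator is fixed across all rounds, which is exactly why this measure becomes less informative once the rounds correspond to very different tasks.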

Neither paradigm alone is ideal for the continual lifelong learning scenario. Finn et al. [13] therefore proposed to combine them into the Online Meta-Learning framework, which will be discussed in Section 2. However, this framework relies on a strong convexity assumption, while many problems of current interest are non-convex in nature. Thus, in Section 3, we generalize this framework to the non-convex setting. Section 4 presents an instantiation of our algorithm with rigorous theoretical proofs of its performance guarantee. Experimental results on real data are shown in Section 5. Finally, concluding remarks and takeaways are provided in Section 6. To the best of our knowledge, this is the first theoretical regret analysis for non-convex online meta-learning algorithms, shedding light on applying online meta-learning to more challenging learning problems in the paradigm of deep neural networks.

Notation. We use bold letters to denote vectors, e.g., $\mathbf{u}, \mathbf{v}$. The $i$th coordinate of a vector $\mathbf{u}$ is $u_i$. Unless explicitly noted, we study the Euclidean space $\mathbb{R}^d$ with the inner product $\langle \cdot, \cdot \rangle$ and the Euclidean norm $\|\cdot\|$. We assume everywhere that our objective function is bounded from below, so that its infimum is finite. The gradient of a function $f$ at $\mathbf{x}$ is $\nabla f(\mathbf{x})$. $\mathbb{E}_X[\cdot]$ means the expectation w.r.t. the underlying probability distribution of a random variable $X$.

## 2 Background

Algorithm 1 is the online meta-learning framework proposed in [13]. A meta-learner $\mathbf{w}$ is maintained to preserve the prior knowledge learned from past rounds. For each new task $t$, one is first given some training data $D_t^{tr}$ for adapting $\mathbf{w}_t$ to the current task following some strategy $U$. Then the test data $D_t^{ts}$ is revealed for evaluating the performance of the adapted learner $\hat{\mathbf{w}}_t = U(\mathbf{w}_t, D_t^{tr})$. The loss suffered at this round, $\ell(\hat{\mathbf{w}}_t, D_t^{ts})$, can then be fed into an online learning algorithm to update $\mathbf{w}_t$. We use $U(\mathbf{w}, D_t^{tr}) = \mathbf{w} - \alpha \nabla \ell(\mathbf{w}, D_t^{tr})$ following [13], where $\alpha$ is the step-size.
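The loop described above can be sketched in a few lines. This is only one possible reading of the framework on a scalar toy problem: the quadratic tasks, the finite-difference meta-gradient, and the fixed-step online gradient update (standing in for the online learner) are all assumptions made for illustration, not the algorithm analyzed later.

```python
def adapt(w, grad_train, alpha):
    """One-step adaptation U(w, D_tr) = w - alpha * (training-loss gradient)."""
    return w - alpha * grad_train(w)


def meta_gradient(w, grad_train, loss_test, alpha, eps=1e-6):
    """Central finite-difference estimate of d/dw loss_test(U(w, D_tr))."""
    up = loss_test(adapt(w + eps, grad_train, alpha))
    down = loss_test(adapt(w - eps, grad_train, alpha))
    return (up - down) / (2 * eps)


def online_meta_learning(tasks, w0, alpha, eta):
    """Adapt, evaluate on test data, then update the meta-learner each round."""
    w, test_losses = w0, []
    for grad_train, loss_test in tasks:
        w_hat = adapt(w, grad_train, alpha)       # adapt to the current task
        test_losses.append(loss_test(w_hat))      # loss suffered this round
        # Placeholder online learner: plain gradient step with fixed step eta.
        w = w - eta * meta_gradient(w, grad_train, loss_test, alpha)
    return w, test_losses


def make_task(c):
    """Hypothetical task: train and test loss are both 0.5 * (w - c)^2."""
    return (lambda w: w - c), (lambda w: 0.5 * (w - c) ** 2)


tasks = [make_task(c) for c in [1.0, 1.2, 0.8, 1.1, 0.9] * 20]
w_final, test_losses = online_meta_learning(tasks, w0=5.0, alpha=0.1, eta=0.5)
print(test_losses[0], test_losses[-1], w_final)
```

On this toy stream the test loss after adaptation shrinks as the meta-learner drifts toward the region shared by the tasks, which is the behavior the framework is meant to capture.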

As tasks can be very different, the original regret in Equation (1) of competing with a fixed learner across all tasks becomes less meaningful. Thus, Finn et al. [13] changed it to:

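$$
\text{Regret}'_T(\mathbf{w}) = \sum_{t=1}^{T} \ell\left(U(\mathbf{w}_t, D_t^{tr}), D_t^{ts}\right) - \sum_{t=1}^{T} \ell\left(U(\mathbf{w}, D_t^{tr}), D_t^{ts}\right) ,
$$

which competes with any fixed meta-learner. Under this measure, they designed the Follow the Meta Leader algorithm, which enjoys a logarithmic regret when the loss is assumed to be strongly convex.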

## 3 Problem Formulation

In this section, we generalize the online meta-learning algorithm to the non-convex setting by first demonstrating the infeasibility of a regret of form (1) and then introducing an alternative performance measure.

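Finding the global minimum of a non-convex function is in general known to be NP-hard. Yet, if we could find an online learning algorithm with an $o(T)$ regret for some non-convex function class, we could optimize any function $f$ of that class efficiently: simply run the online learning algorithm with the objective $f$ as the loss at each round, i.e., $\ell_t = f$ for all $t$, and choose a uniformly random iterate $\mathbf{w}_i$, $i \in \{1, \dots, T\}$, as the output. This gives us:

$$
\mathbb{E}_i\left[f(\mathbf{w}_i)\right] - \min_{\mathbf{w} \in \mathcal{K}} f(\mathbf{w}) = \frac{1}{T}\sum_{t=1}^{T} f(\mathbf{w}_t) - \min_{\mathbf{w} \in \mathcal{K}} f(\mathbf{w}) = \frac{1}{T}\sum_{t=1}^{T} \ell_t(\mathbf{w}_t) - \min_{\mathbf{w} \in \mathcal{K}} \frac{1}{T}\sum_{t=1}^{T} \ell_t(\mathbf{w}) \in o(1) ,
$$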

which leads to a contradiction unless P=NP. Thus, we have to find another performance measure for the non-convex case. One potential candidate is the local regret proposed by Hazan et al. [14]:

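$$
R_m(T) \triangleq \sum_{t=1}^{T} \left\| \nabla F_{t,m}(\mathbf{w}_t) \right\|^2 , \tag{2}
$$

where $F_{t,m}(\mathbf{w}) \triangleq \frac{1}{m}\sum_{i=0}^{m-1} \ell_{t-i}(\mathbf{w})$, $\ell_t(\mathbf{w}) \triangleq \ell\left(U(\mathbf{w}, D_t^{tr}), D_t^{ts}\right)$, and $\ell_t(\mathbf{w}) \equiv 0$ for $t \le 0$. The reason for using a sliding window in $F_{t,m}$, especially a large window, can be justified by Theorem 2.7 in [14].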
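To make the sliding-window measure concrete, the sketch below computes the local regret of Equation (2) for scalar iterates. It assumes the windowed objective averages the last $m$ losses with losses of non-positive index treated as zero, as in Hazan et al. [14]; the quadratic losses and iterates are hypothetical.

```python
def window_gradient(grads, t, m, w):
    """Gradient at w of the average of losses t, t-1, ..., t-m+1.

    grads[i] is the gradient of loss number i+1; losses with index <= 0
    are treated as identically zero and therefore contribute nothing.
    """
    total = sum(grads[t - 1 - i](w) for i in range(m) if t - 1 - i >= 0)
    return total / m


def local_regret(grads, iterates, m):
    """Sum over rounds of the squared windowed gradient (scalar case)."""
    return sum(window_gradient(grads, t, m, w) ** 2
               for t, w in enumerate(iterates, start=1))


# Hypothetical losses f_t(w) = (w - c_t)^2 with gradients 2 * (w - c_t).
centers = [0.0, 2.0]
grads = [lambda w, c=c: 2.0 * (w - c) for c in centers]
iterates = [1.0, 1.0]

print(local_regret(grads, iterates, m=1))  # per-round gradients 2 and -2
print(local_regret(grads, iterates, m=2))  # second-round window averages to 0
```

With a larger window the two opposing gradients cancel at the second round, so the iterate is judged against the smoothed objective rather than against each loss in isolation.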

## 4 Algorithm & Theoretical Guarantees

### 4.1 Stochasticity of Online Meta-learning Algorithms

In practice, $D_t^{ts}$ is typically just a random sample batch of the whole test set, so the losses and gradients obtained at each round are (unbiased) estimates of the true ones. This is the stochastic setting, which we formalize by making the following assumption.

###### Assumption 1.

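We assume that at each round $t$, each call to any stochastic gradient oracle $\mathbf{g}_i$, $i \in [t-m+1, t]$, yields an i.i.d. random vector $\mathbf{g}_i(\mathbf{w}_t, \xi_{t,i})$ with the following properties:

1. Unbiasedness: $\mathbb{E}_{\xi_{t,i}}\left[\mathbf{g}_i(\mathbf{w}_t, \xi_{t,i}) \mid \xi_{1:t-1}\right] = \nabla \ell_i(\mathbf{w}_t)$;

2. Bounded variance: $\mathbb{E}_{\xi_{t,i}}\left[\left\|\mathbf{g}_i(\mathbf{w}_t, \xi_{t,i}) - \nabla \ell_i(\mathbf{w}_t)\right\|^2 \mid \xi_{1:t-1}\right] \le \sigma^2$;

3. Mutual independence: for $i \ne j$,

$$
\mathbb{E}_{\xi_{t,i}, \xi_{t,j}}\left[\left\langle \mathbf{g}_i(\mathbf{w}_t, \xi_{t,i}), \mathbf{g}_j(\mathbf{w}_t, \xi_{t,j}) \right\rangle \mid \xi_{1:t-1}\right] = \left\langle \mathbb{E}_{\xi_{t,i}}\left[\mathbf{g}_i(\mathbf{w}_t, \xi_{t,i}) \mid \xi_{1:t-1}\right], \mathbb{E}_{\xi_{t,j}}\left[\mathbf{g}_j(\mathbf{w}_t, \xi_{t,j}) \mid \xi_{1:t-1}\right] \right\rangle ,
$$

where $\xi_{1:t-1}$ denotes the randomness of all rounds before $t$, and $\mathbb{E}[X \mid Y]$ denotes the conditional expectation of $X$ with respect to $Y$. Also note that $\mathbf{g}_i \equiv \mathbf{0}$ for $i \le 0$.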

Hazan et al. proposed a time-smoothed online gradient descent algorithm [14] for such a case. Yet, that algorithm’s performance critically relies on the choice of the step-size $\eta$, and it may even diverge if $\eta$ is too large relative to $1/\beta$, where $\beta$ is the (often unknown) smoothness of the loss function. We thus propose to use the AdaGrad-Norm [15] algorithm (Algorithm 2) as the online learning algorithm in Algorithm 1 instead. Here, $b_1^2 > 0$ is the initialization of the accumulated squared norms and prevents division by 0, while $\eta > 0$ is to ensure homogeneity and that the units match.
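A minimal sketch of an AdaGrad-Norm step is given below. The update form (accumulate the squared gradient norm, then take a step scaled by $\eta$ over the square root of the accumulator) is the one used in the regret analysis of Section 4.2; the quadratic objective and the initial values are hypothetical, and in the online meta-learning setting the gradient passed in would be the windowed stochastic gradient rather than an exact one.

```python
import math


def adagrad_norm_step(w, b_sq, grad, eta):
    """One AdaGrad-Norm update on a parameter list `w`.

    b_sq is the running sum of squared gradient norms (initialized to a
    small positive value to avoid division by zero); eta is the base step.
    """
    b_sq = b_sq + sum(g * g for g in grad)
    step = eta / math.sqrt(b_sq)
    w = [wi - step * gi for wi, gi in zip(w, grad)]
    return w, b_sq


# Hypothetical objective f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w, b_sq = [3.0, 4.0], 0.01
for _ in range(200):
    w, b_sq = adagrad_norm_step(w, b_sq, grad=list(w), eta=1.0)

final_norm = math.sqrt(sum(wi * wi for wi in w))
print(final_norm, b_sq)
```

Because the effective step size shrinks automatically as gradients accumulate, no knowledge of the smoothness constant is needed to pick `eta`, which is the motivation given above for preferring it to a fixed step-size.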

### 4.2 Regret Analysis

We present below an analysis of this algorithm assuming the loss function satisfies:

###### Assumption 2.

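$\ell$ is twice differentiable and, $\forall \mathbf{u}, \mathbf{v}$:

1. $L$-Lipschitz: $|\ell(\mathbf{u}) - \ell(\mathbf{v})| \le L \|\mathbf{u} - \mathbf{v}\|$.

2. $\beta$-smooth: $\|\nabla \ell(\mathbf{u}) - \nabla \ell(\mathbf{v})\| \le \beta \|\mathbf{u} - \mathbf{v}\|$.
Note that this implies [16, Lemma 1.2.3]:

$$
\left|\ell(\mathbf{v}) - \ell(\mathbf{u}) - \langle \nabla \ell(\mathbf{u}), \mathbf{v} - \mathbf{u} \rangle\right| \le \frac{\beta}{2}\|\mathbf{v} - \mathbf{u}\|^2 . \tag{3}
$$
3. $H$-Hessian-Lipschitz: $\|\nabla^2 \ell(\mathbf{u}) - \nabla^2 \ell(\mathbf{v})\| \le H \|\mathbf{u} - \mathbf{v}\|$.

4. $M$-Bounded: $|\ell(\mathbf{u})| \le M$.

Under Assumption 2 on $\ell$, we can derive the following properties of $\ell_t$ (the proof can be found in the Appendix):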

###### Lemma 1.

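Assuming Assumption 2 holds, $\ell_t$ is $M$-Bounded, $L'$-Lipschitz, and $\beta'$-smooth, where $L' = (1 + \alpha\beta)L$ and $\beta' = \alpha L H + (1 + \alpha\beta)^2 \beta$.

The following theorem shows that by selecting a window size $m$ proportional to $T$, a logarithmic regret of the algorithm is guaranteed for any initialization of the hyperparameters $b_1$ and $\eta$.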

###### Theorem 1.

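Let $\ell$ satisfy Assumption 2. Then, feeding Algorithm 2 into Algorithm 1 with access to stochastic gradient oracles satisfying Assumption 1 gives the following upper bound on $R_m(T)$, with probability $1 - \delta$:

$$
R_m(T) \le \frac{48 C^2}{\delta^2} + \frac{8 b_1 C}{\delta} + \frac{8 \sigma C \sqrt{T}}{\delta^{3/2}\sqrt{m}} ,
$$

where $C = \frac{4MT}{\eta m} + \frac{\eta\beta' + 4\sigma/\sqrt{m}}{2} \ln\left(1 + \frac{2(\sigma^2/m + L'^2)T}{b_1^2}\right)$.

Before showing the proof of Theorem 1, we need the following technical lemmas, whose proofs can be found in the Appendix. For simplicity, we write $\mathbb{E}_t[\cdot]$ for conditioning on $\xi_{1:t-1}$ and taking expectation w.r.t. $\xi_t$: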

###### Lemma 2.

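As $\mathbf{G}_{t,m}(\mathbf{w}_t) = \frac{1}{m}\sum_{i=0}^{m-1} \mathbf{g}_{t-i}(\mathbf{w}_t, \xi_{t,t-i})$, and $\nabla F_{t,m}(\mathbf{w}_t) = \frac{1}{m}\sum_{i=0}^{m-1} \nabla \ell_{t-i}(\mathbf{w}_t)$, Assumption 1 gives us: (a) $\mathbb{E}_t\left[\mathbf{G}_{t,m}(\mathbf{w}_t)\right] = \nabla F_{t,m}(\mathbf{w}_t)$; (b) $\mathbb{E}_t\left[\|\mathbf{G}_{t,m}(\mathbf{w}_t) - \nabla F_{t,m}(\mathbf{w}_t)\|^2\right] \le \sigma^2/m$.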

###### Lemma 3.

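Given Assumption 2(d), we have: $\sum_{t=1}^{T} \mathbb{E}\left[F_{t,m}(\mathbf{w}_t) - F_{t,m}(\mathbf{w}_{t+1})\right] \le \frac{4MT}{m}$.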

###### Lemma 4 ([17], Lemma 9).

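Let $h: [0, +\infty) \to [0, +\infty)$ be a nonincreasing function, and $a_i \ge 0$ for $i = 0, \dots, T$. Then

$$
\sum_{t=1}^{T} a_t \, h\left(a_0 + \sum_{i=1}^{t} a_i\right) \le \int_{a_0}^{\sum_{t=0}^{T} a_t} h(x)\, dx .
$$
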
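As a quick numerical sanity check of Lemma 4 in the case used later in the proof, the sketch below takes $h(x) = 1/x$, for which the integral on the right-hand side has the closed form $\ln\left(\sum_{t=0}^{T} a_t / a_0\right)$. The sequences are arbitrary illustrative values, with $a_0 > 0$ assumed so that the logarithm is defined.

```python
import math


def lemma4_sides(a0, a):
    """Return (lhs, rhs) of Lemma 4 for h(x) = 1/x, with a0 > 0 and a_t >= 0."""
    running, lhs = a0, 0.0
    for a_t in a:
        running += a_t          # a0 + a_1 + ... + a_t
        lhs += a_t / running    # a_t * h(a0 + a_1 + ... + a_t)
    rhs = math.log(running / a0)  # integral of 1/x from a0 to the full sum
    return lhs, rhs


lhs, rhs = lemma4_sides(1.0, [1.0, 2.0, 3.0])
print(lhs, rhs)  # 1/2 + 2/4 + 3/7 versus ln(7)
```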
###### Proof of Theorem 1.

The proof follows that of Theorem 2.1 in [15].

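First, as the average of $\beta'$-smooth functions, $F_{t,m}$ is also $\beta'$-smooth. Using the property in Assumption 2(b), namely Equation (3) with $\beta'$ in place of $\beta$, and the update formula (Line 5) in Algorithm 2, we have:

$$
\frac{F_{t,m}(\mathbf{w}_{t+1}) - F_{t,m}(\mathbf{w}_t)}{\eta} \le -\left\langle \nabla F_{t,m}(\mathbf{w}_t), \frac{\mathbf{G}_{t,m}(\mathbf{w}_t)}{b_{t+1}} \right\rangle + \frac{\eta\beta'}{2 b_{t+1}^2}\|\mathbf{G}_{t,m}(\mathbf{w}_t)\|^2 .
$$

Denote $\tilde{b}_{t+1}^2 = b_t^2 + \|\nabla F_{t,m}(\mathbf{w}_t)\|^2 + \sigma^2/m$, and take expectation w.r.t. $\xi_t$ conditioned on $\xi_{1:t-1}$ (namely $\mathbb{E}_t$):

$$
\begin{aligned}
& \frac{\mathbb{E}_t\left[F_{t,m}(\mathbf{w}_{t+1}) - F_{t,m}(\mathbf{w}_t)\right]}{\eta} && (4) \\
&\quad \le \mathbb{E}_t\left[\left(\frac{1}{\tilde{b}_{t+1}} - \frac{1}{b_{t+1}}\right)\left\langle \nabla F_{t,m}(\mathbf{w}_t), \mathbf{G}_{t,m}(\mathbf{w}_t) \right\rangle\right] && (5) \\
&\qquad - \frac{\|\nabla F_{t,m}(\mathbf{w}_t)\|^2}{\tilde{b}_{t+1}} + \frac{\eta\beta'}{2}\mathbb{E}_t\left[\frac{\|\mathbf{G}_{t,m}(\mathbf{w}_t)\|^2}{b_{t+1}^2}\right] . && (6)
\end{aligned}
$$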

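Second, from the definitions of $b_{t+1}$ and $\tilde{b}_{t+1}$ we have:

$$
\left|\frac{1}{\tilde{b}_{t+1}} - \frac{1}{b_{t+1}}\right| = \frac{\left|\|\mathbf{G}_{t,m}(\mathbf{w}_t)\|^2 - \|\nabla F_{t,m}(\mathbf{w}_t)\|^2 - \sigma^2/m\right|}{b_{t+1}\tilde{b}_{t+1}(b_{t+1} + \tilde{b}_{t+1})} \le \frac{\left|\|\mathbf{G}_{t,m}(\mathbf{w}_t)\| - \|\nabla F_{t,m}(\mathbf{w}_t)\|\right|}{b_{t+1}\tilde{b}_{t+1}} + \frac{\sigma/\sqrt{m}}{b_{t+1}\tilde{b}_{t+1}} .
$$

Using this, and Jensen’s inequality on $|\cdot|$, which is a convex function, we can upper-bound Equation (5) by its absolute value, which in turn can be upper-bounded by:

$$
\begin{aligned}
& \mathbb{E}_t\left[\frac{\left|\|\mathbf{G}_{t,m}(\mathbf{w}_t)\| - \|\nabla F_{t,m}(\mathbf{w}_t)\|\right| \, \|\mathbf{G}_{t,m}(\mathbf{w}_t)\| \, \|\nabla F_{t,m}(\mathbf{w}_t)\|}{b_{t+1}\tilde{b}_{t+1}}\right] && (7) \\
&\quad + \mathbb{E}_t\left[\frac{\|\mathbf{G}_{t,m}(\mathbf{w}_t)\| \, \|\nabla F_{t,m}(\mathbf{w}_t)\| \, \sigma/\sqrt{m}}{b_{t+1}\tilde{b}_{t+1}}\right] . && (8)
\end{aligned}
$$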

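Third, by using the inequality $xy \le \frac{\lambda}{2}x^2 + \frac{1}{2\lambda}y^2$ with $\lambda = \frac{2\sigma^2}{m\tilde{b}_{t+1}}$, $x = \frac{\|\mathbf{G}_{t,m}(\mathbf{w}_t)\|}{b_{t+1}}$, and $y = \frac{\left|\|\mathbf{G}_{t,m}(\mathbf{w}_t)\| - \|\nabla F_{t,m}(\mathbf{w}_t)\|\right| \, \|\nabla F_{t,m}(\mathbf{w}_t)\|}{\tilde{b}_{t+1}}$, Equation (7) can be upper bounded by:

$$
\frac{\|\nabla F_{t,m}(\mathbf{w}_t)\|^2}{4\tilde{b}_{t+1}} + \frac{\sigma}{\sqrt{m}}\mathbb{E}_t\left[\frac{\|\mathbf{G}_{t,m}(\mathbf{w}_t)\|^2}{b_{t+1}^2}\right] ,
$$

where we used Lemma 2(b), the fact that $\left|\|\mathbf{u}\| - \|\mathbf{v}\|\right| \le \|\mathbf{u} - \mathbf{v}\|$ holds for any $\mathbf{u}, \mathbf{v}$, and $\tilde{b}_{t+1} \ge \sigma/\sqrt{m}$.

Applying the same inequality again with the same $\lambda$ and $x$, but with $y = \frac{\sigma \|\nabla F_{t,m}(\mathbf{w}_t)\|}{\sqrt{m}\,\tilde{b}_{t+1}}$, we can upper bound Equation (8) by:

$$
\frac{\|\nabla F_{t,m}(\mathbf{w}_t)\|^2}{4\tilde{b}_{t+1}} + \frac{\sigma}{\sqrt{m}}\mathbb{E}_t\left[\frac{\|\mathbf{G}_{t,m}(\mathbf{w}_t)\|^2}{b_{t+1}^2}\right] .
$$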

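Fourth, putting the above two inequalities back as the bound on Equation (5), and combining the result with Equations (4) and (6), gives us:

$$
\frac{\mathbb{E}_t\left[F_{t,m}(\mathbf{w}_{t+1})\right] - F_{t,m}(\mathbf{w}_t)}{\eta} \le -\frac{\|\nabla F_{t,m}(\mathbf{w}_t)\|^2}{2\tilde{b}_{t+1}} + \left(\frac{\eta\beta'}{2} + \frac{2\sigma}{\sqrt{m}}\right)\mathbb{E}_t\left[\frac{\|\mathbf{G}_{t,m}(\mathbf{w}_t)\|^2}{b_{t+1}^2}\right] .
$$

Rearranging terms, then taking expectation w.r.t. $\xi_{1:t-1}$ on both sides and summing from $t = 1$ to $T$:

$$
\begin{aligned}
\sum_{t=1}^{T} \mathbb{E}\left[\frac{\|\nabla F_{t,m}(\mathbf{w}_t)\|^2}{2\tilde{b}_{t+1}}\right] &\le \frac{\sum_{t=1}^{T}\left[\mathbb{E}\left[F_{t,m}(\mathbf{w}_t)\right] - \mathbb{E}\left[F_{t,m}(\mathbf{w}_{t+1})\right]\right]}{\eta} && (9) \\
&\quad + \left(\frac{\eta\beta' + 4\sigma/\sqrt{m}}{2}\right)\mathbb{E}\left[\sum_{t=1}^{T} \frac{\|\mathbf{G}_{t,m}(\mathbf{w}_t)\|^2}{b_{t+1}^2}\right] . && (10)
\end{aligned}
$$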

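As $b_{t+1}^2 = b_1^2 + \sum_{i=1}^{t}\|\mathbf{G}_{i,m}(\mathbf{w}_i)\|^2$, letting $h(x)$ be $1/x$ in Lemma 4 gives us:

$$
\mathbb{E}\left[\sum_{t=1}^{T} \frac{\|\mathbf{G}_{t,m}(\mathbf{w}_t)\|^2}{b_{t+1}^2}\right] \le \ln\left(1 + \frac{\sum_{t=1}^{T}\mathbb{E}\left[\|\mathbf{G}_{t,m}(\mathbf{w}_t)\|^2\right]}{b_1^2}\right) ,
$$

where we used Jensen’s inequality for $\ln(\cdot)$, which is a concave function on $(0, +\infty)$.

Since each $\ell_t$ is $L'$-Lipschitz, so is $F_{t,m}$. Thus, using the Cauchy-Schwarz inequality:

$$
\begin{aligned}
\mathbb{E}\left[\|\mathbf{G}_{t,m}(\mathbf{w}_t)\|^2\right] &\le 2\,\mathbb{E}\left[\|\mathbf{G}_{t,m}(\mathbf{w}_t) - \nabla F_{t,m}(\mathbf{w}_t)\|^2\right] + 2\,\mathbb{E}\left[\|\nabla F_{t,m}(\mathbf{w}_t)\|^2\right] && (11) \\
&\le 2\left(\sigma^2/m + L'^2\right) .
\end{aligned}
$$

Putting the above inequality back into Equation (10) and Lemma 3 back into Equation (9), we have:

$$
\sum_{t=1}^{T} \mathbb{E}\left[\frac{\|\nabla F_{t,m}(\mathbf{w}_t)\|^2}{2\tilde{b}_{t+1}}\right] \le \frac{4MT}{\eta m} + \left(\frac{\eta\beta' + 4\sigma/\sqrt{m}}{2}\right)\ln\left(1 + \frac{2(\sigma^2/m + L'^2)T}{b_1^2}\right) . \tag{12}
$$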

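Finally, using Markov’s inequality, with probability $1 - \delta_1$, Lemma 2(b) gives us:

$$
\sum_{t=1}^{T} \|\nabla F_{t,m}(\mathbf{w}_t) - \mathbf{G}_{t,m}(\mathbf{w}_t)\|^2 \le \frac{T\sigma^2}{m\delta_1} .
$$

Denote $Z = \sum_{t=1}^{T}\|\nabla F_{t,m}(\mathbf{w}_t)\|^2$. Using a derivation similar to that of Equation (11), with probability $1 - \delta_1$ we have:

$$
b_T^2 + \|\nabla F_{T,m}(\mathbf{w}_T)\|^2 + \sigma^2/m \le b_1^2 + 2Z + \frac{2T\sigma^2}{m\delta_1} .
$$

This means that, with probability $1 - \delta_1$, we have:

$$
\sum_{t=1}^{T} \frac{\|\nabla F_{t,m}(\mathbf{w}_t)\|^2}{2\tilde{b}_{t+1}} \ge \frac{\sum_{t=1}^{T}\|\nabla F_{t,m}(\mathbf{w}_t)\|^2}{2\sqrt{b_T^2 + \|\nabla F_{T,m}(\mathbf{w}_T)\|^2 + Z + \sigma^2/m}} \ge \frac{\sum_{t=1}^{T}\|\nabla F_{t,m}(\mathbf{w}_t)\|^2}{2\sqrt{b_1^2 + 3Z + \frac{2T\sigma^2}{m\delta_1}}} .
$$

Denote the right-hand side of Equation (12) as $C$. Using Markov’s inequality again, we have, with probability $1 - \delta_2$:

$$
\sum_{t=1}^{T} \frac{\|\nabla F_{t,m}(\mathbf{w}_t)\|^2}{2\tilde{b}_{t+1}} \le \frac{C}{\delta_2} .
$$

Therefore, with probability $1 - \delta_1 - \delta_2$, we have

$$
\frac{Z}{2\sqrt{b_1^2 + 3Z + \frac{2T\sigma^2}{m\delta_1}}} \le \frac{C}{\delta_2} .
$$

By solving the above “quadratic” inequality in $Z$ and letting $\delta_1 = \delta_2 = \delta/2$, we arrive at the claimed bound. ∎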
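For completeness, the last step can be unpacked as follows. This is a sketch of one way to solve the inequality, not necessarily the authors' exact route: it writes $Z$ for the sum of squared windowed gradient norms (that is, $R_m(T)$), takes $C$ as in Theorem 1, assumes the split $\delta_1 = \delta_2 = \delta/2$, and uses $\sqrt{x + y} \le \sqrt{x} + \sqrt{y}$ twice.

```latex
\begin{align*}
Z^2 &\le \frac{4C^2}{\delta_2^2}\,(A + 3Z),
  \qquad A \triangleq b_1^2 + \frac{2T\sigma^2}{m\delta_1}, \\
Z &\le \frac{1}{2}\left(\frac{12C^2}{\delta_2^2}
      + \sqrt{\frac{144C^4}{\delta_2^4} + \frac{16C^2A}{\delta_2^2}}\right)
   \le \frac{12C^2}{\delta_2^2} + \frac{2C}{\delta_2}\sqrt{A} \\
  &\le \frac{12C^2}{\delta_2^2} + \frac{2b_1C}{\delta_2}
      + \frac{2\sigma C}{\delta_2}\sqrt{\frac{2T}{m\delta_1}} \\
  &= \frac{48C^2}{\delta^2} + \frac{4b_1C}{\delta}
      + \frac{8\sigma C\sqrt{T}}{\delta^{3/2}\sqrt{m}}
   \qquad (\delta_1 = \delta_2 = \delta/2).
\end{align*}
```

Since $4 b_1 C/\delta \le 8 b_1 C/\delta$, this is no larger than the bound stated in Theorem 1.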

## 5 Experiment

We evaluated our algorithm on the few-shot image classification task of the Omniglot [18] dataset which consists of 20 instances of 1623 characters from 50 different alphabets. The dataset is augmented with rotations by multiples of 90 degrees following [19].

We employed the $N$-way $K$-shot protocol [7]: at each round, pick $N$ unseen characters irrespective of alphabets. Provide the meta-learner $\mathbf{w}_t$ with $K$ different drawings of each of the $N$ characters as the training set $D_t^{tr}$, then evaluate the adapted model $\hat{\mathbf{w}}_t$ on new unseen instances within the $N$ classes (namely the test set $D_t^{ts}$). We chose the 5-way 5-shot scheme, and used 15 samples per character for testing following [20].

The model we used is a CNN following [7]. It contains 4 modules, each of which is a 3 × 3 convolution with 64 filters followed by batch normalization [21], a ReLU non-linearity, and 2 × 2 max-pooling. Images are downsampled to 28 × 28 so that the resulting feature map of the last hidden layer is 1 × 1 × 64. The last layer is fed into a fully connected layer, and the loss we used is the cross-entropy loss.
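The stated feature-map size can be checked with a few lines of arithmetic. The sketch below assumes that each convolution preserves the spatial size (for example through padding of 1) and that each 2 × 2 max-pooling uses stride 2 with floor rounding; neither detail is stated in the text, so both are assumptions.

```python
def feature_map_sides(side, n_modules, pool=2):
    """Spatial side length after each (size-preserving conv + pooling) module."""
    sizes = [side]
    for _ in range(n_modules):
        side = side // pool  # assumed: floor-mode pooling with stride `pool`
        sizes.append(side)
    return sizes


sides = feature_map_sides(28, n_modules=4)
print(sides)  # [28, 14, 7, 3, 1]
```

With 64 filters in the last module, a final side length of 1 gives the 1 × 1 × 64 feature map that is fed to the fully connected layer.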

To study whether our algorithm provides any empirical benefit over traditional methods, we compare it to two benchmark algorithms [13]: Train on Everything (TOE) and Train from Scratch (TFS). On each round $t$, both initialize a new model. The difference is that TOE trains over all available data, both training and testing, from all past tasks, plus $D_t^{tr}$ at the current round, while TFS only uses $D_t^{tr}$ for training.

The experiments are performed in PyTorch [22], and parameters are left at their defaults unless otherwise specified. For the step-size $\alpha$ in the local adaptation strategy $U$ in Algorithm 1, we set it to 0.1 everywhere, and the gradient descent step is performed only once for each task. For the AdaGrad-Norm algorithm (Algorithm 2), we set its hyperparameters as suggested in the original paper [15]. TFS and TOE used Adam [23] with default parameters.

The result is shown in Figure 1, which suggests that our algorithm gradually accumulates prior knowledge, which in turn enables fast learning of later tasks. TFS provides a good example of how a CNN performs when the training data is scarce. In contrast, TOE behaves nearly like random guessing. The inferiority of TOE to TFS is somewhat surprising, as TOE has much more training data than TFS. The reason is that TOE regards all training data as coming from a single distribution, and tries to learn a model that works for all tasks. Thus, when tasks are substantially different from each other, TOE might even incur negative transfer and fail to solve any single task, as has been observed in [24]. Meanwhile, by using training data of the current task only, TFS avoids negative transfer, but also rules out learning any connection between tasks. Our algorithm, in contrast, is designed to discover common structures across tasks, and to use this information to guide fast adaptation to new tasks.

## 6 Conclusion

The continual lifelong learning problem is common in real life, where an agent needs to accumulate knowledge from every task it encounters and utilize that knowledge for fast learning of new tasks. To solve this problem, we can combine the meta-learning and online learning paradigms to form the online meta-learning framework. In this work, we generalized this framework to the non-convex setting, and introduced the local regret to replace the original regret definition. We applied it to the stochastic setting, and showed its superiority both in theory and in practice. In future work, we would like to evaluate our algorithm on harder learning problems over larger-scale datasets.

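## Appendix A

### A.1 Proof of Lemma 1

Lemma 1. Assuming Assumption 2 holds, $\ell_t$ is $M$-Bounded, $L'$-Lipschitz, and $\beta'$-smooth, where $L' = (1 + \alpha\beta)L$ and $\beta' = \alpha L H + (1 + \alpha\beta)^2 \beta$.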

###### Proof.

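We first write out the complete formula of $\ell_t$:

$$
\begin{aligned}
\ell_t(\mathbf{w}) &= \ell(\hat{\mathbf{w}}, D_t^{ts}) \\
&= \mathbb{E}_{\mathbf{x}_{ts}, y_{ts} \sim D_t^{ts}}\left[\ell\left(U(\mathbf{w}, D_t^{tr}), \mathbf{x}_{ts}; y_{ts}\right)\right] \\
&= \mathbb{E}_{\mathbf{x}_{ts}, y_{ts} \sim D_t^{ts}}\left[\ell\left(\mathbf{w} - \alpha\nabla\mathbb{E}_{\mathbf{x}_{tr}, y_{tr} \sim D_t^{tr}}\left[\ell(\mathbf{w}, \mathbf{x}_{tr}; y_{tr})\right], \mathbf{x}_{ts}; y_{ts}\right)\right] \\
&\triangleq f_t\left(\mathbf{w} - \alpha\nabla\hat{f}_t(\mathbf{w})\right) .
\end{aligned}
$$

The $M$-Boundedness is straightforward.

To show the Lipschitzness, we derive $\nabla\ell_t(\mathbf{w})$:

$$
\nabla\ell_t(\mathbf{w}) = \left(I - \alpha\nabla^2\hat{f}_t(\mathbf{w})\right)\nabla f_t\left(\mathbf{w} - \alpha\nabla\hat{f}_t(\mathbf{w})\right) .
$$

Note that $f_t$ and $\hat{f}_t$ both share the properties of $\ell$. Thus, from Assumption 2(a,b), we have:

$$
\|\nabla\ell_t(\mathbf{w})\| \le (1 + \alpha\beta)\left\|\nabla f_t\left(\mathbf{w} - \alpha\nabla\hat{f}_t(\mathbf{w})\right)\right\| \le (1 + \alpha\beta)L .
$$

Next, denoting $\mathbf{w} - \alpha\nabla\hat{f}_t(\mathbf{w})$ as $U_t(\mathbf{w})$, we have $\forall \mathbf{u}, \mathbf{v}$:

$$
\begin{aligned}
& \|\nabla\ell_t(\mathbf{u}) - \nabla\ell_t(\mathbf{v})\| \\
&= \|\nabla U_t(\mathbf{u})\nabla f_t(U_t(\mathbf{u})) - \nabla U_t(\mathbf{v})\nabla f_t(U_t(\mathbf{v}))\| \\
&= \|\nabla U_t(\mathbf{u})\nabla f_t(U_t(\mathbf{u})) - \nabla U_t(\mathbf{v})\nabla f_t(U_t(\mathbf{u})) + \nabla U_t(\mathbf{v})\nabla f_t(U_t(\mathbf{u})) - \nabla U_t(\mathbf{v})\nabla f_t(U_t(\mathbf{v}))\| \\
&\le \|(\nabla U_t(\mathbf{u}) - \nabla U_t(\mathbf{v}))\nabla f_t(U_t(\mathbf{u}))\| + \|\nabla U_t(\mathbf{v})(\nabla f_t(U_t(\mathbf{u})) - \nabla f_t(U_t(\mathbf{v})))\| \\
&= \alpha\|(\nabla^2\hat{f}_t(\mathbf{u}) - \nabla^2\hat{f}_t(\mathbf{v}))\nabla f_t(U_t(\mathbf{u}))\| + \|(I - \alpha\nabla^2\hat{f}_t(\mathbf{v}))(\nabla f_t(U_t(\mathbf{u})) - \nabla f_t(U_t(\mathbf{v})))\| \\
&\le \alpha L H\|\mathbf{u} - \mathbf{v}\| + (1 + \alpha\beta)\|\nabla f_t(U_t(\mathbf{u})) - \nabla f_t(U_t(\mathbf{v}))\| \\
&\le \alpha L H\|\mathbf{u} - \mathbf{v}\| + (1 + \alpha\beta)\beta\|U_t(\mathbf{u}) - U_t(\mathbf{v})\| \\
&\le \alpha L H\|\mathbf{u} - \mathbf{v}\| + (1 + \alpha\beta)^2\beta\|\mathbf{u} - \mathbf{v}\| ,
\end{aligned}
$$

where the first inequality uses the triangle inequality of a norm; the second inequality uses the Lipschitzness, smoothness, and Hessian-Lipschitzness assumptions; the third inequality uses the smoothness assumption.

We are left to prove the last inequality:

$$
\begin{aligned}
\|U_t(\mathbf{u}) - U_t(\mathbf{v})\| &= \|\mathbf{u} - \alpha\nabla\hat{f}_t(\mathbf{u}) - \mathbf{v} + \alpha\nabla\hat{f}_t(\mathbf{v})\| \\
&= \|\mathbf{u} - \mathbf{v} - \alpha(\nabla\hat{f}_t(\mathbf{u}) - \nabla\hat{f}_t(\mathbf{v}))\| \\
&\le \|\mathbf{u} - \mathbf{v}\| + \alpha\|\nabla\hat{f}_t(\mathbf{u}) - \nabla\hat{f}_t(\mathbf{v})\| \\
&\le (1 + \alpha\beta)\|\mathbf{u} - \mathbf{v}\| ,
\end{aligned}
$$

where the first inequality uses the triangle inequality of a norm, and the second inequality uses the smoothness assumption. ∎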
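The gradient formula used in this proof can be checked numerically in one dimension. The sketch below picks hypothetical non-convex train and test losses ($\hat{f}(w) = \cos w$ and $f(w) = \sin 2w$, chosen only for illustration), compares the closed-form meta-gradient against a central finite difference, and verifies the Lipschitz bound $(1 + \alpha\beta)L$ with $L = 2$ and $\beta = 4$, which are valid constants for both toy functions.

```python
import math

ALPHA = 0.1  # adaptation step-size (hypothetical value)


def f_hat_grad(w):
    """Derivative of the toy training loss f_hat(w) = cos(w)."""
    return -math.sin(w)


def f_hat_hess(w):
    """Second derivative of the toy training loss."""
    return -math.cos(w)


def f(w):
    """Toy test loss f(w) = sin(2w)."""
    return math.sin(2 * w)


def f_grad(w):
    return 2 * math.cos(2 * w)


def meta_loss(w):
    """Test loss evaluated after one adaptation step on the training loss."""
    return f(w - ALPHA * f_hat_grad(w))


def meta_grad(w):
    """Closed form: (1 - alpha * f_hat''(w)) * f'(w - alpha * f_hat'(w))."""
    return (1 - ALPHA * f_hat_hess(w)) * f_grad(w - ALPHA * f_hat_grad(w))


def finite_diff(fn, w, eps=1e-6):
    return (fn(w + eps) - fn(w - eps)) / (2 * eps)


for w in (-1.0, 0.3, 2.0):
    print(w, meta_grad(w), finite_diff(meta_loss, w))
```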

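### A.2 Proof of Lemma 2

Lemma 2. As $\mathbf{G}_{t,m}(\mathbf{w}_t) = \frac{1}{m}\sum_{i=0}^{m-1} \mathbf{g}_{t-i}(\mathbf{w}_t, \xi_{t,t-i})$, and $\nabla F_{t,m}(\mathbf{w}_t) = \frac{1}{m}\sum_{i=0}^{m-1} \nabla \ell_{t-i}(\mathbf{w}_t)$, Assumption 1 gives us: (a) $\mathbb{E}_t\left[\mathbf{G}_{t,m}(\mathbf{w}_t)\right] = \nabla F_{t,m}(\mathbf{w}_t)$; (b) $\mathbb{E}_t\left[\|\mathbf{G}_{t,m}(\mathbf{w}_t) - \nabla F_{t,m}(\mathbf{w}_t)\|^2\right] \le \sigma^2/m$.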

###### Proof.

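Note that $\mathbb{E}_t$ denotes conditioning on $\xi_{1:t-1}$ and taking expectation w.r.t. $\xi_t$.

In Assumption 1(a) we assume $\mathbb{E}_t\left[\mathbf{g}_i(\mathbf{w}_t, \xi_{t,i})\right] = \nabla\ell_i(\mathbf{w}_t)$ for $i \in [t-m+1, t]$, so the linearity of expectation immediately gives us $\mathbb{E}_t\left[\mathbf{G}_{t,m}(\mathbf{w}_t)\right] = \nabla F_{t,m}(\mathbf{w}_t)$, which is part (a).

To see the second part, we only need to expand $\mathbb{E}_t\left[\|\mathbf{G}_{t,m}(\mathbf{w}_t) - \nabla F_{t,m}(\mathbf{w}_t)\|^2\right]$ as:

$$
\begin{aligned}
& \frac{1}{m^2}\mathbb{E}_t\left[\left\|\sum_{i=0}^{m-1}\left(\mathbf{g}_{t-i}(\mathbf{w}_t, \xi_{t,t-i}) - \nabla\ell_{t-i}(\mathbf{w}_t)\right)\right\|^2\right] \\
&= \frac{1}{m^2}\sum_{i=0}^{m-1}\sum_{j=0}^{m-1}\mathbb{E}_t\left[\left\langle \mathbf{g}_{t-i}(\mathbf{w}_t, \xi_{t,t-i}) - \nabla\ell_{t-i}(\mathbf{w}_t), \mathbf{g}_{t-j}(\mathbf{w}_t, \xi_{t,t-j}) - \nabla\ell_{t-j}(\mathbf{w}_t) \right\rangle\right] \\
&= \frac{1}{m^2}\sum_{i=0}^{m-1}\mathbb{E}_t\left[\left\|\mathbf{g}_{t-i}(\mathbf{w}_t, \xi_{t,t-i}) - \nabla\ell_{t-i}(\mathbf{w}_t)\right\|^2\right] \\
&\quad + \frac{1}{m^2}\sum_{i \ne j}\mathbb{E}_t\left[\left\langle \mathbf{g}_{t-i}(\mathbf{w}_t, \xi_{t,t-i}) - \nabla\ell_{t-i}(\mathbf{w}_t), \mathbf{g}_{t-j}(\mathbf{w}_t, \xi_{t,t-j}) - \nabla\ell_{t-j}(\mathbf{w}_t) \right\rangle\right] .
\end{aligned}
$$

Each item of the first part in the last equation can be bounded by $\sigma^2$ according to Assumption 1(b), which leads to an overall upper bound of $\sigma^2/m$.

For the second part, we need to use the Mutual Independence assumption (namely Assumption 1(c)):

$$
\begin{aligned}
& \mathbb{E}_t\left[\left\langle \mathbf{g}_{t-i}(\mathbf{w}_t, \xi_{t,t-i}) - \nabla\ell_{t-i}(\mathbf{w}_t), \mathbf{g}_{t-j}(\mathbf{w}_t, \xi_{t,t-j}) - \nabla\ell_{t-j}(\mathbf{w}_t) \right\rangle\right] \\
&= \left\langle \mathbb{E}_t\left[\mathbf{g}_{t-i}(\mathbf{w}_t, \xi_{t,t-i}) - \nabla\ell_{t-i}(\mathbf{w}_t)\right], \mathbb{E}_t\left[\mathbf{g}_{t-j}(\mathbf{w}_t, \xi_{t,t-j}) - \nabla\ell_{t-j}(\mathbf{w}_t)\right] \right\rangle .
\end{aligned}
$$

Using Assumption 1(a) again, we know that the above expression equals 0. This proves part (b) of this lemma. ∎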
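The variance reduction in part (b) can also be observed empirically. The Monte Carlo sketch below is a scalar illustration under assumptions of its own: each oracle's error is modeled as independent zero-mean Gaussian noise with standard deviation `sigma`, so the squared error of an average of `m` such terms should concentrate around `sigma**2 / m`.

```python
import random


def averaged_noise_sq(m, sigma, rng):
    """Squared error of the average of m independent zero-mean noise terms."""
    avg = sum(rng.gauss(0.0, sigma) for _ in range(m)) / m
    return avg * avg


def estimate_window_variance(m, sigma, trials=20000, seed=0):
    """Monte Carlo estimate of the expected squared error of the average."""
    rng = random.Random(seed)
    total = sum(averaged_noise_sq(m, sigma, rng) for _ in range(trials))
    return total / trials


var_m1 = estimate_window_variance(m=1, sigma=1.0)
var_m10 = estimate_window_variance(m=10, sigma=1.0)
print(var_m1, var_m10)  # roughly sigma^2 and sigma^2 / 10
```

The independence between oracles is what removes the cross terms; with correlated noise the averaged error would not shrink at this rate.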

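### A.3 Proof of Lemma 3

Lemma 3. Given Assumption 2(d), we have: $\sum_{t=1}^{T} \mathbb{E}\left[F_{t,m}(\mathbf{w}_t) - F_{t,m}(\mathbf{w}_{t+1})\right] \le \frac{4MT}{m}$.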

###### Proof.
 T∑t=1E[Ft,m(wt)−Ft,m(wt+1)] = T∑t=2E[Ft,m(wt)−Ft−1,m(wt)]+F1,m(w1)−E[FT,m(wT+1)] = T∑t=21mm−1∑i=0E[ℓt−i(wt)−ℓt−1−i(wt)]+ℓ1(w1)−1mm−1∑i=0E[ℓT−i(wT+1)] = T∑t=21mE[ℓt(wt)−ℓt−m(wt)]+ℓ1(w1)−1mm−1∑i=0