Nonstationary Nonparametric Online Learning: Balancing Dynamic Regret and Model Parsimony
An open challenge in supervised learning is conceptual drift: a data point begins as classified according to one label, but over time the notion of that label changes. Beyond linear autoregressive models, transfer and meta learning address drift, but require data that is representative of disparate domains at the outset of training. To relax this requirement, we propose a memory-efficient online universal function approximator based on compressed kernel methods. Our approach hinges upon viewing non-stationary learning as online convex optimization with dynamic comparators, for which performance is quantified by dynamic regret.
Prior works control dynamic regret growth only for linear models. In contrast, we hypothesize actions belong to reproducing kernel Hilbert spaces (RKHS). We propose a functional variant of online gradient descent (OGD) operating in tandem with greedy subspace projections. Projections are necessary to surmount the fact that RKHS functions have complexity proportional to time.
For this scheme, we establish sublinear dynamic regret growth in terms of both loss variation and functional path length, and that the memory of the function sequence remains moderate. Experiments demonstrate the usefulness of the proposed technique for online nonlinear regression and classification problems with non-stationary data.
A well-known challenge in supervised learning is conceptual drift: a data point begins as classified according to one label, but over time the notion of that label changes. For example, an autonomous agent classifies the terrain it traverses as grass, but as the sun sets, the grass darkens. The class label has not changed, but the data distribution has. Mathematically, this situation may be encapsulated by supervised learning with time-series data. Classical approaches assume the current estimate depends linearly on its past values, as in autoregressive models [akaike1969fitting], for which parameter tuning is not difficult [brillinger1981time]. While successful in simple settings, these approaches do not apply to classification, alternate quantifiers of model fitness, or universal statistical models such as deep networks [haykin1994neural] or kernel methods [berlinet2011reproducing]. Such modern tools are essential to learning unknown dynamics when assumptions of linear additive Gaussian noise in system identification are invalid, for instance [aastrom1971system, haykin1997adaptive].
In the presence of non-stationarity, efforts to train models beyond linear have focused on recurrent networks [jaeger2002tutorial], but such approaches inherently require the temporal patterns of the past and future to be similar. In contrast, transfer learning seeks to adapt a statistical model trained on one domain to another [pan2010survey], but requires (1) data to be available in advance of training, and (2) a priori knowledge of when domain shifts happen, typically based on hand-crafted features. Meta-learning overcomes the need for hand-crafted statistics of domain shift by collecting experience over disparate domains and discerning decisions that are good with respect to several environments’ training objectives [thrun2012learning]. Combining such approaches with deep networks have yielded compelling results recently [andrychowicz2016learning, finn2017meta], although they still require (1) offline training. Hence, in domains where a priori data collection is difficult, due to, e.g., lack of cloud access or rapid changes in the environment, transfer and meta-learning do not apply. In these instances, online training is required.
For online training, there are two possible approaches to define learning in the presence of non-stationarity: expected risk minimization [Vapnik1995, friedman2001elements], and online convex optimization (OCO) [shalev2012online]. The former approach, due to the fact the data distribution is time-varying distribution, requires the development of stochastic algorithms whose convergence is attuned to temporal aspects of the distribution such as mixing rates [borkar2009stochastic, mohri2010stability]. Although mixing rates are difficult to obtain, they substantially impact performance [nagabandi2018deep]. To mitigate these difficulties, we operate within online convex optimization.
Online convex optimization OCO formulates supervised learning in a distribution-free manner [shalev2012online]. At each time, a learner selects action after which an arbitrary convex cost is evaluated as well as parameters of the cost , i.e., the learner suffers cost . Typically, actions are defined by a parameter vector. In contrast, we hypothesize actions belong to a function space motivated by nonparametric regression whose details will be deferred to later sections [wasserman2006all]. In classic OCO, one compares cost with a single best action in hindsight; however, with non-stationarity, the quintessential quantifier of performance is instead dynamic regret, defined as the cost accumulation as compared with a best action at each time:
OCO concerns the design of methods such that grows sublinearly in horizon for a given sequence , i.e., the average regret goes to null with ( referred to as no-regret [Zinkevich2003]). Observe that , in general, decouples the problem into time-invariant optimization problems since the minimizer is inside the sum. However, in practice, temporal dependence is intrinsic, as in wireless communications [heath1998simple], autonomous path planning [vernaza2008online, turchetta2016safe], or obstacle detection [wurm2010octomap]. Thus, we define (I) in terms of an augmented cost-data pair which arises from several times, either due to new or previously observed pairs . Specifications of to time-windowing or batching are discussed in Sec. II.
|Reference||Regret Notion||Loss||Function Class||Regret Bound|
|[mokhtari2016online, bedi]||Strongly convex||Parametric|
|This Work||Strongly convex||Nonparametric|
I-a Related Work and Contributions
OCO seeks to develop algorithms whose regret grows sublinearly in time horizon . In the static case, the simplest approach is online gradient descent (OGD), which selects the next action to descend along the gradient of the loss at the current time. OGD attains static regret growth when losses are convex [Zinkevich2003] and strongly convex [hazan2007logarithmic], respectively. See Table I for a summary of related works.
The plot thickens when we shift focus to dynamic regret: in particular, [besbes] establishes the impossibility of attaining sublinear dynamic regret, meaning that one cannot track an optimizer varying arbitrarily across time, a fact discerned from an optimization perspective in [simonetto2016class]. Moreover, [besbes] shows that dynamic regret to be an irreducible function of quantifiers of the problem dynamics called the cost function variation and variable variation (definitions in Sec. II). Thus, several works establish sublinear growth of dynamic regret up to factors depending on and , i.e., for OGD or mirror descent with convex losses [Zinkevich2003, hall2015online], more complicated expressions that depend on , the variation of instantaneous gradients [jadbabaie2015online], and for strongly convex losses [mokhtari2016online].
The aforementioned works entirely focus on the case where decisions define a linear model , which, by the estimation-approximation error tradeoff [friedman2001elements], yield small dynamic regret at the cost of large approximation error. Hypothetically, one would like actions to be chosen from a universal function class such as a deep neural network (DNN) [tikhomirov1991representation, scarselli1998universal] or RKHS [park1991universal] while attaining no-regret. It’s well-understood that no-regret algorithms often prescribe convexity of the loss with respect to actions as a prerequisite [shalev2012online], thus precluding the majority of DNN parameterizations. While exceptions to this statement exist [amos2017input], instead we focus on parameterizations defined in nonparametric statistics [wasserman2006all], namely, RKHS [berlinet2011reproducing], due to the fact they yield universality and convexity. Doing so allows us to attain methods that are both no-regret and universal in the non-stationary setting. We note that [shen2019random] considers a similar setting based on random features [rahimi2008random], but its design cannot be tuned to the learning dynamics; and yields faster regret growth.
Contributions We propose a variant of OGD adapted to RKHS. A challenge for this setting is that the function parameterization stores all observations from the past [Kivinen2004], via the Representer Theorem [scholkopfgeneralized]. To surmount this hurdle, we greedily project the functional OGD iterates onto subspaces constructed from subsets of points observed thus far which are -close in RKHS norm (Algorithm 1), as in [koppel2019parsimonious, koppelconsistent], which allows us to explicitly tune the sub-optimality caused by function approximation, in contrast to random feature expansions [rahimi2008random]. Doing so allows us to establish sublinear dynamic regret in terms of both the loss function variation (Theorem 1) and function space path length (Theorem 2). Moreover, the learned functions yield finite memory (Lemma 1). In short, we derive a tunable tradeoff between memory and dynamic regret, establishing for the first time global convergence for a universal function class in the non-stationary regime (up to metrics of non-stationarity [besbes]). These results translate into experiments in which one may gracefully address online nonlinear regression and classification problems with non-stationary data, contrasting alternative kernel methods and other state of the art online learning methods.
Ii Non-Stationary Learning
In this section, we clarify details of the loss, metrics of non-stationarity, and RKHS representations that give rise to the derivation of our algorithms in Sec. III. To begin, we assume Tikhonov regularization, i.e., for some convex function , which links these methods to follow the regularized leader in [shalev2012online].
Time-Windowing and Mini-Batching To address when the solutions are correlated across time or allow for multiple samples per time slot, we define several augmentations of loss data-pairs .
(i) Classical loss: and , and the minimization may be performed over a single datum. In other words, the action taken depends only on the present, as in fading wireless communication channel estimation.
The first cost in (II)(ii) for each time index consists previous cost-data pairs and new cost-data pair , where we denote samples in this time window as . simplifies to dynamic regret as in [shen2019random]. (II) is useful for, e.g., obstacle avoidance, where obstacle is correlated with time. Typically, we distinguish between the sampling rate of a system and the rate at which model updates occur. If one takes samples per update, then mini-batching is appropriate, as in (II)(iii) . In this work, we focus windowing in (II)(ii), i.e., . Further, instead of one point at given by , one may allow points , yielding a hybrid of (II)(ii) - (iii). Our approach naturally extends to mini-batching. For simplicity, we focus on . We denote as the component of (II) without regularization.
Metrics of Non-Stationarity With the loss specified, we shift focus to illuminating the challenges of non-stationarity. As mentioned in Sec. I, [besbes] establishes that designing no-regret [cf. (I)] algorithms against dynamic comparators when cost functions change arbitrarily is impossible. Moreover, dynamic regret is shown to be an irreducible function of fundamental quantifiers of the problem dynamics called cost function variation and variable variation, which we now define. Specifically, the cost function variation tracks the largest loss drift across time:
where for all and denote as the class of convex losses bounded by for any set of points . Further define the variable variation as
which quantifies the drift of the optimal function over time . One may interpret (II) and (4) as the distribution-free analogue of mixing conditions in stochastic approximation with dependent noise in [borkar2006stochastic] and reinforcement learning [karmakar2017two]. Then, our goal is to design algorithms whose growth in dynamic regret (I) is sub-linear, up to constant factors depending on the fundamental quantities (II)-(4).
Iii Algorithm Definition
Reproducing Kernel Hilbert Space With the metrics and motivation clear, we detail the function class that defines how decisions are made. As mentioned in Sec. I, we would like one that satisfies universal approximation theorems [park1991universal], i.e., the hypothesis class containing the Bayes optimal [friedman2001elements], while also permitting the derivation of no-regret algorithms through links to convex analysis. RKHSs [berlinet2011reproducing] meet these specifications, and hence we shift to explaining their properties. A RKHS is a Hilbert space equipped with an inner product-like map called a kernel which satisfies
for all . Common choices include the polynomial kernel and the radial basis kernel, i.e., and , respectively, where . For such spaces, the function that minimizes the sum, , over losses satisfies the Representer Theorem [kimeldorf1971some, scholkopfgeneralized]. Specifically, the optimal may be written as a weighted sum of kernels evaluated only at training examples as , where denotes a set of weights. We define the upper index as the model order.
One may substitute this expression into the minimization of to glean two observations from the use of RKHS in online learning: the latest action is a weighted combination of kernel evaluations at previous points, e.g., a mixture of Gaussians or polynomials centered at previous data ; and that the function’s complexity becomes unwieldy as time progresses, since its evaluation involves all past points. Hence, in the sequel, we must control both the growth of regret and function complexity.
Functional Online Gradient Descent Begin with functional online gradient method, akin to [Kivinen2004]:
where the later equality makes use of the definition of [cf. (II)], the chain rule, and the reproducing property of the kernel (5) – see [Kivinen2004]. We define . Step-size is chosen as a small constant – see Section IV. We require that, given , the step-size satisfies and initialization . Given this initialization, one may apply induction and Representer Theorem [scholkopfgeneralized] to write the function at time as a weighted kernel expansion over past data as
In addition, we need to update the last weights over range to :
Observe that (8) causes to have one more column than . Define the model order as number of points (columns) in the data matrix at time . for OGD, growing unbounded.
Model Order Control via Subspace Projection To overcome the aforementioned bottleneck, we propose projecting the OGD sequence (III) onto subspaces defined by some dictionary , i.e., , inspired by [koppel2019parsimonious]. For convenience we have defined , and as the resulting kernel matrix from this dictionary. We ensure parsimony by ensuring .
Rather than allowing model order of to grow in perpetuity [cf. (8)], we project onto subspaces defined by dictionaries extracted from past data. Deferring the selection of for now, we note it has dimension , with . Begin by considering function is parameterized by dictionary and weight vector . Moreover, we denote columns of as for . We propose a projected variant of OGD:
where we define the projection operator onto subspace by the update (III).
Coefficient update The update (III), for a fixed dictionary , implies an update only on coefficients. To illustrate this point, define the online gradient update without projection, given function parameterized by dictionary and coefficients , as This update may be represented using dictionary and weight vector as
and revising last weights with to , yielding the update for coefficients as
For fixed dictionary , the projection (III) is a least-squares problem on coefficients [williams2001using]:
Given that projection of onto subspace for a fixed dictionary is a simple least-squares multiplication, we turn to explaining the selection of the kernel dictionary from past data .
Dictionary Update One way to obtain the dictionary from , as well as the coefficient , is to apply a destructive variant of kernel orthogonal matching pursuit (KOMP) with pre-fitting [Vincent2002][Sec. 2.3] as in [koppel2019parsimonious]. KOMP operates by beginning with full dictionary and sequentially removing columns while the condition holds. The projected FOGD is defined as:
where is the compression budget which dictates how many model points are thrown away during model order reduction. By design, we have , which allows us tune to only keep dictionary elements critical for online descent directions. These details allow one to implement Dynamic Parsimonious Online Learning with Kernels (DynaPOLK) (Algorithm 1) efficiently. Subsequently, we discuss its theoretical and experimental performance.
Iv Balancing Regret and Model Parsimony
In this section, we establish the sublinear growth of dynamic regret of Algorithm 1 up to factors depending on (4) and the compression budget parameter that parameterizes the algorithm. To do so, some conditions on the loss, its gradient, and the data domain are required which we subsequently state.
The feature space is compact, and the reproducing kernel is bounded:
The loss is uniformly -Lipschitz continuous for all :
The loss is convex and differentiable w.r.t. on for all .
The gradient of the loss is Lipschitz continuous with parameter :
for all and .
Assumption 1 and Assumption 3 are standard [Kivinen2004, dai2014scalable]. Assumptions 2 and 4 ensures the instantaneous loss and its derivative are smooth, which is usual for gradient-based optimization [Bertsekas1999], and holds, for instance, for the square, squared-hinge, or logistic losses. Because we are operating under the windowing framework over last losses (II), we define the Lipschitz constant of as and that of its gradient as . Doing so is valid, as the sum of Lipschitz functions is Lipschitz [rudin1964principles].
Before analyzing the regret of Alg. 1, we discern the influence of the learning rate, compression budget, and problem parameters on the model complexity of the function. In particular, we provide a minimax characterization of the number of points in the kernel dictionary in the following lemma, which determines the required complexity for sublinear dynamic regret growth in different contexts.
Let be the function sequence of Algorithm 1 with step-size and compression . Denote as the model order (no. of columns in dictionary ) of . For a Lipschitz Mercer kernel on compact set , there exists a constant s.t. for data , satisfies
Lemma 1 (proof in Appendix A) establishes that the model order of the learned function is lower bounded by the time-horizon and its upper bound depends on the ratio of the step-size to the compression budget, as well as the Lipschitz constant [cf. (16)]. Next, we shift to characterizing the dynamic regret of Algorithm 1. Our first result establishes that the dynamic regret, under appropriate step-size and compression budget selection, grows sublinearly up to a factor that depends on a batch parameter and the cost function variation (II), and that the model complexity also remains moderate. This result extends [besbes][Proposition 2] to nonparametric settings.
Denote as the sequence generated by Algorithm 1 run for total iterations partitioned into mini-horizons of length . Over mini-horizons, Algorithm 1 is run for steps. Under Assumptions 1-4, the dynamic regret (I) grows with horizon and loss variation (II) as:
which is sublinear for and with mini-horizon , provided . That is, with and , (19) grows sublinearly in and .
Consider the expression for the dynamic regret is given by
Add subtract the term to the right hand side of (20), we obtain
We have utilized the definition of static regret in (97) to obtain (IV). Note that the behavior in terms of static regret of Algorithm 1 is characterized in Theorem 3. To analyze the dynamic regret in terms of , we need to study the different between the static optimal and dynamic optimal given by the second term on the right hand side of (IV). The difference between the two benchmarks (static and dynamic) is determined by the size of and fundamental quantifiers of non-stationarity defined in Section II . To connect (IV) with the loss function variation, following [besbes], we split the interval into equal size batches with each of size except the last batch given by for where . We can rewrite the expression in (IV) as follows
where we define for all , and note that the outer sum over indexes the batch number, whereas inner one indexes elements of a particular batch . The expression for the dynamic regret in (IV) is decomposed into two sums. Note that the first sum represents the sum of the regrets against a single batch action for each batch . The second term in (IV) quantifies the non-stationarity of the optimizer: it is a sum over differences between the best action over batch and corresponding dynamic optimal actions. Next, we bound the each term on the right hand side of (IV) separately. From the static regret in (111), it holds that
for all . To upper bound the term in (IV) associated with non-stationarity, i.e., the second term on the right-hand side, by definition of the minimum, we have
Next, we need to upper bound the right hand side of (25) in terms if the loss function variation budget . To do that, let us first define the loss function variation over each batch as follows
and note that . With this definition, we now show that
by contradiction. Let us assume that the inequality in (27) in not true which means that there is at least one epoch, say , for which the following property is valid:
Since is the maximal variation for batch , it holds that
for all . The second inequality in (IV) holds by dropping the negative terms. We note that the inequality in (IV) is a contradiction for , since a positive number cannot be less than itself. Therefore, the hypothesis in (28) is invalid, which implies that (27) holds true. Next, we utilize the upper bound in (27) to the right hand side of (25), we get
Suppose we make the parameter selections
with . Then the right-hand side of (IV) takes the form
with model order by substituting (33) into the result of Lemma 1. For the dynamic regret to be sublinear, we need and . As long as the dimension is not too large, we always have a range for . This implies that and hence is sublinear. One specification of that satisfies this range is and , as stated in Theorem 1. We obtain the result presented in Table I for the selection and . ∎
The batch parameter tunes static versus non-stationary performance: for large , then the algorithm attains smaller regret with respect to the static oracle, i.e., the first terms on the right-hand side of (19), but worse in terms of the non-stationarity as quantified by function variation , the last term. On the other hand, if the batch size is smaller, we do worse in terms of static regret terms but better in terms of non-stationarity. This contrasts with the parametric setting as well [besbes]: the term appears due to the compression-induced error.
Up to now, we quantified algorithm performance by loss variation (II); however, this is only a surrogate for the performance of the sequence of time-varying optimizers (4), which is fundamental in time-varying optimization [simonetto2016class, simonetto2017time], and may be traced to functions of bounded variation in real analysis [rudin1964principles]. Thus, we shift focus to analyzing Algorithm 1 in terms of this fundamental performance metric.
First, we note that the path length (4) is unique when losses are strongly convex. On the other hand, when costs are non-strongly convex, then (4) defines a set of optimizers. Thus, these cases must be treated separately. First, we introduce an assumption used in the second part of the upcoming theorem.
The instantaneous loss is strongly convex with parameter :
for all and any functions .
With the technical setting clarified, we may now present the main theorem regarding dynamic regret in terms of path length (4).
Denote as the function sequence generated by Algorithm 1 run for iterations. Under Assumptions 1-4, with regularization the following dynamic regret bounds hold in terms of path length (4) and compression budget :
when costs are convex, regret is sublinear with and for any with , we have
Alternatively, if the cost functions are strongly convex, i.e., Assumption 5 holds, with and for any with , we have
where is a contraction constant for a given .
From the inequality in (B-A), we have
For a Lipschitz continuous gradient function [Assumption 4] with , we have
which implies that