Inertial Regularization and Selection (IRS)—Sequential Regression in High-Dimension and Sparsity


Abstract

In this paper, we develop a new sequential regression modeling approach for data streams. Data streams are commonly found around us; e.g., in a retail enterprise, sales data is collected continuously every day. A demand forecasting model is an important outcome from such data and needs to be continuously updated with the new incoming data. The main challenges in such modeling arise when there is a) high dimensionality and sparsity, b) a need for adaptive use of prior knowledge, and/or c) structural change in the system. The proposed approach addresses these challenges by incorporating an adaptive L1 penalty and an inertia term in the loss function, and is thus called Inertial Regularization and Selection (IRS). The former term performs model selection to handle the first challenge, while the latter is shown to address the last two. A recursive estimation algorithm is developed and shown to outperform commonly used state-space models, such as Kalman filters, in experimental studies and on real data.



Chitta Ranjan
Georgia Institute of Technology
Atlanta, U.S.A.

nk.chitta.ranjan@gatech.edu
Samaneh Ebrahimi
Georgia Institute of Technology
Atlanta, U.S.A.

samaneh.ebrahimi@gatech.edu
Kamran Paynabar
Georgia Institute of Technology
Atlanta, U.S.A.

kamran.paynabar@gatech.edu


  • CCS Concepts: Computing methodologies → Machine learning algorithms

    • Keywords: regression; data stream; high-dimensional

      In recent times, huge amounts of data are collected every day for most statistical learning processes. For example, Twitter, Facebook, Walmart, and weather services generate tremendous amounts of data every day, which accumulate into very large datasets over time. Such processes are sometimes called data-streaming processes. Modeling an entire data stream at once is computationally challenging. Although methods like stochastic gradient descent have been shown to work on such large data, they fail when the process itself is also changing over time. We call such changing systems evolving.

      Take, for example, the behavior of users on YouTube, which evolves over time. YouTube collects the usage data, which should be used to continuously model and update its recommender system. Another example is found in retailers' demand forecasting models. At Walmart, for instance, several thousand products are sold, and for each product type, new brands are introduced or replaced every day. The demands of all these products interact with each other. The model should, thus, be continuously updated to accommodate the dynamic changes in the demand system.

      Sequential modeling of such processes can alleviate the challenges posed by the dynamic changes and evolution in the system. Moreover, it significantly reduces the computational complexity. Several researchers have worked in this direction to develop methods for specific problems, e.g., evolutionary recommendation systems [?].

      Broadly, existing methods use either data pruning or sequential models. A data-pruning approach keeps only the most recent data, either using a moving window or by down-weighting data according to its age. These methods are, however, ad hoc, with window sizes or weight functions that are difficult to optimize. Moreover, they do not appropriately use the model information learnt so far.

      Sequential models, on the other hand, are better at using past information about the model and updating it with the newly observed data. However, the commonly used sequential methods, like the Kalman filter and the particle filter, are not suitable for high-dimensional data. Moreover, they are not capable of optimally selecting a sub-model when there is sparsity.

      Moreover, the problems mentioned above are high-dimensional and sparse. The following are the inherent challenges in the problem at hand:

      1. Resemblance: Upon updating the model with new data, any abrupt model change can be difficult to interpret. For example, suppose that in the previous model the predictor age is significant and sex is insignificant, but after an update in the current epoch their significance is reversed. Such abrupt changes can easily occur due to multicollinearity and the often suboptimal search of the objective space, which admits many possible solutions. However, interpreting abruptly changing solutions is difficult and undesirable.

      2. Computation: Since the accumulated data grows rapidly over time, modeling all of it is computationally impractical. An effective sequential modeling approach is required to break the computation into smaller parts.

      3. Robustness (to noise and missing values): Some variables may be unobserved (or scantily observed) in certain epochs. For example, at Walmart several products are added, removed, or suspended for some time every day. It is still essential to keep track of these dynamic changes and handle them appropriately in the model. Also, a large amount of noise can occur in an epoch (due to some external event); for example, someone else using your Netflix account for a short while can throw your recommendations off.

      4. Accuracy: Obtaining high prediction accuracy in a changing and high-dimensional process is difficult. In particular, overfitting is a common issue when modeling a high-dimensional sparse process with poor model selection, leading to high test error.

      It is, therefore, important to develop a new method that can sequentially model a high-dimensional data-streaming process with inherent sparsity and evolution. To that end, we propose a method called Inertial Regularization and Selection (IRS). Similar to state-space models, IRS models the data sequentially over successive time epochs. A time epoch can be a day, a week, or any other time interval at which the model is required to be updated.

      IRS has an inertia component in its loss function. This inertia component resists abrupt model changes, enforcing resemblance with the model learnt in earlier epochs. It also enables IRS to use the prior knowledge learnt about the model so far, together with the current epoch's data, to yield an accurate model, and it keeps the model smooth and robust to intermittent noisy observations. Besides, unlike other state-space models, IRS works better on high-dimensional and sparse problems due to its automatic, optimal model-selection property. An optimal sparse model estimate prevents overfitting and thus lowers test prediction errors.

      IRS is a domain-agnostic (generic) methodology that can handle large or small statistical learning problems with different specifications. In this paper, it is developed for linear relationships and Gaussian errors; however, it can easily be extended to non-linear scenarios using a link function. We show existing related methods, viz. the Kalman filter, the (adaptive) Lasso, and Bayesian models, to be special cases of IRS. These models, however, have certain critical limitations, especially in their ability to handle high dimensions, their use of past information (sequential modeling), and their computation, discussed further in the related-work section below, which IRS overcomes. Through the development of IRS, this paper also shows the connection between Kalman filters, regularization methods, and Bayesian frameworks. As a by-product, we also provide an estimation procedure for the Kalman filter that is significantly faster to compute in the usual case where the data size is larger than the number of predictors.

      In the remainder of this paper, we first review the related work in this area. Thereafter, we present the development of IRS: its definition and elaboration, followed by the construction of its regularization parameters and the equivalence of IRS with other methods as special cases. Further, we present a closed-form solution under orthogonality conditions and an iterative solution for the general case, followed by an algorithm based on the iterative solution and a discussion of the computational and implementation aspects of IRS. Furthermore, we experimentally validate IRS's efficacy and show its application to real online retail data. Finally, we discuss the paper and conclude.

      Sequential modeling for statistical learning, specifically in the context of regression, was attempted by Gauss (1777–1855) more than a century ago. However, his method, called recursive least-squares (RLS) estimation, came to light only several years later after its rediscovery by [?]. RLS estimates the regression parameters of a model with the available data and updates them when new data is observed. However, it works only if the model parameters are expected to be stable, i.e., the model is not evolving. Moreover, since the entire data collected up to the present is used, the effect of successive new data on the model starts to diminish. To account for this, rolling regression, a data-pruning approach that keeps only a fixed window of data for modeling, can be used; a gradual discounting of old data with a decaying function is also used. Nevertheless, RLS still suffers from its inability to effectively model an evolving process and to handle high-dimensional data.

      To better account for model evolution, the Kalman filter (KF) was developed [?]. KF is a sequential learning procedure in which the model at any epoch is estimated as a weighted average of its prediction from the previous state and the estimate from the new data. For consistent estimation, KF incorporates the model's covariance in the re-estimation (the weights in the weighted average are proportional to this covariance), but it is not designed to handle high-dimensional data. Other generalizations of KF, viz. the extended Kalman filter and the unscented Kalman filter [?], were developed for nonlinear systems, but they also lack the ability to model high-dimensional data.

      The ensemble Kalman filter (EnKF) [?] was developed for high-dimensional situations. It is a combination of sequential Monte Carlo and the Kalman filter. However, EnKFs focus mainly on a proper approximation of the model's covariance matrix in situations of high dimension but little data (via Monte Carlo approximation). Here an ensemble is a collection of model realizations from different Monte Carlo generated data replications, from which the model covariance is approximated by the sample covariance. As can be noted, this does not address the issue of sparse model selection under high dimensionality. Some "localization" techniques are sometimes used for dimension reduction [?]; for example, in geophysics, the model is decomposed into many overlapping local patches based on their physical location, filtering is performed on each local patch, and the local updates are combined to obtain the complete model. However, this technique is not easy to generalize to arbitrary models.

      Particle filters (PF) [?] are another class of methods for state-space modeling under nonlinearity and are usually seen as an alternative to KFs in nonlinear scenarios. However, due to their nonparametric nature, they severely suffer from the "curse of dimensionality" even for moderately large dimensions [?]. [?] developed an ensemble filtering method by combining the EnKF and PF for high-dimensional settings. They used a "localization" technique similar to that described above, which limits its application.

      State-space models, specifically the Kalman filter, have been extensively developed in the field of signal processing, where the Kalman filter has been extended to handle high dimensionality and sparsity in signals. [?] proposed Kalman-filtered compressed sensing, which was further extended in [?], who directly modified the Kalman filter equations using compressed sensing and quasi-norm constraints to account for sparsity. However, these methods lack theoretical support and have limited application. [?, ?] showed different signal-regularization possibilities for different signal-dynamics scenarios. They also showed an extension of [?] and [?] by adding an L1 penalty to the optimization objective. While this is close to our objective, they used fixed penalty parameters and did not develop the methodology further. Besides, [?] proposed a special Bayesian formulation with Gamma prior distributions that suits the signal problems addressed in that paper but is not generically applicable. Moreover, similar to a fused-lasso setup, an L1 penalty to smooth the dynamically changing system, together with L1 regularization for sparsity, was also proposed by them and [?]. However, these methods are difficult to optimize and do not guarantee convergence from an erroneous initialization to a steady-state estimation error.

      In contrast to the existing methods, the proposed IRS easily handles linear and nonlinear (with a linear link function) models in high-dimensional spaces. As we show in this paper, it provides an efficient sequential modeling approach to accurately model such problems.

      The following defines an evolutionary model for a data streaming process.

      $y_t = X_t\beta_t + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, V_t)$   (1)
      $\beta_t = A_t\beta_{t-1} + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, Q_t)$   (2)

      Here $t$ denotes a time epoch for modeling. An epoch can be a day, a week, or any other time span during which the model state is assumed not to change. The observed data at $t$ is denoted by $X_t$ for all independent variables (also referred to as predictors) and $y_t$ is the response. Eq. (1) is a conventional statistical model in which the response is a linear function of the predictors. $\beta_t$ is the model parameter and $\varepsilon_t$ is an error term from a normal distribution with mean 0 and covariance $V_t$, i.e. $\varepsilon_t \sim \mathcal{N}(0, V_t)$. Besides, $\beta_t$ also specifies the model state at epoch $t$.

      While Eq. (1) models the response, Eq. (2) captures any change in the state. Often these state changes are small and/or smooth (gradual). As mentioned before, these changes are called evolution. Eq. (2) also has a linear structure, where $A_t$ is called the state-transition parameter and $\eta_t$ is the state-change process noise, assumed to be $\mathcal{N}(0, Q_t)$. Besides, in the rest of the paper we refer to Eq. (2) as the state model and Eq. (1) as the response model.

      The data sample size and the number of predictor variables at $t$ are denoted by $n_t$ and $p_t$, respectively. Thus, $X_t \in \mathbb{R}^{n_t \times p_t}$ and $y_t \in \mathbb{R}^{n_t}$. Additionally, the covariance of the state estimate $\hat{\beta}_t$ is denoted by $P_t$.
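      To make this setup concrete, the following is a minimal Python sketch of a data stream generated from the state model and response model above; the dimensions, noise levels, and sparsity level are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
p, T = 10, 5                      # number of predictors, number of epochs (assumed)
A = np.eye(p)                     # state-transition matrix A_t (assumed constant, identity)
Q = 0.01 * np.eye(p)              # state-noise covariance Q_t (assumed)
sigma2 = 1.0                      # response-noise variance (assumed i.i.d. errors)

beta = rng.normal(size=p)         # initial model state
beta[rng.random(p) < 0.8] = 0.0   # impose sparsity (assumed sparsity level)

stream = []
for t in range(T):
    # State model: beta_t = A_t beta_{t-1} + eta_t,  eta_t ~ N(0, Q_t)
    beta = A @ beta + rng.multivariate_normal(np.zeros(p), Q)
    n_t = int(rng.integers(50, 100))   # epoch sample size (assumed range)
    X_t = rng.normal(size=(n_t, p))
    # Response model: y_t = X_t beta_t + eps_t,  eps_t ~ N(0, sigma^2 I)
    y_t = X_t @ beta + rng.normal(scale=np.sqrt(sigma2), size=n_t)
    stream.append((X_t, y_t))
```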

      The objective of IRS is to sequentially learn a statistical model from streaming data in a high-dimensional process. To that end, the following objective function is proposed; its minimization yields an optimal model estimate at any epoch $t$.

      $\min_{\beta_t}\; \frac{1}{n_t}(y_t - X_t\beta_t)^\top V_t^{-1}(y_t - X_t\beta_t) + \frac{\lambda_2}{p_t}(\beta_t - A_t\beta_{t-1})^\top Q_t^{-1}(\beta_t - A_t\beta_{t-1}) + \frac{\lambda_1}{p_t}\sum_{j=1}^{p_t} w_{tj}\,|\beta_{tj}|$   (3)

      The above objective comprises three components:

      a) The first term is a sum of residual squares weighted by the inverse residual-error covariance, $V_t^{-1}$.

      b) The second term is an $L_2$-norm regularization that penalizes any drastic change in the model estimate. A change here means the difference between the model state at $t$ and the expected model state at $t$ given the state at $t-1$. The second term is thus called the inertia component because it resists drastic changes and, in effect, keeps the updated model estimate close to the anticipated model derived from the past model. This inertia also resists changes due to noisy observations. These properties are a result of using $P_{t|t-1}^{-1}$ as an adaptive weight for an effective model update, which also helps the inertia component differentiate an abnormal observation from noise (discussed further in the regularization-parameters discussion and the experiments below).

      c) The third term is an $L_1$-norm penalty for an optimal model selection. Although the penalty is imposed "spatially", i.e. only on the model parameters at epoch $t$, the selected model will still resemble the models selected in the past. This is in part due to the inertia component and the adaptive penalty weights $w_{tj}$. This aspect is further elaborated in the discussion of the regularization parameters below.

      In Eq. (3), $\beta_{t-1}$ is unknown, but it is shown in Appendix-A that it suffices to use the previous epoch's model estimate instead of the true value. This is a founding property of IRS that allows sequential (online) modeling. $\beta_{t-1}$ can thus be replaced with its estimate $\hat{\beta}_{t-1}$. For a succinct notation, we denote $\hat{\beta}_{t|t-1} = A_t\hat{\beta}_{t-1}$, the estimated model state at $t$ given $t-1$. Similarly, the expected covariance of $\hat{\beta}_{t|t-1}$, denoted by $P_{t|t-1}$, is derived from the previous state's covariance as $P_{t|t-1} = A_tP_{t-1}A_t^\top + Q_t$. Likewise, $Q_t$ in the inertia term is replaced by $P_{t|t-1}$. The resultant online objective function for sequential modeling is thus,

      $\min_{\beta_t}\; \frac{1}{n_t}(y_t - X_t\beta_t)^\top V_t^{-1}(y_t - X_t\beta_t) + \frac{\lambda_2}{p_t}(\beta_t - \hat{\beta}_{t|t-1})^\top P_{t|t-1}^{-1}(\beta_t - \hat{\beta}_{t|t-1}) + \frac{\lambda_1}{p_t}\sum_{j=1}^{p_t} w_{tj}\,|\beta_{tj}|$   (4)

      Besides, the hyperparameters $\lambda_1$ and $\lambda_2$ regulate the amount of penalty imposed for limiting the model size (sparsity) and for closeness to the expected model given the past, respectively. $\lambda_1$ and $\lambda_2$ are later shown to be independent of the epoch unless there is a true drastic change in the system. This is an important property because it removes the computational overhead of searching for optimal $\lambda_1$ and $\lambda_2$ at every epoch. Besides, each term in the objective is normalized by its size, $n_t$ or $p_t$, to this end.
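      As a reading aid, the sketch below evaluates the IRS objective for a candidate $\beta_t$. The variable names, the placement of the $1/n_t$ and $1/p_t$ normalizations, and the choice of adaptive weights are our assumptions based on the description above, not code from the paper.

```python
import numpy as np

def irs_loss(beta, X, y, V_inv, beta_pred, P_pred_inv, w, lam1, lam2):
    """Sketch of the IRS objective at epoch t (normalizations assumed).

    beta       : candidate model state beta_t
    X, y       : epoch-t data
    V_inv      : inverse residual-error covariance
    beta_pred  : expected state given the past (A_t times the previous estimate)
    P_pred_inv : inverse of the predicted state covariance (inertia weight)
    w          : adaptive L1 weights, e.g. 1 / |inertia-only estimate|
    lam1, lam2 : selection and inertia penalty parameters
    """
    n, p = X.shape
    r = y - X @ beta
    fit = (r @ V_inv @ r) / n                          # scaled residual sum of squares
    d = beta - beta_pred
    inertia = lam2 * (d @ P_pred_inv @ d) / p          # resists drastic model change
    selection = lam1 * np.sum(w * np.abs(beta)) / p    # adaptive L1 model selection
    return fit + inertia + selection
```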

      On another note, Eq. (4) bears a close resemblance to the Elastic-net formulation (Zou and Hastie, 2005). Their difference, illustrated graphically in the figure below (panels (a) and (b)) for a simple two-dimensional model, further explains and highlights the differentiating properties of IRS. As shown in the figure, the penalty term in the Elastic-net is centered around 0, while IRS's regularization is centered around $\hat{\beta}_{t|t-1}$, the model estimate given the past, i.e. the prior knowledge. Moreover, IRS attempts to find a sparse model that also matches the prior knowledge. This reinstates the earlier claim that, despite a spatial penalty, the resultant model resembles the model derived from the past.


      (a) Elastic-net; (b) Inertial Regularization and Selection (IRS)
      Figure: Graphical illustration of the proposed Inertial Regularization and Selection (IRS), contrasted with the Elastic-net.

      In a sequential regularization model, it is critical to update the regularization weights in a data-driven manner; otherwise, a repeated regularization optimization at every epoch is computationally cumbersome. We therefore use estimates of the inertia weight and the adaptive $L_1$ penalty that are derived from the data and have the necessary properties, discussed below, for an accurate model estimation.

      For the inertia part, we want $\beta_t$ close to its expected value given $\beta_{t-1}$, i.e. $\hat{\beta}_{t|t-1}$. But the extent of regularization should differ across the components of $\beta_t$. A component with a higher expected variance should be regularized less, because a high variance indicates that its estimate from the past is not reliable; it should thus be learnt more from the recent data of epoch $t$ than kept close to the past. The opposite holds if the expected variance is small: then the component should be kept close to the reliable past knowledge. Therefore, the inertia should be weighted by the inverse of the variance. Besides, since the correlations between the model parameters should also be taken into account, to adjust for changes in correlated parameters, we use $P_{t|t-1}^{-1}$ as the adaptive inertia weight.

      In addition to the above intuition, $P_{t|t-1}^{-1}$ also stabilizes the inertial component. The inertial component can be regarded as a sum of residual squares from Eq. (2), i.e. we aim to minimize the residuals of the state model. This residual sum of squares should, therefore, be scaled by its covariance, which is $P_{t|t-1}$.

      Besides, an adaptive penalty is applied to the $L_1$ norm for model selection. As suggested in [?], the inverse of a reasonable estimate of $\beta_t$ provides a good adaptive penalty for the $L_1$ norm. [?] suggested the OLS estimate as a good adaptive penalty for a Lasso problem, as it brings estimation consistency and guarantees oracle properties. As shown later, the IRS objective function in Eq. (4) is equivalent to a Lasso problem. From this Lasso equivalence, it can be seen that the OLS estimate desired for the adaptive penalty is the same as the estimate regularized with only the inertia term, denoted here as $\tilde{\beta}_t$. The expression for $\tilde{\beta}_t$ is derived in Appendix-B. We therefore use $1/|\tilde{\beta}_{tj}|$ as the adaptive penalty weight for the $L_1$ regularization of $\beta_{tj}$.
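      A small sketch of how the two adaptive quantities described above could be formed in code. The inertia-only estimate is written here as a generalized ridge expression that we assume for illustration; the paper's exact expression is derived in its Appendix-B.

```python
import numpy as np

def inertia_only_estimate(X, y, V_inv, beta_pred, P_pred_inv, lam2):
    # Ridge-type estimate with only the inertia penalty active (form assumed):
    # argmin_b (1/n)||y - X b||^2_{V^-1} + (lam2/p)||b - beta_pred||^2_{P^-1}
    n, p = X.shape
    H = X.T @ V_inv @ X / n + lam2 * P_pred_inv / p
    b = X.T @ V_inv @ y / n + lam2 * P_pred_inv @ beta_pred / p
    return np.linalg.solve(H, b)

def adaptive_l1_weights(beta_inertia, eps=1e-8):
    # Larger penalty on coefficients whose inertia-only estimate is near zero.
    return 1.0 / (np.abs(beta_inertia) + eps)
```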

      Here we show a Lasso-equivalent formulation of IRS. This is an important property because it means IRS can inherit from the well-developed pool of Lasso-based techniques. In fact, the use of $1/|\tilde{\beta}_{tj}|$ as the adaptive penalty for model selection is drawn from this Lasso equivalence. Besides, the estimation algorithm for IRS developed below is also inspired by Lasso solutions.

      The following shows the Lasso equivalence using a transformation and augmentation of the data $X_t$ and $y_t$.

      (5)

      Using the transformed data in Eq. (5), we can rewrite the objective function in Eq. (4) as a Lasso problem, as given below (see Appendix-C for the proof).

      (6)
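      The sketch below shows one standard augmentation that is consistent with the description of Eqs. (5)-(6): whiten the data by $V_t^{-1/2}$, stack rows that encode the inertia penalty, and what remains is an ordinary adaptive-Lasso problem. The scaling factors reflect our assumed normalization and are not taken from the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def lasso_equivalent_data(X, y, V_inv, beta_pred, P_pred_inv, lam2):
    """Augment (X, y) so that the IRS objective becomes an adaptive Lasso (sketch).

    With the returned (X_star, y_star), the quadratic part of the IRS objective
    equals ||y_star - X_star b||^2 under the assumed normalizations, leaving only
    the adaptive L1 term to be added.
    """
    n, p = X.shape
    V_inv_half = np.real(sqrtm(V_inv))        # V_t^{-1/2}
    P_inv_half = np.real(sqrtm(P_pred_inv))   # inverse square root of the predicted covariance
    X_star = np.vstack([V_inv_half @ X / np.sqrt(n),
                        np.sqrt(lam2 / p) * P_inv_half])
    y_star = np.concatenate([V_inv_half @ y / np.sqrt(n),
                             np.sqrt(lam2 / p) * (P_inv_half @ beta_pred)])
    return X_star, y_star
```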

      The Kalman Filter is found to be a special case of IRS. This finding highlights the commonality between filtering methods and regularization.

      When the selection penalty is dropped ($\lambda_1 = 0$), IRS becomes equivalent to a more general Kalman filter with a weight-adjusted state update, i.e. a weighted average of the state estimate from the past and the Kalman estimate from the current data. The following shows that the IRS estimate expression equals a Kalman filter expression (see Appendix-D for the proof).

      (7)

      where $K_t = P_{t|t-1}X_t^\top(X_tP_{t|t-1}X_t^\top + V_t)^{-1}$ is known as the Kalman gain. As can be seen in Eq. (7), the expression for the estimate is the same as that of a conventional (unweighted) Kalman filter when $\lambda_2 = 1$, and a weighted Kalman filter when $\lambda_2$ takes any other positive value. Besides, when $\lambda_2 = 1$, the covariance of $\hat{\beta}_t$ can be shown to be the same as in the Kalman filter.

      (8)

      It can be noted that, for estimating $\hat{\beta}_t$ and its covariance, the Kalman filter expressions require the inverse of $(X_tP_{t|t-1}X_t^\top + V_t)$. This, however, is an $n_t \times n_t$ matrix, and inverting it can be computationally expensive when the data size $n_t$ is large. We provide an implementation of the Kalman filter in Appendix-D that is significantly faster in the usual scenario of $n_t \gg p_t$.
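      The sketch below shows the standard information-form computation that such a speed-up can rely on (we assume this is the idea behind the Appendix-D implementation): by the Woodbury identity, the measurement update can be computed by inverting a $p_t \times p_t$ matrix instead of the $n_t \times n_t$ innovation covariance.

```python
import numpy as np

def kf_update_fast(beta_pred, P_pred, X, y, sigma2):
    """Kalman measurement update avoiding the n x n inverse (sketch, i.i.d. noise assumed).

    Standard form:     K = P X' (X P X' + sigma2 I_n)^{-1}      -> invert n x n
    Information form:  S = (P^{-1} + X' X / sigma2)^{-1}        -> invert p x p
                       beta = S (P^{-1} beta_pred + X' y / sigma2)
    The two are algebraically identical by the Woodbury identity.
    """
    P_pred_inv = np.linalg.inv(P_pred)
    S = np.linalg.inv(P_pred_inv + X.T @ X / sigma2)   # posterior covariance, p x p
    beta_post = S @ (P_pred_inv @ beta_pred + X.T @ y / sigma2)
    return beta_post, S
```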

      The Bayesian formulation of IRS has the following prior for $\beta_t$,

      (9)

      This prior is the product of a double-exponential (Laplace) distribution for the $L_1$ penalty and a Gaussian centered around $\hat{\beta}_{t|t-1}$ and scaled by the covariance $P_{t|t-1}$ for the inertial penalty. The parameters can be estimated as the Bayes posterior mode (see Appendix-A).

      Due to the $L_1$-norm term in the loss function, it is difficult to obtain an exact estimate of the model covariance $P_t$. Bootstrapping is a numerical way to estimate it, but it can be time consuming. Therefore, we propose an approximate closed-form estimate, similar to the one proposed in [?]. There, the $L_1$ penalty $|\beta_j|$ was replaced with the quadratic surrogate $\beta_j^2/|\hat{\beta}_j|$, and the covariance matrix was approximated at the lasso estimate by a ridge-regression expression.

      This estimation, however, gives a zero variance estimate if $\hat{\beta}_{tj}$ equals 0. This is not desired in IRS, because a variance estimate for each $\beta_{tj}$ is necessary for the inertial component. We therefore use a slightly modified covariance expression in which $|\hat{\beta}_{tj}|$ is replaced by a strictly positive counterpart, so that the variance estimate does not collapse to zero. The approximate covariance estimate is given as,

      (10)

      where,

      (11)
      (12)

      and .

      In this section, we first develop a closed-form solution under orthogonality conditions and then present an iterative proximal-gradient estimation method for the general case. An algorithm implementing IRS is then laid out, along with a discussion of its various computational and implementation aspects.

      The developed estimation technique has a closed-form solution if (a) the predictors are orthonormal, i.e. $X_t^\top X_t = I$, (b) the errors of the response model are i.i.d., $V_t = \sigma_t^2 I$, and (c) the error covariance of the state model is i.i.d., $Q_t = q_tI$, which is often implied when (a) holds.

      Restating the orthogonality conditions in symbols: a) $X_t^\top X_t = I$, b) $V_t = \sigma_t^2 I$, and c) $Q_t = q_tI$. Under these conditions, $X_t^\top X_t$, and in turn $P_{t|t-1}$, will be diagonal matrices. Suppose we denote $P_{t|t-1} = \mathrm{diag}(\nu_{t1}, \ldots, \nu_{tp_t})$, where $\nu_{tj}$ is the expected variance of the $j$-th component of $\hat{\beta}_{t|t-1}$.

      Plugging these conditions into Eq. 22, we get the OLS estimate expression as,

      (13)

      where, and .

      We will use this expression to derive the IRS closed-form solution under orthogonality. From the Lasso-equivalent formulation of IRS, the parameter estimate can be computed by minimizing Eq. (6), i.e.,

      (14)
      (15)


      On solving this for both sign cases of the coefficient estimate, we get the following solution,

      (16)

      where $S(z, \gamma) = \mathrm{sign}(z)\,\max(|z| - \gamma, 0)$ is the soft-thresholding function.

      The components of the solution expression (Eq. 16) can be interpreted as a shrinkage of the inertia-regularized (OLS) parameter estimate, $\tilde{\beta}_{tj}$, where the amount of shrinkage is determined by a) the magnitude of $\tilde{\beta}_{tj}$, for an adaptive variable selection arising from the $L_1$ penalty, and b) a factor proportional to the expected variance of this estimate; thus, IRS strongly shrinks a coefficient whose expected variance is high.

      Therefore, parameters with high variance and low magnitude will be removed. This variance element boosts the model-selection capability, leading to a better selected model and lower prediction errors.
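      The sketch below illustrates the generic form such a closed-form solution takes: a componentwise soft-threshold applied to the inertia-regularized estimate, with a threshold that grows with the adaptive weight and the component's expected variance. The exact threshold constant in Eq. (16) is not reproduced here, so the one below is our assumption.

```python
import numpy as np

def soft_threshold(z, thr):
    # S(z, thr) = sign(z) * max(|z| - thr, 0)
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def irs_orthogonal_estimate(beta_inertia, var_inertia, w, lam1, p):
    """Closed-form IRS estimate under orthogonality (form assumed, constants illustrative).

    beta_inertia : inertia-regularized estimate (the 'OLS' estimate in the text)
    var_inertia  : its expected componentwise variance
    w            : adaptive L1 weights (e.g. 1 / |beta_inertia|)
    """
    thr = (lam1 / p) * w * var_inertia   # stronger shrinkage for high-variance, small coefficients
    return soft_threshold(beta_inertia, thr)
```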

      Since most commonly encountered problems do not satisfy the orthogonality conditions, we develop a proximal gradient method to estimate the parameters in the general case.

      Since the objective is the sum of a differentiable term and a non-smooth $L_1$ term, an iterative proximal gradient algorithm performs the following update,

      (17)

      where $\alpha^{(k)}$ is a step size for iteration $k$ (the step size can be kept the same for all iterations or decreased as $k$ increases) and

      (18)

      Note that any other estimation technique, for example the subgradient method or Least-Angle Regression (LARS), can be used for estimating $\beta_t$. We chose the proximal gradient method because of its ease of understanding and implementation. Besides, it performs well in most circumstances.

      In this section, we present Algorithm 1 for implementing the iterative estimation method discussed above. The algorithm takes the predictor matrices, $X_t$, and the responses, $y_t$, as they are collected for each epoch $t$. The $X_t$'s are standardized within each epoch so that every column of $X_t$ has mean 0 and variance 1. $y_t$, on the other hand, is centered to remove the intercept from the model.

      The algorithm assumes $A_t$ and $Q_t$ as known inputs. In practice, $A_t$ is often set equal to an identity matrix, $I$, and $Q_t$ to $qI$, where $q$ is a small positive number. Besides, the response error, $\varepsilon_t$, in Eq. (1) is assumed to be i.i.d., i.e. $V_t = \sigma_t^2I$, with $\sigma_t^2$ estimated from the data. While these are reasonable assumptions for most applications, the method can easily be extended to circumstances outside of these assumptions.

      As shown in Algorithm-1, the IRS estimation process starts from line-24. To initialize, an OLS estimate is computed, and its residual sum of squares is used to obtain the noise-variance estimate. Besides, the covariance is initialized as the identity. Thereafter, for any $t$, the predicted state $\hat{\beta}_{t|t-1}$, its covariance $P_{t|t-1}$, and the adaptive weights are computed and passed into the ProximalDescent function, along with the data and the regularization parameters $\{\lambda_1, \lambda_2\}$. This function implements the proximal gradient steps shown above; note that it takes precomputed matrix inverses as inputs to save computation. Besides, it is recommended to use a relatively larger data size in the initial epoch to avoid yielding a sub-optimal initial model.

      A common issue in regularization models is tuning the penalty parameters, viz. $\{\lambda_1, \lambda_2\}$ in IRS. It would be even more challenging if IRS required re-tuning them at each estimation epoch. However, due to the adaptive regularization developed above, it can be shown that the expected value of the loss function remains constant, i.e. independent of the epoch $t$ (see the Appendix). This implies that an appropriately chosen $\{\lambda_1, \lambda_2\}$ can be used for any $t$.

      Therefore, for IRS, we can use k-fold cross-validation (CV) with a two-dimensional grid search over $\{\lambda_1, \lambda_2\}$, modeling the first few epochs for each grid point and measuring their prediction errors. We used a 10-fold CV on the first three epochs for this selection. Zou and Hastie (2005) describe this approach in detail and provide other possible selection approaches.
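      A sketch of the two-dimensional grid search with k-fold CV over the first few epochs is given below. The fitting and prediction routines are passed in as placeholders, and the use of scikit-learn's KFold and the rMSE criterion are our choices for illustration.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold

def cv_select_lambdas(epochs, fit_fn, predict_fn, lam1_grid, lam2_grid, k=10):
    """Grid-search (lam1, lam2) by k-fold CV over the first few epochs (sketch).

    epochs     : list of (X_t, y_t) for the initial epochs used for tuning
    fit_fn     : callable(X, y, lam1, lam2) -> model   (placeholder for the IRS fit)
    predict_fn : callable(model, X) -> predictions     (placeholder)
    """
    best, best_err = None, np.inf
    for lam1, lam2 in product(lam1_grid, lam2_grid):
        errs = []
        for X, y in epochs:
            kf = KFold(n_splits=k, shuffle=True, random_state=0)
            for train_idx, test_idx in kf.split(X):
                model = fit_fn(X[train_idx], y[train_idx], lam1, lam2)
                pred = predict_fn(model, X[test_idx])
                errs.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))  # rMSE
        if np.mean(errs) < best_err:
            best, best_err = (lam1, lam2), np.mean(errs)
    return best
```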

      Computational time is another common issue in estimation procedures. The developed proximal-gradient-based algorithm converges quickly, and convergence can be further accelerated by changing the step size at each iteration in lines 11-18. This is common practice; one simple approach is to start with a large step size and gradually decrease it after each iteration, and another is to adaptively increase or decrease it based on an increase or decrease in the residual error. Using this, the proximal gradient was found to converge in fewer than 50 iterations for up to 1000-dimensional models.

      Besides, in implementation, the operations in lines 13-16 can be vectorized, and approximate matrix-inversion methods can be used for faster computation. Furthermore, accelerated and stochastic proximal gradient methods can be used to further speed up the computation.

      1:  
      2:function Prox()
      3:     
      4:     return
      5:end function
      6:
      7:function ProximalDescent()
      9:     
      10:     
      11:     
      12:     
      13:
      14:     Initialize:
      15:     
      16:     while  do
      17:         
      18:         for  do
      19:              
      20:               Prox()
      21:         end for
      22:         
      23:     end while
      24:     
      25:     
      26:     
      27:     return
      28:end function
      29:
      30:Initialize:
      31:     
      32:
      33:
      34:IRS Estimation for,
      35:
      36:
      37:
      38:
      39:
      40:ProximalDescent()
      41: 
      42:Model estimate:
      Algorithm 1 Inertial Regularization and Selection (IRS) estimation
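      The following Python sketch spells out the flow of Algorithm 1 as we read it: a prediction step, an inertia-only (ridge-type) estimate that supplies the adaptive L1 weights, and a proximal-gradient solve of the penalized objective. All formulas, constants, and the covariance update here are our assumptions rather than the paper's exact expressions.

```python
import numpy as np

def soft_threshold(z, thr):
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def irs_epoch(X, y, beta_prev, P_prev, lam1, lam2, A=None, Q=None,
              n_iter=200, eps=1e-8):
    """One IRS epoch via proximal gradient (sketch under assumed forms)."""
    n, p = X.shape
    A = np.eye(p) if A is None else A
    Q = 1e-3 * np.eye(p) if Q is None else Q

    # Prediction step, as in a state-space model (assumed)
    beta_pred = A @ beta_prev
    P_pred = A @ P_prev @ A.T + Q
    P_pred_inv = np.linalg.inv(P_pred)

    sigma2 = max(np.var(y - X @ beta_pred), 1e-6)   # crude noise-variance estimate (assumed)

    # Inertia-only (ridge-type) estimate -> adaptive L1 weights
    H = X.T @ X / (sigma2 * n) + lam2 * P_pred_inv / p
    b = X.T @ y / (sigma2 * n) + lam2 * P_pred_inv @ beta_pred / p
    beta_inertia = np.linalg.solve(H, b)
    w = 1.0 / (np.abs(beta_inertia) + eps)

    # Proximal gradient on: smooth part (quadratic with Hessian H) + (lam1/p) * sum_j w_j |beta_j|
    step = 1.0 / np.linalg.eigvalsh(H).max()        # 1 / Lipschitz constant of the smooth gradient
    beta = beta_inertia.copy()
    for _ in range(n_iter):
        grad = H @ beta - b                         # gradient of the smooth part
        beta = soft_threshold(beta - step * grad, step * lam1 * w / p)

    P_post = np.linalg.inv(H)                       # rough covariance update (assumed)
    return beta, P_post
```

      A stream would then be processed by calling irs_epoch once per epoch, feeding back the returned state and covariance as the next epoch's prior.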

      A beneficial feature of IRS is the ease of handling missing values and changes in the model structure. A change in model structure means the addition or removal of model predictors. In the case of missing values, IRS can still provide a reasonable parameter estimate in a modeling epoch by relying on the prior knowledge.

      If a new predictor variable is added, e.g. a new product at Walmart, it can easily be incorporated into the model by expanding $\hat{\beta}_{t-1}$ and $P_{t-1}$, as sketched below. If there is no expert prior knowledge for the new predictor, $\hat{\beta}_{t-1}$ can be expanded with a 0 value and $P_{t-1}$ with zero off-diagonal entries (no correlation) and a large variance. This is the same as an uninformative prior used in Bayesian estimation. In several cases, however, there is some expert knowledge: in the above Walmart example, if the new product is a new baby diaper, we will have some knowledge of its parameter from similar products, and/or of its correlations with other baby and related products.
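      A sketch of the bookkeeping described above for an added predictor: append a zero prior mean and an uninformative (large-variance, zero-correlation) block to the prior covariance. The variance value is an arbitrary placeholder.

```python
import numpy as np

def add_predictor(beta_prev, P_prev, prior_mean=0.0, prior_var=1e6):
    """Expand the prior state and covariance for one new predictor (sketch).

    With no expert knowledge, the new coefficient gets mean 0, a large variance,
    and zero correlation with the existing coefficients, i.e. an uninformative prior.
    """
    p = beta_prev.shape[0]
    beta_new = np.append(beta_prev, prior_mean)
    P_new = np.zeros((p + 1, p + 1))
    P_new[:p, :p] = P_prev
    P_new[p, p] = prior_var
    return beta_new, P_new
```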

      Besides, IRS handles the removal of a predictor more gracefully than most other conventional methods. The effect of a removed predictor does not vanish immediately; in fact, it takes some time before its effect disappears completely. For example, suppose a TV brand, say XYZ, is removed from a Walmart store. Since XYZ's sales interact with the sales of other TVs, the effect of its absence will be felt for some time (perhaps raising the demand for others) before gradually dying out. In IRS, it is not necessary to manually remove any predictor variable from the model even if we know it has been physically removed. The inertia and selection components of IRS automatically update the resultant effect and eventually remove the predictor variable from the model once its effect has declined completely.

      It is worthwhile to note that this property of IRS also handles situations where the removal of the predictor is temporary. In the above Walmart example, a reintroduction of the TV brand XYZ can easily be incorporated in the model updates, with the effect of reintroduction being the opposite of removal: the effect gradually increases in the model.

      This ability to easily handle missing values and structural changes in the model is a valuable property because it makes the sequential modeling more robust to real-world scenarios.

      In this section, we experimentally validate the efficacy of IRS. We compare it with the Lasso, the Kalman Filter (KF), and the Ensemble KF (EnKF). The Lasso is applied locally on each epoch to provide a prediction-error baseline when the prior knowledge is ignored in the sequential process. KF and EnKF are commonly used state-of-the-art state-space models against which IRS's performance is gauged.

      We set up two types of experiments for validation, denoted Exp-1 and Exp-2. In Exp-1, the number of model parameters (predictors) is kept the same for all epochs. The model parameters are randomly initialized from a multivariate normal distribution with mean 0 and an arbitrary covariance, such that a sparsity level of 0.2 is maintained. As the model moves from epoch $t-1$ to $t$, its parameters change with a random walk of variance 1. The sample size is randomly selected within a fixed range. The data is generated for 9 epochs. A 10-fold CV is performed on the initial three epochs' samples for regularization-parameter selection. Thereafter, a 10-fold CV is again performed to fit each method, and the average test prediction errors (rMSE) are shown in the Exp-1 figure below.

      The figures show that the Lasso has an rMSE similar to that of IRS in the initial epoch. Thereafter, while the Lasso's prediction error remains at approximately the same level, IRS's error decreases. This is due to the use of knowledge from the past when updating the model with the current data. KF and EnKF similarly had decreasing error rates; however, their errors were higher than those of IRS and the Lasso for most epochs. This can be attributed to their inability to work in high-dimensional sparse processes. Besides, their performance appears to worsen as the dimension grows, which may be due to KF and EnKF overfitting the data, resulting in test prediction errors that increase with the dimension.

      In Exp-2, we initialize the model as in Exp-1. For subsequent epochs, however, each parameter in $\beta_t$ is changed as follows: a) if it is 0, it can become non-zero with a small activation probability, its value again drawn from the initial multivariate normal distribution; b) if it is non-zero, it can be deactivated (set equal to 0) with some probability if its magnitude is below a deactivation threshold; c) if it is non-zero and not deactivated, it changes by a random walk, or directionally with added random noise.

      This setup emulates a real-world problem in which an otherwise insignificant variable becomes significant, and vice versa, due to systemic changes. Also, the shift in $\beta_t$ is sometimes directional, indicating an evolution. Regardless, we perform the sequential modeling assuming a pure random-walk state transition ($A_t = I$), to stay in line with common state-space modeling practice.

      In addition to incorporating evolution, we also reduce the data sample size in later epochs to test each method's efficacy in working with growing prior knowledge and less data. A sketch of the simulated parameter evolution is given below.
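      The sketch below implements one epoch of the Exp-2 parameter evolution described above; the activation probability, deactivation threshold, and noise scales are placeholders, since the paper's exact values are not reproduced here.

```python
import numpy as np

def evolve_parameters(beta, rng, p_act=0.05, p_deact=0.5, deact_thr=0.1,
                      walk_sd=1.0, drift=0.0):
    """One epoch of Exp-2 style parameter evolution (values assumed).

    a) an inactive (zero) coefficient activates with small probability p_act;
    b) an active coefficient below deact_thr in magnitude is zeroed with probability p_deact;
    c) otherwise it moves by a random walk (plus an optional directional drift).
    """
    beta = beta.copy()
    for j in range(beta.size):
        if beta[j] == 0.0:
            if rng.random() < p_act:
                beta[j] = rng.normal()                       # re-activate with a fresh draw
        elif abs(beta[j]) < deact_thr and rng.random() < p_deact:
            beta[j] = 0.0                                    # deactivate
        else:
            beta[j] += drift + rng.normal(scale=walk_sd)     # random walk / directional shift
    return beta
```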

      Again, a 10-fold CV is performed and the average rMSEs of the test predictions are reported in the Exp-2 figure below. As expected, the errors increase with passing epochs due to the smaller sample sizes. While in the initial epochs all methods have roughly the same error rate, the error increases sharply for the localized Lasso fit. Between EnKF and KF, the latter's error rate grows faster and surpasses the rMSE of EnKF after a few epochs. This can be attributed to EnKF's better ability to estimate the parameter covariance even with less data (using the Monte Carlo setup). Among all methods, the increase in error rate is lowest for IRS.

      This result shows that IRS is also robust to evolutionary model changes and can work with less data as it garners more prior knowledge. This is an important feature for sequential modeling, indicating the ability to quickly update the model as the system gets older.

      Figure: Exp-1: Average rMSE of test errors from a 10-fold cross-validation (panels (a) and (b)).
      Figure: Exp-2: Average rMSE of test errors from a 10-fold cross-validation (panels (a) and (b)).

      The results in this section show the importance of using past information when sequentially modeling data that arrive in a streaming fashion. For high-dimensional, sparse, and evolving models, IRS is shown to outperform the state-space models popular for such sequential modeling. Next, we show a real-world application of IRS.

      Here we show a real-world application of IRS using retail data from a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts, mostly to wholesalers and some individuals. The company was established in 1981 and relied on direct-mailing catalogues; it went online around 2010, since when it has accumulated a huge amount of data.

      The subset of the data used in this study spans December 2010 to December 2011. The data is a customer-transaction dataset with several variables, namely product name, quantity sold, invoice date-time, unit price, customer-id, and country. For our analysis, we drop the customer-id variable and keep only the data from the UK. From this filtered data, we take a subset of 41 products chosen to span high- to low-selling items.

      The "quantity sold" for a product in a transaction is the response variable, and the model's objective is to predict it using all available predictors. For the predictors, we extract the day-of-week and the quarter-of-day from the invoice date-time. Month is not used because the data covers just one year, so monthly seasonal effects cannot be derived. The other predictors are the unit price and the product name. We create dummy columns for the categorical product-name and day-of-week variables. Finally, the predictor matrix is built from all predictor variables and their second-order interactions; interaction effects are important to consider because the sales of one product usually affect the sales of others. In total, the predictor matrix has 363 columns, that is, we have a 363-dimensional model.
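      A sketch of this kind of feature construction, with illustrative (assumed) column names: derive day-of-week and quarter-of-day from the invoice date-time, dummy-code the categorical variables, and add all second-order interactions via scikit-learn's PolynomialFeatures.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Illustrative transaction data (column names assumed)
df = pd.DataFrame({
    "invoice_datetime": pd.to_datetime(["2011-01-03 09:15", "2011-01-03 14:40"]),
    "product_name": ["MUG_A", "LANTERN_B"],
    "unit_price": [2.55, 7.95],
    "quantity_sold": [6, 2],
})

# Derived seasonal predictors: day-of-week and quarter-of-day
df["day_of_week"] = df["invoice_datetime"].dt.day_name()
df["quarter_of_day"] = df["invoice_datetime"].dt.hour // 6

# Dummy-code the categorical variables
X = pd.get_dummies(df[["product_name", "day_of_week", "quarter_of_day", "unit_price"]],
                   columns=["product_name", "day_of_week", "quarter_of_day"])

# Main effects plus all second-order interactions
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_full = poly.fit_transform(X.astype(float))
y = df["quantity_sold"].to_numpy()
```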

      Each month is treated as a modeling epoch. A 10-fold CV is performed on the first three epochs for the selection of the penalty parameters $\{\lambda_1, \lambda_2\}$, and then IRS is run sequentially on all epochs. The 10-fold CV test prediction errors (mean absolute percentage error, MAPE) in the first figure below show that IRS is more accurate than the others. This is because several predictors in the above model are statistically insignificant, so an effective model selection is required in addition to the use of prior knowledge. However, as mentioned before, KF and EnKF cannot effectively find a parsimonious model and thus have poorer test accuracy.

      In addition, we show the prediction errors for low-selling products in the second figure below. Demand forecasting for low-selling items is usually more difficult because less data is available. As shown in the figure, IRS again outperforms the other models. This can be attributed to IRS's better capability of working with less data.

      Online retail data, such as that used here, is often very high-dimensional due to the several categorical variables and their interactions. Besides, the demand patterns typically have inherent evolutionary characteristics. Note that this evolution is different from seasonality and trend; evolution is a gradual change due to various external factors that are hard to measure. IRS is thus well suited for such problems: it can account for the evolutionary changes, handle high dimensions, and perform automatic model selection.

      Figure: Forecasting accuracy for a randomly selected set of items.
      Figure: Forecasting accuracy for low-selling items.

      In the previous section, we showed IRS's superior prediction accuracy on an online retail data problem. This is an example of a high-dimensional statistical learning problem with sparsity and large amounts of data collected in a streaming fashion, where the process can also evolve. A majority of statistical learning problems found these days have similar properties, for instance sensor data, telecommunications, weblogs, console logs, and user music/movie play history (on services like Spotify and Netflix). Besides, in such problems various variables are added to or removed from the process over time. Under all these fast-paced dynamics, IRS can easily be applied and yield high accuracy.

      In general, IRS's ability to auto-tune its regularization weights from the data makes it easy to implement. Due to this auto-tuning ability, IRS has just two hyperparameters, $\{\lambda_1, \lambda_2\}$, which, if suitably selected, provide an optimal solution for any epoch $t$. However, in the event of an abnormal or extreme system change, they may need to be re-tuned. Nevertheless, under normal conditions IRS provides a seamless sequential estimation for a statistical learning model.

      Besides, the combination of the inertial and $L_1$ penalty terms brings the important properties of resemblance and model selection. Due to the inertia penalty, the model does not change drastically between epochs and thus keeps a resemblance to what an analyst already knows about the model. The $L_1$ penalty provides an appropriate model selection on top of it. It must be noted that the prior knowledge is also incorporated in the model selection through the adaptive $L_1$ penalty; this prevents the penalty from selecting a model that is extremely different from the past one.

      However, a potential downside of the resemblance property is that, if the initial model is severely sub-optimal, the subsequent models can also be poor. In such cases, it may take several epochs for the model estimate to get close to the optimum. To avoid this, it is recommended above to use a larger amount of data in the initial epoch.

      Regardless, IRS is also robust to noise and missing values due to the inertia component. This, along with the optimal model selection via the $L_1$ penalty, significantly boosts the test prediction accuracy, as demonstrated in the experiments and the application above.

      In this paper, we show IRS's equivalence with the Lasso, the Kalman Filter, and Bayesian estimation. This also reveals the interconnection among them for statistical learning. In particular, Kalman filters are often seen as a family of estimators isolated from regularization techniques; here, we establish their connection and also present a faster implementation of the Kalman filter in Appendix-D.

      In summary, we develop a new method, Inertial Regularization and Selection (IRS), for sequentially modeling an evolving statistical learning model. We present the modeling approach for a linear state-change (evolution) model and a linear state-response model with Gaussian errors. However, it can easily be extended to any non-linear model for which a link function transforms it into a linear form; for example, a binary response in a classification model can be modeled using a logit link for the state-response model. The method is shown to outperform existing methods in experiments and on several real-world applications.

      • [1] E. Candes and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3):969, 2007.
      • [2] E. J. Candes, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5-6):877–905, 2008.
      • [3] A. Carmi, P. Gurfil, and D. Kanevsky. Methods for sparse signal recovery using Kalman filtering with embedded pseudo-measurement norms and quasi-norms. IEEE Transactions on Signal Processing, 58(4):2405–2409, 2010.
      • [4] A. Charles, M. S. Asif, J. Romberg, and C. Rozell. Sparsity penalties in dynamical system estimation. In Information Sciences and Systems (CISS), 2011 45th Annual Conference on, pages 1–6. IEEE, 2011.
      • [5] A. S. Charles and C. J. Rozell. Dynamic filtering of sparse signals using reweighted ℓ1. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6451–6455. IEEE, 2013.
      • [6] Y. Ding and X. Li. Time weight collaborative filtering. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 485–492. ACM, 2005.
      • [7] G. Evensen. The ensemble Kalman filter: Theoretical formulation and practical implementation. Ocean Dynamics, 53(4):343–367, 2003.
      • [8] N. J. Gordon, D. J. Salmond, and A. F. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F - Radar and Signal Processing, volume 140, pages 107–113. IET, 1993.
      • [9] S. J. Julier and J. K. Uhlmann. Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3):401–422, 2004.
      • [10] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
      • [11] J. Lei and P. Bickel. Ensemble filtering for high dimensional non-linear state space models. University of California, Berkeley, Rep. 779:23, 2009.
      • [12] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media, 2008.
      • [13] E. Ott, B. R. Hunt, I. Szunyogh, A. V. Zimin, E. J. Kostelich, M. Corazza, E. Kalnay, D. Patil, and J. A. Yorke. A local ensemble Kalman filter for atmospheric data assimilation. Tellus A, 56(5):415–428, 2004.
      • [14] R. L. Plackett. Some theorems in least squares. Biometrika, 37(1/2):149–157, 1950.
      • [15] D. Sejdinovic, C. Andrieu, and R. Piechocki. Bayesian sequential compressed sensing in sparse dynamical systems. In Allerton Conference on Communication, Control, and Computing. Citeseer, 2010.
      • [16] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
      • [17] N. Vaswani. Kalman filtered compressed sensing. In 2008 15th IEEE International Conference on Image Processing, pages 893–896. IEEE, 2008.
      • [18] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

      APPENDIX

      The parameter estimate in the Bayesian formulation can be found from the maximum a posteriori (MAP) objective, as shown below,

      (19)

      A marginalization can be performed on Eq. 19 to find a globally optimal solution at each epoch.

      (20)