A Novel Framework for Online Supervised Learning with Feature Selection

Lizhe Sun (lizhe.sun@stat.fsu.edu)
Department of Statistics
Florida State University
Tallahassee, FL 32306-4330, USA

Yangzi Guo (yguo@math.fsu.edu)
Department of Mathematics
Florida State University
Tallahassee, FL 32306-4330, USA

Adrian Barbu (abarbu@stat.fsu.edu)
Department of Statistics
Florida State University
Tallahassee, FL 32306-4330, USA

Abstract

Current online learning methods suffer from issues such as lower convergence rates and a limited capability to recover the support of the true features compared to their offline counterparts. In this paper, we present a novel framework for online learning based on running averages and introduce a series of online versions of popular offline methods such as Elastic Net, Minimax Concave Penalty and Feature Selection with Annealing. We prove the equivalence between our online methods and their offline counterparts and give theoretical true feature recovery and convergence guarantees for some of them. In contrast to the existing online methods, the proposed methods can extract models with any desired sparsity level at any time. Numerical experiments indicate that our new methods enjoy high accuracy of true feature recovery and a fast convergence rate, compared with standard online and offline algorithms. We also show how the running averages framework can be used for model adaptation in the presence of model drift. Finally, we present some applications to large datasets where again the proposed framework shows competitive results compared to popular online and offline algorithms.

Keywords: online learning, feature selection, model adaptation

1 Introduction

Online learning is one of the most promising approaches to efficiently handle large scale machine learning problems. Nowadays, the datasets from various areas such as bioinformatics, medical imaging and computer vision are rapidly increasing in size, and one often encounters datasets so large that they cannot fit in the computer memory. Online methods are capable of addressing these issues by constructing the model sequentially, one example at a time. A comprehensive survey of the online learning and online optimization literature has been presented in Hazan (2016).

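In this paper, we assume that a sequence of i.i.d. observations $\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_n$ is generated from an unknown distribution, and the goal is to minimize a loss function

$$L(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} f(\mathbf{w}; \mathbf{z}_i) \qquad (1)$$

where $f(\cdot\,; \mathbf{z}_i)$ is a per-example loss function.

In online learning, the coefficient vector $\mathbf{w}$ is estimated sequentially: from $\mathbf{z}_1, \dots, \mathbf{z}_{i-1}$ one obtains a coefficient vector $\mathbf{w}_i$. In the theoretical analysis of online learning, it is of interest to obtain an upper bound on the regret,

$$R_n = \frac{1}{n} \sum_{i=1}^{n} f(\mathbf{w}_i; \mathbf{z}_i) - \min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} f(\mathbf{w}; \mathbf{z}_i) \qquad (2)$$

which measures what is lost compared to an offline optimization algorithm and, in a way, the speed of convergence of the online algorithm.

Traditional online algorithms are all designed based on a sequential procedure. Zinkevich (2003) proved that under the assumption that $f(\mathbf{w}; \mathbf{z}_i)$ is Lipschitz-continuous and convex w.r.t. $\mathbf{w}$, the regret enjoys the upper bound of $O(1/\sqrt{n})$. Furthermore, if $f(\mathbf{w}; \mathbf{z}_i)$ is a strongly convex function, Hazan et al. (2007) showed that the regret has the logarithmic upper bound of $O(\log(n)/n)$.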

However, traditional online algorithms have some limitations. First, they cannot access the full gradient to update the parameter vector in each iteration. Online methods are sequential, using one observation, or a mini-batch for acceleration (Cotter et al., 2011), in each iteration. As a consequence, online algorithms suffer a lower convergence rate than traditional batch learning algorithms: $O(1/\sqrt{n})$ for general convexity and $O(\log(n)/n)$ for strongly convex functions (Shalev-Shwartz and Ben-David, 2014). In comparison, offline gradient descent enjoys the convergence rate of $O(1/n)$. More importantly, the standard online algorithms, such as stochastic gradient descent, are not able to exploit the sparse structure of the feature vector, i.e. they cannot select features and recover the support of the true signal.

In this paper, we introduce a new framework for online learning, related to the statistical query model (Kearns, 1998; Chu et al., 2007). We will give more details about our new framework in Section 2.

1.1 Related Work

Online optimization and regularization. To cope with high dimensional data (e.g. $p > n$), various feature selection methods have been proposed to exploit the sparse structure of the coefficient vector. For instance, $\ell_1$ regularization has been widely used in linear regression as a sparsity inducing penalty. Also, several algorithms were designed to solve the feature selection problem in the online scenario. For online convex optimization, there are two main lines of research. One is the Forward-Backward-Splitting method (Duchi and Singer, 2009), which builds a framework for online proximal gradient (OPG). The other is Xiao’s Regularized Dual Averaging method (RDA) (Xiao, 2010), which extended the primal-dual sub-gradient method of Nesterov (2009) to the online case. In addition, some online variants have been developed in recent years, such as OPG-ADMM and RDA-ADMM in Suzuki (2013). Independently, Ouyang et al. (2013) designed stochastic ADMM, the same algorithm as OPG-ADMM. Besides, truncated online gradient descent and truncated second order methods were proposed in Fan et al. (2018), Langford et al. (2009) and Wu et al. (2017).

There is another line of research about online feature selection in the high dimensional case. In Yang et al. (2016), a new framework for online learning is proposed in which features, instead of observations, arrive one by one, and one needs to decide which features to retain. Unlike in traditional online learning, the disadvantage of this new online scenario is that we cannot build a model for prediction until all relevant features are disclosed. In this paper, we assume that one can access observations sequentially with time, so we will not cover algorithms such as Yang et al. (2016) for comparison.

In Hazan et al. (2007), an online Newton method was proposed, which used an idea similar to running averages to update the inverse of the Hessian matrix. This method enjoys the computational complexity $O(p^2)$, but it did not address the issues of variable standardization and feature selection.

Figure 1: The solution path for online OLS-th (Left) and online Lasso (Right) for the Year Prediction MSD dataset.

1.2 Our Contributions

In this paper, we bring the following contributions:

  • we introduce a new framework for online learning based on the statistical query model (Kearns, 1998; Chu et al., 2007), and we call the methods in our framework running averages methods. Many of the methods proposed in our framework enjoy a fast convergence rate and can recover the support of the true signal. Moreover, the proposed methods can address the issue of model selection, which is to obtain models with different sparsity levels and decide on the best model, e.g. using an AIC/BIC criterion. For example, Figure 1 shows the solution paths obtained by the proposed online least squares with thresholding method, as well as by the proposed online Lasso method.

  • in this framework we present online versions of popular offline algorithms such as OLS, Lasso (Tibshirani, 1996), Elastic Net (Zou and Hastie, 2005), Minimax Concave Penalty (MCP) (Zhang, 2010), and Feature Selection with Annealing (FSA) (Barbu et al., 2017).

  • we prove that the online versions of the algorithms in our framework are equivalent to their offline counterparts, therefore bringing forward all the theoretical guarantees existent in the literature for the corresponding offline methods.

  • we prove convergence and true feature recovery bounds for OLS with thresholding and FSA, and we prove a regret bound for OLS with thresholding.

  • we conduct extensive experiments on real and simulated data in both regression and classification to verify the theoretical bounds and to compare the proposed methods with popular online and offline algorithms.

Memory Computation Convergence Feature True Feature
Algorithm Running Avgs. Algorithms Coefficients Regret Selection Recovery
SGD - Slow No No
SADMM(Ouyang et al., 2013) - Slow Yes No
SIHT(Fan et al., 2018) - Slow Yes No
OFSA Fast Yes Yes
OLS-th Yes Yes
OMCP Fast Yes Yes
OElnet Fast Yes No
Table 1: Comparison between different online methods

A brief summary of the convergence rates and computational complexity of various online methods, including the proposed methods, is shown in Table 1.

Finally, we summarize the advantages and disadvantages of the proposed running averages algorithms: although the proposed online methods based on running averages sacrifice computational complexity and memory compared with classical online methods, they enjoy a fast convergence rate and high estimation accuracy. More importantly, the proposed methods can select features and recover the support of true features with high accuracy and they can obtain models with any desired sparsity level for model selection at any time.

2 Setup and Notation

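In this section, we provide a general framework for running averages. First, we establish notation and problem settings. We denote vectors by lower case bold letters, such as $\mathbf{x} \in \mathbb{R}^p$, and scalars by lower case letters, e.g. $x \in \mathbb{R}$. A sequence of vectors is denoted by subscripts, i.e. $\mathbf{x}_1, \mathbf{x}_2, \dots$, and the entries in a vector are denoted by non-bold subscripts, like $x_j$. We use upper case bold letters to denote matrices, such as $\mathbf{M} \in \mathbb{R}^{p \times p}$, and upper case letters for random variables, like $X$. Given a vector $\boldsymbol{\gamma} = (\gamma_1, \dots, \gamma_p)^T$, we define the vector norms $\|\boldsymbol{\gamma}\|_1 = \sum_{j=1}^{p} |\gamma_j|$ and $\|\boldsymbol{\gamma}\| = \sqrt{\sum_{j=1}^{p} \gamma_j^2}$.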

2.1 Running Averages

The idea of running averages comes from the statistical query model and from the issues of standard online methods. In mathematical statistics, given a distribution with unknown parameters and i.i.d. random variables $X_1, X_2, \dots, X_n$, a sufficient statistic contains all the information necessary for estimating the model parameters.

In big data learning, the large datasets cannot fit in memory, and the online methods in the literature cannot recover the support of true features. Motivated by these concerns, we propose the running averages framework, which contains two modules, a running averages module that is updated online as new data is available, and a model extraction module that can build the model with any desired sparsity from the running averages. A diagram of the framework is shown in Figure 2.

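Let $(\mathbf{x}_i, y_i)$, $i = 1, \dots, n$, be observations with $\mathbf{x}_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, and denote the data matrix $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_n)^T$ and the response vector $\mathbf{y} = (y_1, \dots, y_n)^T$. The running averages are the cumulative averages over the observations. They are

$$\boldsymbol{\mu}_x = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i, \quad \mu_y = \frac{1}{n} \sum_{i=1}^{n} y_i, \quad \mathbf{S}_{xx} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^T, \quad \mathbf{S}_{xy} = \frac{1}{n} \sum_{i=1}^{n} y_i \mathbf{x}_i, \quad S_{yy} = \frac{1}{n} \sum_{i=1}^{n} y_i^2,$$

and the sample size $n$. The running averages can be updated in an incremental manner, for example

$$\boldsymbol{\mu}_x^{(n+1)} = \frac{n}{n+1} \boldsymbol{\mu}_x^{(n)} + \frac{1}{n+1} \mathbf{x}_{n+1} \qquad (3)$$

similar to the procedure from Chapter 2.5 in Sutton and Barto (1998).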
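
As an illustration, the incremental update can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the class name `RunningAverages` and its attribute names are hypothetical, and the averages tracked are those of $\mathbf{x}$, $y$, $\mathbf{x}\mathbf{x}^T$, $y\mathbf{x}$ and $y^2$.

```python
import numpy as np


class RunningAverages:
    """Running averages of x, y, x x^T, y x and y^2 (hypothetical helper)."""

    def __init__(self, p):
        self.n = 0
        self.mu_x = np.zeros(p)
        self.mu_y = 0.0
        self.S_xx = np.zeros((p, p))
        self.S_xy = np.zeros(p)
        self.S_yy = 0.0

    def update(self, x, y):
        """Fold one observation (x, y) into the averages."""
        x = np.asarray(x, dtype=float)
        a = 1.0 / (self.n + 1)  # weight given to the new observation
        self.mu_x = (1 - a) * self.mu_x + a * x
        self.mu_y = (1 - a) * self.mu_y + a * y
        self.S_xx = (1 - a) * self.S_xx + a * np.outer(x, x)
        self.S_xy = (1 - a) * self.S_xy + a * y * x
        self.S_yy = (1 - a) * self.S_yy + a * y * y
        self.n += 1


# Stream 200 synthetic observations one at a time.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(size=200)

ra = RunningAverages(p=5)
for x_i, y_i in zip(X, y):
    ra.update(x_i, y_i)
```

After the loop, `ra.S_xx` and `ra.S_xy` agree with the batch quantities `X.T @ X / n` and `X.T @ y / n` up to floating point error, while the stored state has a size that depends only on the number of features.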

Figure 2: Diagram of the running averages based methods. The running averages are updated as the data is received. The model is extracted from the running averages only when desired.

The running averages have the following advantages: a) they cover all the sample information necessary for model estimation, b) the dimension of the running averages does not increase with the sample size $n$, c) they can be used in the online learning setting because they can be updated one example at a time.

2.2 Data Standardization

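Data standardization is an important procedure in real data analysis, especially for feature selection, because a feature could have an arbitrary scale (unit of measure) and the scale should not influence its importance in the model. For this purpose, the data matrix $\mathbf{X}$ and the response vector $\mathbf{y}$ are usually standardized by removing the mean, and $\mathbf{X}$ is further standardized by bringing all columns to the same scale. However, because we discard the data and only use the running averages, we need to standardize the running averages instead. Below, $\boldsymbol{\mu}_x$, $\mu_y$, $\mathbf{S}_{xx}$ and $\mathbf{S}_{xy}$ denote the running averages of $\mathbf{x}_i$, $y_i$, $\mathbf{x}_i \mathbf{x}_i^T$ and $y_i \mathbf{x}_i$.

Denote $\mathbf{1}_n = (1, 1, \dots, 1)^T \in \mathbb{R}^n$, and by $\sigma_{x_j}$ the sample standard deviation of the random variable $X_j$. From the running averages, we can estimate the standard deviation:

$$\sigma_{x_j} = \sqrt{(\mathbf{S}_{xx})_j - (\boldsymbol{\mu}_x)_j^2}$$

in which $(\mathbf{S}_{xx})_j$ is the $j$-th diagonal entry of the $p \times p$ matrix $\mathbf{S}_{xx}$. Then, denote by $\boldsymbol{\Pi} = \mathrm{diag}(\sigma_{x_1}, \dots, \sigma_{x_p})^{-1}$ the $p \times p$ diagonal matrix containing the inverse of the standard deviations on the diagonal. Denoting by $\tilde{\mathbf{X}}$ the standardized data matrix and by $\tilde{\mathbf{y}}$ the centered response, the original data can be standardized by

$$\tilde{\mathbf{X}} = (\mathbf{X} - \mathbf{1}_n \boldsymbol{\mu}_x^T) \boldsymbol{\Pi}, \qquad \tilde{\mathbf{y}} = \mathbf{y} - \mu_y \mathbf{1}_n.$$

From these equations we obtain the running averages of the standardized dataset:

$$\mathbf{S}_{\tilde{x}\tilde{y}} = \frac{1}{n} \tilde{\mathbf{X}}^T \tilde{\mathbf{y}} = \boldsymbol{\Pi} (\mathbf{S}_{xy} - \mu_y \boldsymbol{\mu}_x) \qquad (4)$$
$$\mathbf{S}_{\tilde{x}\tilde{x}} = \frac{1}{n} \tilde{\mathbf{X}}^T \tilde{\mathbf{X}} = \boldsymbol{\Pi} (\mathbf{S}_{xx} - \boldsymbol{\mu}_x \boldsymbol{\mu}_x^T) \boldsymbol{\Pi} \qquad (5)$$

For convenience, hereinafter, we will still use $\mathbf{S}_{xx}$ and $\mathbf{S}_{xy}$ to represent the running averages after standardization.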
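
The standardization of the running averages can be checked numerically against standardizing the raw data. The following is a sketch under the assumption that the variance is estimated with the $1/n$ normalization; the function name `standardize_running_averages` is hypothetical.

```python
import numpy as np


def standardize_running_averages(mu_x, mu_y, S_xx, S_xy):
    """Return (S_xx, S_xy) of the centered, unit-variance data."""
    var = np.diag(S_xx) - mu_x ** 2   # per-feature variance (1/n form)
    inv_sd = 1.0 / np.sqrt(var)       # diagonal of the scaling matrix
    S_xx_std = (S_xx - np.outer(mu_x, mu_x)) * np.outer(inv_sd, inv_sd)
    S_xy_std = inv_sd * (S_xy - mu_y * mu_x)
    return S_xx_std, S_xy_std


rng = np.random.default_rng(1)
n, p = 500, 4
# Features with a nonzero mean and very different scales.
X = rng.normal(loc=3.0, scale=[1.0, 5.0, 0.2, 10.0], size=(n, p))
y = X @ np.array([1.0, 0.0, 2.0, 0.0]) + rng.normal(size=n)

S_xx_std, S_xy_std = standardize_running_averages(
    X.mean(axis=0), y.mean(), X.T @ X / n, X.T @ y / n
)

# Reference: standardize the raw data directly (np.std uses 1/n by default).
X_t = (X - X.mean(axis=0)) / X.std(axis=0)
y_t = y - y.mean()
```

Both routes give the same matrices, and the diagonal of the standardized `S_xx_std` is one, so no access to the raw data is needed once the averages are stored.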

3 Algorithms

In this section, we propose several running averages-based online algorithms. First, we design online least squares based on running averages, which can be used for feature selection by thresholding. We also propose online feature selection with annealing (OFSA) to solve the constrained least squares problem. Then we consider some regularization models, such as Lasso, Elastic Net, and Minimax Concave Penalty. To simplify notation, we write OLS for online least squares, OLSth for online least squares with thresholding, OLasso for online Lasso, OElnet for online elastic net, and OMCP for online minimax concave penalty.

3.1 Preliminaries

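Before introducing the running averages-based algorithms, we prove that these online algorithms are equivalent to their offline counterparts. In our running averages framework, we share the same objective loss function with offline learning, which is the key to proving their equivalence.

Proposition. Consider the following penalized regression problem:

$$\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \mathbf{P}(\boldsymbol{\beta}; \lambda) \qquad (6)$$

in which $\boldsymbol{\beta}$ is the coefficient vector and $\mathbf{P}(\boldsymbol{\beta}; \lambda)$ is a penalty function. It is equivalent to the online optimization problem based on running averages:

$$\min_{\boldsymbol{\beta}} \frac{1}{2} \boldsymbol{\beta}^T \mathbf{S}_{xx} \boldsymbol{\beta} - \boldsymbol{\beta}^T \mathbf{S}_{xy} + \mathbf{P}(\boldsymbol{\beta}; \lambda) \qquad (7)$$

Proof. The loss function (6) can be rewritten as

$$\frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \mathbf{P}(\boldsymbol{\beta}; \lambda) = \frac{1}{2} \boldsymbol{\beta}^T \mathbf{S}_{xx} \boldsymbol{\beta} - \boldsymbol{\beta}^T \mathbf{S}_{xy} + \frac{1}{2} S_{yy} + \mathbf{P}(\boldsymbol{\beta}; \lambda)$$

in which $\mathbf{S}_{xx} = \mathbf{X}^T\mathbf{X}/n$, $\mathbf{S}_{xy} = \mathbf{X}^T\mathbf{y}/n$, and $S_{yy} = \mathbf{y}^T\mathbf{y}/n$ are running averages. Since $S_{yy}$ does not depend on $\boldsymbol{\beta}$, the offline learning problem is equivalent to the running averages-based optimization.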
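
The identity behind the proof can be verified numerically for the unpenalized part of the objective. The snippet below is an illustrative check on synthetic data; the variable names are chosen for this sketch only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 300, 6
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)  # an arbitrary coefficient vector

# Running averages (computed in one shot here; online they are accumulated).
S_xx = X.T @ X / n
S_xy = X.T @ y / n
S_yy = y @ y / n

# Square loss from the raw data versus from the running averages.
offline_loss = np.sum((y - X @ beta) ** 2) / (2 * n)
online_loss = 0.5 * beta @ S_xx @ beta - beta @ S_xy + 0.5 * S_yy
```

The two values coincide up to rounding for any `beta`, which is why any penalty added to both sides leads to the same minimizer.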

3.2 Online Least Squares

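In OLS, we need to find the solution of the normal equations $\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$. Since $\mathbf{X}^T\mathbf{X}$ and $\mathbf{X}^T\mathbf{y}$ can be computed from the running averages, as $n\mathbf{S}_{xx}$ and $n\mathbf{S}_{xy}$, we obtain:

$$\mathbf{S}_{xx} \boldsymbol{\beta} = \mathbf{S}_{xy} \qquad (8)$$

Thus, online least squares is equivalent to offline least squares.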
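
A short check of this equivalence, assuming a well-conditioned low dimensional problem so that the linear system has a unique solution:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 400, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=n)

S_xx = X.T @ X / n
S_xy = X.T @ y / n

beta_online = np.linalg.solve(S_xx, S_xy)            # from running averages
beta_offline = np.linalg.lstsq(X, y, rcond=None)[0]  # from the raw data
```

The two coefficient vectors agree to numerical precision, although the first one never touches the $n \times p$ data matrix at solve time.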

3.3 Online Least Squares with Thresholding

OLSth is aimed at solving the following constrained minimization problem:

$$\min_{\boldsymbol{\beta}, \, \|\boldsymbol{\beta}\|_0 \le k} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 \qquad (9)$$

It is a non-convex and NP-hard problem because of the sparsity constraint. Here, we propose a three step procedure to solve it: first, we use online least squares to estimate $\hat{\boldsymbol{\beta}}$; then we remove unimportant variables according to the coefficient magnitudes $|\hat{\beta}_j|$, $j = 1, \dots, p$; finally, we use least squares to refit the model on the subset of selected features. The prototype algorithm is described in Algorithm 1. In the high dimensional case $p > n$, we can use the ridge regression estimator in the first step.

  Input: Training running averages $\mathbf{S}_{xx}$, $\mathbf{S}_{xy}$ and sample size $n$, sparsity level $k$.
  Output: Trained regression parameter vector $\boldsymbol{\beta}$ with $\|\boldsymbol{\beta}\|_0 \le k$.
1:  Find $\hat{\boldsymbol{\beta}}$ by OLS.
2:  Keep only the $k$ variables with largest $|\hat{\beta}_j|$.
3:  Fit the model on the selected features by OLS.
Algorithm 1 OLS with Thresholding
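
The three steps of Algorithm 1 translate almost directly into code. The sketch below is one possible rendering, not the authors' implementation; the function name `ols_th` and the optional `ridge` argument (a stand-in for the ridge estimator mentioned for the high dimensional case) are assumptions of this example.

```python
import numpy as np


def ols_th(S_xx, S_xy, k, ridge=0.0):
    """OLS with thresholding from running averages; returns (beta, support)."""
    p = S_xy.shape[0]
    # Step 1: full least squares fit (ridge > 0 is an optional stabilizer).
    beta = np.linalg.solve(S_xx + ridge * np.eye(p), S_xy)
    # Step 2: keep the k coefficients with largest magnitude.
    keep = np.sort(np.argsort(-np.abs(beta))[:k])
    # Step 3: refit by least squares on the selected features only.
    beta_sparse = np.zeros(p)
    beta_sparse[keep] = np.linalg.solve(S_xx[np.ix_(keep, keep)], S_xy[keep])
    return beta_sparse, keep


rng = np.random.default_rng(4)
n, p, k = 2000, 20, 3
beta_true = np.zeros(p)
beta_true[[2, 7, 11]] = [1.5, -2.0, 1.0]
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)

beta_hat, support = ols_th(X.T @ X / n, X.T @ y / n, k)
```

Because the function only reads the running averages, it can be called with a different `k` at any time without revisiting the data.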

3.4 Online Feature Selection with Annealing

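Unlike OLSth, OFSA is an iterative thresholding algorithm that simultaneously solves the coefficient estimation problem and the feature selection problem. The main ideas in OFSA are: 1) use an annealing plan to lessen the greediness in reducing the dimensionality from $p$ to $k$, and 2) remove irrelevant variables to facilitate computation. The algorithm starts with an initial parameter vector $\boldsymbol{\beta}$, generally $\boldsymbol{\beta} = \mathbf{0}$, and then alternates two basic steps: one updates the parameters to minimize the loss $L(\boldsymbol{\beta})$ by gradient descent with learning rate $\eta$,

$$\boldsymbol{\beta} \leftarrow \boldsymbol{\beta} - \eta \frac{\partial L}{\partial \boldsymbol{\beta}},$$

and the other is a feature selection step that removes some variables based on the ranking of $|\beta_j|$, $j = 1, \dots, p$. In the second step, an annealing schedule decides the number of features $M_e$ that are kept in each time period $e$.

More details about the offline FSA algorithm, such as applications and theoretical analysis, are given in Barbu et al. (2017). For the square loss, the computation of

$$\frac{\partial L}{\partial \boldsymbol{\beta}} = -\frac{1}{n} \mathbf{X}^T \mathbf{y} + \frac{1}{n} \mathbf{X}^T \mathbf{X} \boldsymbol{\beta} = \mathbf{S}_{xx} \boldsymbol{\beta} - \mathbf{S}_{xy} \qquad (10)$$

falls into our running averages framework. Thus, we derive the OFSA, which is equivalent to the offline FSA from Barbu et al. (2017). The algorithm is summarized in Algorithm 2.

  Input: Training running averages $\mathbf{S}_{xx}$, $\mathbf{S}_{xy}$ and sample size $n$, sparsity level $k$.
  Output: Trained regression parameter vector $\boldsymbol{\beta}$ with $\|\boldsymbol{\beta}\|_0 \le k$.
  Initialize $\boldsymbol{\beta} = \mathbf{0}$.
  for $e = 1$ to $N^{iter}$ do
     Update $\boldsymbol{\beta} \leftarrow \boldsymbol{\beta} - \eta (\mathbf{S}_{xx} \boldsymbol{\beta} - \mathbf{S}_{xy})$
     Keep only the $M_e$ variables with highest $|\beta_j|$ and renumber them $1, \dots, M_e$.
  end for
  Fit the model on the selected features by OLS.
Algorithm 2 Online FSA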
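
A compact sketch of Algorithm 2 is given below. It is illustrative only: the function name `ofsa`, the default learning rate, the number of iterations `T` and in particular the annealing schedule used for the number of kept features are assumptions of this example and should not be read as the schedule used in the paper.

```python
import numpy as np


def ofsa(S_xx, S_xy, k, eta=0.5, T=50, mu=5.0):
    """Sketch of online FSA on running averages; returns (beta, support)."""
    p = S_xy.shape[0]
    active = np.arange(p)  # indices of the surviving features
    beta = np.zeros(p)
    for e in range(1, T + 1):
        # Gradient step on the square loss, restricted to active features.
        grad = S_xx[np.ix_(active, active)] @ beta - S_xy[active]
        beta = beta - eta * grad
        # Assumed annealing schedule: decreases from about p down to k.
        m_e = k + int((p - k) * max(0.0, (T - 2 * e) / (2 * e * mu + T)))
        if m_e < active.size:
            top = np.sort(np.argsort(-np.abs(beta))[:m_e])
            active, beta = active[top], beta[top]
    # Final refit by least squares on the selected features.
    beta_full = np.zeros(p)
    beta_full[active] = np.linalg.solve(
        S_xx[np.ix_(active, active)], S_xy[active]
    )
    return beta_full, active


rng = np.random.default_rng(5)
n, p, k = 2000, 20, 3
beta_true = np.zeros(p)
beta_true[[2, 7, 11]] = [1.5, -2.0, 1.0]
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)

beta_hat, support = ofsa(X.T @ X / n, X.T @ y / n, k)
```

Removing features gradually, rather than in a single cut as in OLSth, lets the coefficients of the surviving features readjust between removals; the shrinking active set also makes each gradient step cheaper.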

3.5 Online Regularization Methods

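Penalized methods can also be used to select features, and they map naturally into our running averages framework. A popular one is the Lasso estimator (Tibshirani, 1996), which solves the convex optimization problem

$$\arg\min_{\boldsymbol{\beta}} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \sum_{j=1}^{p} |\beta_j| \qquad (11)$$

in which $\lambda > 0$ is a tuning parameter.

Besides Lasso, the SCAD (Fan and Li, 2001), Elastic Net (Zou and Hastie, 2005) and MCP (Zhang, 2010) were proposed to deal with the variable selection and estimation problem. Here, we use a gradient-based method with a thresholding operator $\Theta(t; \lambda)$ to solve the regularized loss minimization problems (She et al., 2009). For instance, in Lasso and Elastic Net, $\Theta$ is the soft thresholding operator, and in MCP,

$$\Theta(t; \lambda) = \begin{cases} 0 & \text{if } 0 \le |t| \le \lambda, \\ \dfrac{t - \lambda\, \mathrm{sign}(t)}{1 - 1/b} & \text{if } \lambda < |t| \le b\lambda, \\ t & \text{if } |t| > b\lambda, \end{cases} \qquad (12)$$

in which $b$ is a constant. The general algorithm is given in Algorithm 3.

  Input: Training running averages $\mathbf{S}_{xx}$, $\mathbf{S}_{xy}$, sample size $n$, penalty parameter $\lambda$.
  Output: Trained sparse regression parameter vector $\boldsymbol{\beta}$.
  Initialize $\boldsymbol{\beta} = \mathbf{0}$.
  for $t = 1$ to $N^{iter}$ do
     Update $\boldsymbol{\beta} \leftarrow \boldsymbol{\beta} - \eta (\mathbf{S}_{xx} \boldsymbol{\beta} - \mathbf{S}_{xy})$
     Update $\boldsymbol{\beta} \leftarrow \Theta(\boldsymbol{\beta}; \eta\lambda)$
  end for
  Fit the model on the selected features by OLS.
Algorithm 3 Online Regularized Methods by GD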
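
The gradient-plus-thresholding loop of Algorithm 3 can be sketched as follows. This is an illustration under stated assumptions: the thresholding level $\eta\lambda$, the firm-thresholding form used for MCP with constant `b > 1`, the default step size and the function names are choices made for this example, and the final OLS refit is omitted for brevity.

```python
import numpy as np


def soft_threshold(t, lam):
    """Soft thresholding, the operator used for Lasso and Elastic Net."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)


def mcp_threshold(t, lam, b=3.0):
    """Firm thresholding associated with MCP; b > 1 is a constant."""
    a = np.abs(t)
    mid = (t - lam * np.sign(t)) / (1.0 - 1.0 / b)
    return np.where(a <= lam, 0.0, np.where(a <= b * lam, mid, t))


def thresholded_gd(S_xx, S_xy, lam, theta, eta=0.5, T=200):
    """Gradient descent on running averages with a thresholding step."""
    beta = np.zeros(S_xy.shape[0])
    for _ in range(T):
        beta = beta - eta * (S_xx @ beta - S_xy)  # gradient step
        beta = theta(beta, eta * lam)             # thresholding step
    return beta


rng = np.random.default_rng(6)
n, p = 2000, 20
beta_true = np.zeros(p)
beta_true[[2, 7, 11]] = [1.5, -2.0, 1.0]
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)
S_xx, S_xy = X.T @ X / n, X.T @ y / n

beta_lasso = thresholded_gd(S_xx, S_xy, lam=0.3, theta=soft_threshold)
beta_mcp = thresholded_gd(S_xx, S_xy, lam=0.3, theta=mcp_threshold)
```

On this well separated example both penalties select the same three features; the soft-thresholded coefficients are shrunk toward zero, whereas firm thresholding leaves large coefficients untouched, which is the usual motivation for non-convex penalties such as MCP.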

3.6 Online Classification Methods

The aforementioned algorithms can select features not only for regression but also for classification, even though they are based on the $\ell_2$ loss. In fact, for the two-class problem with labels $+1$ and $-1$, the coefficient vector for classification from linear least squares is proportional to the coefficient vector obtained by linear discriminant analysis without intercept (Friedman et al., 2001). Besides, one can use the Lasso method to select variables for classification under some assumptions (Neykov et al., 2016). We will give the theoretical guarantees in Section 4.

3.7 Memory and Computational Complexity

In general, the memory complexity for the running averages is $O(p^2)$ because $\mathbf{S}_{xx}$ is a $p \times p$ matrix. The computational complexity of maintaining the running averages is $O(np^2)$. Except for OLSth, the computational complexity of obtaining the model with the running averages-based algorithms is $O(p^2)$, based on a limited number of iterations, each taking $O(p^2)$ time. As for OLSth, it is $O(p^3)$ if done by Gaussian elimination or $O(p^2)$ if done using an iterative method that takes far fewer iterations than $p$. We can conclude that the running averages storage does not depend on the sample size $n$, and the computation is linear in $n$. Hence, when $n \gg p$, compared to the batch learning algorithms, the running averages-based methods need less memory and have lower computational complexity. At the same time, they can achieve the same convergence rate as the batch learning algorithms.

3.8 Model Adaptation

Detecting changes in the underlying model and rapidly adapting to them are common problems in online learning, and some applications are based on varying-coefficient models (Javanmard, 2017). Our running averages online methods can adapt to coefficient changes in large scale data streams. For that, the update equation (3) can be regarded in a more general form as

$$\boldsymbol{\mu}_x^{(n+1)} = (1 - \alpha_n) \boldsymbol{\mu}_x^{(n)} + \alpha_n \mathbf{x}_{n+1} \qquad (13)$$

where we only show one of the running averages for illustration, but the same type of update is used for all of them.

The original running averages use $\alpha_n = 1/(n+1)$, which gives all observations equal weight in the running average. For the varying-coefficient models, we use a larger value of $\alpha_n$ that gives more weight to the recent observations. However, too much adaptation is also not desirable, because in that case the model will not be able to recover some weak coefficients that can only be recovered given sufficiently many observations. More details about simulation and application will be covered in Section 5.
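
The effect of the weight given to new observations can be seen on a toy stream whose mean changes abruptly. In this sketch the adaptive weight is chosen as $\max(1/(n+1), \alpha_{\min})$; this particular rule, the value $\alpha_{\min} = 0.05$ and the function name `update_mean` are assumptions made for illustration.

```python
import numpy as np


def update_mean(mu, x, n, alpha_min=0.0):
    """One step of mu <- (1 - a) mu + a x with a = max(1/(n+1), alpha_min)."""
    a = max(1.0 / (n + 1), alpha_min)
    return (1.0 - a) * mu + a * x


# A stream whose mean jumps from 0 to 5 halfway through.
stream = np.concatenate([np.zeros(500), np.full(500, 5.0)])

mu_equal, mu_adapt = 0.0, 0.0
for n, x in enumerate(stream):
    mu_equal = update_mean(mu_equal, x, n)                  # equal weights
    mu_adapt = update_mean(mu_adapt, x, n, alpha_min=0.05)  # forgetting
```

With equal weights the estimate ends at the overall mean 2.5, while the version with a floor on the weight tracks the current mean 5. The price is a noisier estimate on stationary data, which is the trade-off described above.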

4 Theoretical Analysis

In this section we will give the theoretical analysis for our methods. First, because of Prop. 3.1, we have the equivalence of the online penalized models including Lasso, Elastic Net, SCAD and MCP with their offline counterparts, and thus all their theoretical guarantees of convergence, consistency, oracle inequalities, etc., carry over to their online counterparts.

In this section, we will first show that the OLSth and OFSA methods can recover the support of the true features with high probability, and then we will provide a regret bound analysis for OLSth. The main ideas of our proofs come from Yuan et al. (2014). Then we will give a theoretical justification for the support recovery of our method in classification. The proofs are given in the Appendix.

The first result we present is that, when the sample size $n$ is large enough, OLSth can recover the support of the true features with high probability. In our theorem, the data is not normalized and the features do not have the same scale. Thus, we consider the data normalization in our theoretical analysis. Although the intercept is necessary in applications, we do not cover it here.

Proposition. Suppose we have the linear model

where is the data matrix, in which , , are independently drawn from . Let and , and

(14)

Then with probability , the index set of top values of is exactly , where is the OLS estimate.

Proposition 4 gives a theoretical guarantee of true feature recovery for OLSth. We can observe that the probability of true feature recovery does not depend on the true sparsity . We will verify this by numerical experiments in the next section. Here, we also give the theoretical guarantees for the data standardization case.

Remark. Denote , . Given the conditions , for some satisfying , then with high probability the index set of top values of is exactly , where is the OLS estimate with standardized .

Theorem (True feature recovery for OLS-th). With the same notations as Proposition 4, if

(15)

where is the largest diagonal value of , then with probability the index set of top values of is exactly .

Next, we consider the theoretical guarantees of true feature recovery for the OFSA algorithm. First, we need the definition of restricted strong convexity/smoothness.

Definition (Restricted Strong Convexity/Smoothness). For any integer , we say that a differentiable function is restricted strongly convex (RSC) with parameter and restricted strongly smooth (RSS) with parameter if there exist such that

(16)

In the linear regression case, the RSC/RSS conditions are equivalent to the restricted isometry property (RIP):

(17)

And in the low dimensional case, the RIP condition will degenerate to

Proposition. With the same conditions as Proposition 4, let be an arbitrary -sparse vector, so . Let be the OFSA coefficient vector at iteration , be its support, and . If is a differentiable function which is -convex and -smooth, then for any learning rate , we have

where .

Theorem (Convergence of OFSA). With the same assumptions as Proposition 4, let and . Assume we have for any . Let be the diagonal matrix with the true standard deviations of respectively. Then, with probability , the OFSA coefficient vector satisfies

Proof.

Because for any and , we have . Thus, we can get that and . Then, by using Proposition 4 recursively, we get the upper bound of the when the dimension of decreases from to . At the time period , we have

and at the time period , we also have

in which and are the numbers of selected features at time period and , respectively. Thus, we have

Because we have , and , we get

Applying the same idea repeatedly all the way to we get

Since , where , we have

For the first term we have

and because is standardized and from the corollary in Appendix A we get

Therefore, with probability , we have

We can also use the lemma in Appendix A with and (since ) to get

So, with probability we have

and therefore

Note that the dimension of the vector will reduce from to , thus we apply Proposition 4 recursively with varying . Here, we assume that .

Now we show that the OFSA algorithm can recover the support of the true features with high probability.

Corollary (True feature recovery for OFSA). Under the conditions of Theorem 4, let

Then after iterations, the OFSA algorithm will output satisfying with probability .

Finally, we consider the regret bound for the OLS and OLSth algorithms. In fact, all the feature selection algorithms we mentioned degenerate to OLS if the true features are selected. First, we define the regret for a sparse model with sparsity levels :

(18)

in which is the coefficient vector at step and .

Observe that for , the loss functions from (18) are twice continuously differentiable. We denote and . We will need the following assumptions:

Assumption 1

Given , then we have satisfy

Assumption 2

Given , there exist constants and such that and .

Proposition (Regret of OLS). Given , under Assumptions 1 and 2, the regret of OLS satisfies:

Theorem (Regret of OLS-th). With Assumptions 1 and 2 holding for , there exists a constant such that if satisfies

(19)

where , then with probability at least the regret of OLSth satisfies:

Theoretical guarantees for feature selection in classification. Proposition 2.3.6 and Remark 2.3.7 from Neykov et al. (2016) show that the least squares Lasso algorithm (and therefore the Online Lasso) can recover the support of the true variables for a discrete response under some assumptions.

Theorem (True support recovery). Consider the special case of a single index model, , in which and satisfies the irrepresentable condition. If , are known strictly increasing continuous functions and under the assumptions from Neykov et al. (2016), the least squares Lasso algorithm can recover the support of true features correctly for discrete response .

The proof and more mathematical details can be found in Neykov et al. (2016). Based on Theorem 4, we have theoretical guarantees for support recovery for some of our running averages-based online methods in classification.

5 Experiments

In this section we evaluate the performance of our proposed algorithms and compare them with offline learning methods and some standard stochastic algorithms. First, we present the results of numerical experiments on synthetic data, comparing the performance on feature selection and prediction. We also provide regret plots for the running averages-based algorithms and compare them with classical online algorithms. Finally, we present an evaluation on real data. All simulation experiments are run on a desktop computer with a Core i5-4460S CPU and 16 GB of memory.

5.1 Experiments for Simulated Data

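Here, we generate the simulated data with uniformly correlated predictors: given a scalar $\alpha$, we generate $z_i \sim N(0, 1)$, then we set

$$\mathbf{x}_i = \alpha z_i \mathbf{1}_{p \times 1} + \mathbf{u}_i, \quad \text{with } \mathbf{u}_i \sim N(\mathbf{0}, \mathbf{I}_p).$$

Finally we obtain the data matrix $\mathbf{X} = [\mathbf{x}_1^T, \dots, \mathbf{x}_n^T]^T$. It is easy to verify that the correlation between any pair of predictors is $\alpha^2/(1 + \alpha^2)$. We set $\alpha = 1$ in our experiments, thus the correlation between any two variables is 0.5. Given $\mathbf{X}$, the dependent response $\mathbf{y}$ is generated from the following linear models, for regression and classification respectively,

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta}^* + \boldsymbol{\eta}, \quad \text{with } \boldsymbol{\eta} \sim N(\mathbf{0}, \mathbf{I}_n) \qquad (20)$$
$$\mathbf{y} = \mathrm{sign}(\mathbf{X}\boldsymbol{\beta}^* + \boldsymbol{\eta}), \quad \text{with } \boldsymbol{\eta} \sim N(\mathbf{0}, \mathbf{I}_n) \qquad (21)$$

where $\boldsymbol{\beta}^*$ is a $p$-dimensional sparse parameter vector. The true coefficients are $\beta^*_j = 0$ except for $k$ of them, which are set to the signal strength value $\beta$. Observe that the classification data cannot be perfectly separated by a linear model.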
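
A data generator of this kind can be sketched as below, assuming the shared-factor construction $\mathbf{x}_i = \alpha z_i \mathbf{1} + \mathbf{u}_i$ with standard normal $z_i$ and $\mathbf{u}_i$, which yields a pairwise correlation of $\alpha^2/(1+\alpha^2)$. The function name `simulate`, the unit noise variance and the placement of the nonzero coefficients on the first $k$ features are assumptions of this example.

```python
import numpy as np


def simulate(n, p, k, beta, alpha=1.0, classification=False, seed=0):
    """Generate uniformly correlated predictors and a sparse linear response."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, 1))              # shared factor, one per row
    X = alpha * z + rng.normal(size=(n, p))  # x_i = alpha * z_i * 1 + u_i
    beta_true = np.zeros(p)
    beta_true[:k] = beta                     # hypothetical support choice
    signal = X @ beta_true + rng.normal(size=n)
    y = np.sign(signal) if classification else signal
    return X, y, beta_true


X, y, beta_true = simulate(n=20000, p=10, k=3, beta=1.0)
corr = np.corrcoef(X, rowvar=False)
off_diag = corr[~np.eye(10, dtype=bool)]  # all pairwise correlations
```

With `alpha=1.0` the empirical pairwise correlations concentrate around 0.5. Because the noise enters before the sign is taken, the classification labels are not linearly separable.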

Variable Detection Rate (%) RMSE Time (s)
Lasso TSGD SADMM OLSth OFSA OMCP OElnet Lasso TSGD SADMM OLSth OFSA OMCP OElnet Lasso TSGD SADMM OLSth OFSA OMCP OElnet RAVE
, strong signal
32.14 11.22 18.10 77.40 99.81 73.71 32.12 11.63 23.15 95.05 5.592 1.136 6.282 11.61 4.332 0.007 5.326 0.052 0.289 15.49 9.648 0.026
46.05 11.22 41.23 100 100 98.02 45.19 9.464 13.45 93.50 1.017 1.017 1.745 9.557 26.91 0.019 15.73 0.051 0.288 13.86 7.113 0.076
72.40 11.22 65.78 100 100 100 72.42 6.07 13.34 94.92 1.003 1.003 1.003 6.042 47.32 0.065 51.80 0.051 0.288 6.508 5.885 0.246
, weak signal
14.09 10.89 13.53 10.11 12.40 15.55 14.08 1.128 1.027 1.363 1.069 1.169 1.049 1.124 5.353 0.006 6.703 0.052 0.288 13.20 9.741 0.026
31.58 10.89 19.80 22.48 32.47 32.32 31.54 1.009 1.007 1.370 1.025 1.006 1.005 1.006 48.13 0.067 67.82 0.051 0.287 14.98 4.961 0.249
81.93 10.89 11.30 80.55 85.14 84.86 81.80 1.001 1.010 1.382 1.003 1.003 1.003 1.003 452.2 0.672 679.7 0.051 0.287 15.93 5.120 2.458
98.66 10.89 10.80 98.94 99.27 99.26 98.71 0.999 1.008 1.383 0.998 0.998 0.998 0.998 1172 2.001 2044 0.051 0.287 13.96 3.749 7.326
- 10.89 - 100 100 100 100 - 1.005 - 0.996 0.996 0.996 0.996 - 6.651 - 0.051 0.288 7.352 1.726 24.36
Table 2: Comparison between the running averages methods and other online and offline methods for regression, averaged over 100 runs

The simulation is based on the following data parameter setting: and . We consider the signal strength (weak and strong signals). The sample size $n$ varies from 1000 to for both regression and classification settings. For regression, we compare our algorithms with SADMM (Ouyang et al., 2013) and the offline Lasso (Tibshirani, 1996). We also implemented the following truncated stochastic gradient descent (TSGD) (Fan et al., 2018; Wu et al., 2017):

where the truncation operator keeps the $k$ entries of largest magnitude.
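
For readers unfamiliar with truncated SGD, a generic version for the square loss is sketched below. It is not necessarily the exact variant used in the experiments: the constant step size `eta`, the single pass over the data and the function names are assumptions of this example.

```python
import numpy as np


def truncate(beta, k):
    """Keep the k entries of largest magnitude and zero out the rest."""
    out = np.zeros_like(beta)
    keep = np.argsort(-np.abs(beta))[:k]
    out[keep] = beta[keep]
    return out


def tsgd(X, y, k, eta=0.01):
    """One pass of SGD on the square loss with truncation after each step."""
    beta = np.zeros(X.shape[1])
    for x_i, y_i in zip(X, y):
        beta = beta + eta * (y_i - x_i @ beta) * x_i  # SGD step
        beta = truncate(beta, k)                      # enforce sparsity
    return beta


rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 20))
beta_true = np.zeros(20)
beta_true[[2, 7, 11]] = [1.5, -2.0, 1.0]
y = X @ beta_true + rng.normal(size=1000)

beta_tsgd = tsgd(X, y, k=3)
```

Each update uses a single observation, so a feature that is truncated early has to re-enter through a noisy single-sample gradient, which is one intuition for the weaker support recovery of such methods.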

For classification, we cover four methods for comparison: the OPG (Duchi and Singer, 2009) and RDA(Xiao, 2010) frameworks for elastic net, the first order online feature selection (FOFS) method(Wu et al., 2017) and the second order online feature selection (SOFS) method(Wu et al., 2017).

For each method, the sparsity controlling parameter is tuned to obtain $k$ variables. This can be done directly for OFSA and OLSth, and indirectly through the penalty parameter for the other methods. In RDA, OPG and SADMM, we used 200 values of $\lambda$ on an exponential grid and chose the $\lambda$ that induces $\hat{k}$ non-zero features, where $\hat{k}$ is the largest number of non-zero features smaller than or equal to $k$, the number of true features.

The following criteria are evaluated in the numerical experiments: the true variable detection rate (DR), the root mean square error (RMSE) on the test data for regression, the area under the ROC curve (AUC) on the test data for classification, and the running time (Time) of the algorithms.

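The variable detection rate DR is defined as the average number of true variables that are correctly detected by an algorithm divided by the number of true variables. So if $S_{\boldsymbol{\beta}}$ is the set of detected variables and $S_{\boldsymbol{\beta}^*}$ is the set of true variables, then $DR = E(|S_{\boldsymbol{\beta}} \cap S_{\boldsymbol{\beta}^*}|) / |S_{\boldsymbol{\beta}^*}|$.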
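
As a small worked example of this definition, the function below averages the fraction of true variables found over several runs. The function name `detection_rate` and the toy supports are hypothetical.

```python
def detection_rate(detected_runs, true_support):
    """Average fraction of true variables found across runs, in percent."""
    true_set = set(true_support)
    hits = [len(set(d) & true_set) for d in detected_runs]
    return 100.0 * sum(hits) / (len(hits) * len(true_set))


# Three runs that detect 3, 2 and 2 of the 3 true variables.
runs = [[0, 10, 20], [0, 10, 33], [5, 10, 20]]
dr = detection_rate(runs, true_support=[0, 10, 20])
```

Here seven of the nine possible detections are made, so the detection rate is 7/9, about 77.8%.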

The results are presented in Tables 2 and 3. We replicate the experiments 100 times and present the average results. Compared to the batch learning method Lasso, in regression, the running averages online methods enjoy low memory complexity. Moreover, the larger datasets cannot fit in memory, hence we cannot obtain experimental results for Lasso on them. In our methods, we input the running averages rather than the data matrix. The memory complexity for running averages is $O(p^2)$, which is better than the $O(np)$ of batch learning in the setting of $n \gg p$.

From the numerical experiments, we can draw the conclusion that none of the online methods we tested (RDA, OPG, SADMM, FOFS and SOFS) performs very well in true feature recovery. Only the offline Lasso and the proposed running averages-based online methods can recover the true signal with high probability. When the signal is weak, although the running averages methods need a large sample size to recover the weak true signal, they outperform the batch learning methods and the other online methods in our experiment.

In prediction, most methods do well, except that in regression the existing methods (Lasso, TSGD and SADMM) do not work well when the signal is strong. In contrast, the proposed running averages methods perform very well in prediction regardless of whether the signal is weak or strong, in both regression and classification.

Finally, we know that the computational complexity of obtaining the model from the running averages does not depend on the sample size $n$, but the time to update the running averages, shown as RAVE in Tables 2 and 3, does increase linearly with $n$. Indeed, we observe in Tables 2 and 3 that the running time of OFSA and OLSth does not change significantly with $n$. However, because of the need to tune the penalty parameters in OLasso, OElnet, and OMCP, it takes more time to run these algorithms. The computational complexity of the traditional online algorithms increases with the sample size $n$. This is especially true for OPG, RDA, and SADMM, which take a large amount of time to tune the parameters to select features. When the sample size is very large, running these algorithms takes more than a day.

Variable Detection Rate (%) AUC Time (s)
n FOFS SOFS OPG RDA OFSA OLSth OLasso OMCP FOFS SOFS OPG RDA OFSA OLSth OLasso OMCP FOFS SOFS OPG RDA OFSA OLSth OLasso OMCP RAVE
, strong signal
10.64 10.19 10.46 10.97 38.89 30.30 34.70 41.54 0.995 0.992 0.992 0.990 0.995 0.990 0.996 0.996 0.001 0.001 0.490 0.848 0.005 0.001 0.080 0.160 0.247
10.64 9.95 10.42 10.34 67.67 59.32 56.18 67.52 0.994 0.992 0.992 0.989 0.998 0.996 0.997 0.998 0.003 0.004 1.471 2.210 0.005 0.001 0.083 0.158 0.742
10.64 9.95 10.43 11.08 94.95 93.21 86.90 94.77 0.994 0.992 0.992 0.990 1.000 1.000 0.999 1.000 0.010 0.015 4.900 6.118 0.005 0.001 0.079 0.159 2.478
, strong signal
13.40 10.19 10.00 10.37 19.41 15.93 22.55 23.81 0.827 0.829 0.828 0.828 0.824 0.815 0.829 0.830 0.001 0.001 0.494 0.815 0.005 0.001 0.073 0.148 0.249
15.86 9.95 10.23 10.34 34.46 27.35 35.14 37.70 0.827 0.829 0.829 0.829 0.831 0.827 0.832 0.832 0.003 0.004 1.481 2.093 0.005 0.001 0.074 0.152 0.743
17.36 9.95 10.32 10.91 64.84 56.42 61.07 64.95 0.830 0.831 0.831 0.830 0.834 0.833 0.834 0.834 0.010 0.015 4.935 5.827 0.005 0.001 0.078 0.161 2.472
17.13 9.23 10.32 10.37 91.55 88.91 88.69 91.58 0.826 0.828 0.828 0.827 0.833 0.833 0.833 0.833 0.030 0.044 14.81 17.31 0.005 0.001 0.073 0.164 7.446
17.72 9.91 - - 99.97 99.94 99.88 99.97 0.828 0.829 - - 0.834 0.834 0.834 0.834 0.100 0.146 - - 0.005 0.001 0.039 0.110 24.85
Table 3: Comparison between the running averages methods and other online methods for classification, averaged over 100 runs

5.2 Evaluation of Theoretical Bounds

In this section we conduct a series of experiments to compare the theoretical bounds obtained in Section 4 with the empirical behavior observed in simulations.

5.2.1 True Feature Recovery Analysis

In this section we experimentally evaluate the tightness of the bounds for OLS-th from Proposition 4 and Theorem 4. For that, we use the regression data from Section 5.1 and find the experimental such that all variables are correctly detected in at least of runs and compare it with the corresponding bounds given by Equations (14) and (15). In most cases we used in Equations (14) and (15). However, when , we chose as low as to obtain a theoretical probability of at least in Proposition 4 and Theorem 4.

Figure 3: Comparison of the experimental for OLS-th with the bounds from Eq. (14) of Proposition 4 and Eq. (15) of Theorem 4. Left: vs. , for . Middle: vs. , for . Right: