A Novel Framework for Online Supervised Learning with Feature Selection
Abstract
Current online learning methods suffer from slower convergence rates and a limited ability to recover the support of the true features compared to their offline counterparts. In this paper, we present a novel framework for online learning based on running averages and introduce online versions of some popular existing offline methods such as the Elastic Net, Minimax Concave Penalty and Feature Selection with Annealing. We prove the equivalence between our online methods and their offline counterparts and give theoretical true feature recovery and convergence guarantees for some of them. In contrast to existing online methods, the proposed methods can extract models with any desired sparsity level at any time. Numerical experiments indicate that our new methods enjoy high accuracy of true feature recovery and a fast convergence rate, compared with standard online and offline algorithms. We also show how the running averages framework can be used for model adaptation in the presence of model drift. Finally, we present applications to large datasets where again the proposed framework shows competitive results compared to popular online and offline algorithms.
Sun and Barbu
Keywords: online learning, feature selection, model adaptation
1 Introduction
Online learning is one of the most promising approaches to efficiently handle large scale machine learning problems. Nowadays, the datasets from various areas such as bioinformatics, medical imaging and computer vision are rapidly increasing in size, and one often encounters datasets so large that they cannot fit in the computer memory. Online methods are capable of addressing these issues by constructing the model sequentially, one example at a time. A comprehensive survey of the online learning and online optimization literature has been presented in Hazan (2016).
In this paper, we assume that a sequence of i.i.d. observations $(\mathbf{x}_i, y_i)$ is generated from an unknown distribution, and the goal is to minimize a loss function

$$L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} f(\mathbf{w}; \mathbf{x}_i, y_i), \qquad (1)$$

where $f(\mathbf{w}; \mathbf{x}_i, y_i)$ is a per-example loss function.
In online learning, the coefficient vector is estimated sequentially: from the observations $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_t, y_t)$ a coefficient vector $\mathbf{w}_t$ is obtained. In the theoretical analysis of online learning, it is of interest to obtain an upper bound on the regret,

$$R(T) = \sum_{t=1}^{T} f(\mathbf{w}_t; \mathbf{x}_t, y_t) - \min_{\mathbf{w}} \sum_{t=1}^{T} f(\mathbf{w}; \mathbf{x}_t, y_t), \qquad (2)$$

which measures what is lost compared to an offline optimization algorithm, and in a way measures the speed of convergence of the online algorithm.
Traditional online algorithms are designed based on a sequential procedure. Zinkevich (2003) proved that under the assumption that the loss is Lipschitz-continuous and convex w.r.t. $\mathbf{w}$, the regret enjoys the upper bound $O(\sqrt{T})$. Furthermore, if the loss is strongly convex, Hazan et al. (2007) showed that the regret has the logarithmic upper bound $O(\log T)$.
However, traditional online algorithms have some limitations. Firstly, they cannot access the full gradient to update the parameter vector in each iteration: online methods are sequential, using one observation, or a mini-batch for acceleration (Cotter et al., 2011), in each iteration. As a consequence, online algorithms suffer a slower convergence rate than traditional batch learning algorithms: $O(1/\sqrt{T})$ for general convex losses and $O(1/T)$ for strongly convex losses (Shalev-Shwartz and Ben-David, 2014). In comparison, offline gradient descent enjoys a linear (geometric) convergence rate on strongly convex and smooth losses. More importantly, the standard online algorithms, such as stochastic gradient descent, are not able to exploit the sparse structure of the feature vector, i.e. they cannot select features and recover the support of the true signal.
In this paper, we introduce a new framework for online learning, related to the statistical query model (Kearns, 1998; Chu et al., 2007). We will give more details about our new framework in Section 2.
1.1 Related Work
Online optimization and regularization. To cope with high dimensional data (e.g. $p \gg n$), various feature selection methods have been proposed to exploit the sparse structure of the coefficient vector. For instance, the $\ell_1$ regularization has been widely used in linear regression as a sparsity inducing penalty. Several algorithms were also designed to solve the feature selection problem in the online scenario. For online convex optimization, there are two main lines of research. One is the Forward-Backward-Splitting method (Duchi and Singer, 2009), building a framework for online proximal gradient (OPG). The other is Xiao's Regularized Dual Averaging method (RDA) (Xiao, 2010), which extended the primal-dual subgradient method of Nesterov (2009) to the online case. In addition, some online variants have been developed in recent years, such as OPG-ADMM and RDA-ADMM in Suzuki (2013). Independently, Ouyang et al. (2013) designed stochastic ADMM, the same algorithm as OPG-ADMM. Furthermore, truncated online gradient descent and truncated second order methods were proposed in Fan et al. (2018); Langford et al. (2009); Wu et al. (2017).
There is another line of research on online feature selection in the high dimensional case. In Yang et al. (2016), a new framework for online learning is proposed in which features arrive one by one, instead of observations, and one needs to decide which features to retain. Unlike traditional online learning, the disadvantage of this scenario is that one cannot build a model for prediction until all relevant features have been disclosed. In this paper, we assume that one can access observations sequentially in time, so we will not cover algorithms such as Yang et al. (2016) for comparison.
In Hazan et al. (2007), an online Newton method was proposed, which uses an idea similar to the running averages to update the inverse of the Hessian matrix. This method enjoys an $O(p^2)$ computational complexity per iteration, but does not address the issues of variable standardization and feature selection.
1.2 Our Contributions
In this paper, we bring the following contributions:

we introduce a new framework for online learning based on the statistical query model (Kearns, 1998; Chu et al., 2007), and we call the methods under our framework running averages methods. Many of the methods proposed in our framework enjoy a fast convergence rate and can recover the support of the true signal. Moreover, the proposed methods can address the issue of model selection, which is to obtain models with different sparsity levels and decide on the best model, e.g. using an AIC/BIC criterion. For example, Figure 1 shows the solution paths obtained by the proposed online least squares with thresholding method, as well as by the proposed online Lasso method.

we prove that the online versions of the algorithms in our framework are equivalent to their offline counterparts, thereby bringing forward all the theoretical guarantees existing in the literature for the corresponding offline methods.

we prove convergence and true feature recovery bounds for OLS with thresholding and FSA, and we prove a regret bound for OLS with thresholding.

we conduct extensive experiments on real and simulated data in both regression and classification to verify the theoretical bounds and to compare the proposed methods with popular online and offline algorithms.
Algorithm                      Convergence/Regret   Feature Selection   True Feature Recovery
SGD                            Slow                 No                  No
SADMM (Ouyang et al., 2013)    Slow                 Yes                 No
SIHT (Fan et al., 2018)        Slow                 Yes                 No
OFSA                           Fast                 Yes                 Yes
OLSth                          -                    Yes                 Yes
OMCP                           Fast                 Yes                 Yes
OElnet                         Fast                 Yes                 No
A brief summary of the convergence rates and computational characteristics of various online methods, including the proposed ones, is shown in Table 1.
Finally, we summarize the advantages and disadvantages of the proposed running averages algorithms: although the proposed online methods based on running averages sacrifice some memory and computation compared with classical online methods, they enjoy a fast convergence rate and high estimation accuracy. More importantly, the proposed methods can select features and recover the support of the true features with high accuracy, and they can extract models with any desired sparsity level at any time for model selection.
2 Setup and Notation
In this section, we provide a general framework for running averages. First, we establish notation and the problem setting. We denote vectors by lower case bold letters, such as $\mathbf{x}$, and scalars by lower case letters, e.g. $x$. A sequence of vectors is denoted by subscripts, i.e. $\mathbf{x}_1, \mathbf{x}_2, \dots$, and the entries of a vector are denoted by non-bold subscripted letters, like $x_j$. We use upper case bold letters to denote matrices, such as $\mathbf{X}$, and upper case letters for random variables, like $X$. Given a vector $\mathbf{x} \in \mathbb{R}^p$, we use the norms $\|\mathbf{x}\|_1 = \sum_{j=1}^p |x_j|$ and $\|\mathbf{x}\| = (\sum_{j=1}^p x_j^2)^{1/2}$.
2.1 Running Averages
The idea of running averages comes from the statistical query model and from the limitations of standard online methods. In mathematical statistics, given a distribution with unknown parameters and i.i.d. random variables drawn from it, a sufficient statistic contains all the information necessary for estimating the model parameters.

In big data learning, large datasets cannot fit in memory, and the online methods in the literature cannot recover the support of the true features. Motivated by these concerns, we propose the running averages framework, which contains two modules: a running averages module that is updated online as new data becomes available, and a model extraction module that can build a model with any desired sparsity level from the running averages. A diagram of the framework is shown in Figure 2.
Let $(\mathbf{x}_i, y_i)$, $i = 1, \dots, N$ be observations with $\mathbf{x}_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, and denote the data matrix $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_N)^T$ and $\mathbf{y} = (y_1, \dots, y_N)^T$. The running averages are the cumulative averages over the observations:

$$\boldsymbol{\mu}_x = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i, \quad \mu_y = \frac{1}{N}\sum_{i=1}^{N} y_i, \quad \mathbf{S}_x = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i\mathbf{x}_i^T, \quad \mathbf{S}_{xy} = \frac{1}{N}\sum_{i=1}^{N} y_i\mathbf{x}_i, \quad S_y = \frac{1}{N}\sum_{i=1}^{N} y_i^2,$$

together with the sample size $N$. The running averages can be updated in an incremental manner, for example

$$\boldsymbol{\mu}_x \leftarrow \boldsymbol{\mu}_x + \frac{1}{N+1}\left(\mathbf{x}_{N+1} - \boldsymbol{\mu}_x\right), \qquad (3)$$

similar to the procedure from Chapter 2.5 of Sutton and Barto (1998).
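The incremental update (3) extends to all of the running averages. The sketch below (our own naming, NumPy-based) maintains the five quantities one example at a time:

```python
import numpy as np

class RunningAverages:
    """Maintains the running averages mu_x, mu_y, Sx, Sxy, Sy and the count N.

    A minimal sketch of the running averages module; the attribute names
    mirror the quantities defined above but are our own choice.
    """

    def __init__(self, p):
        self.N = 0
        self.mu_x = np.zeros(p)        # mean of x
        self.mu_y = 0.0                # mean of y
        self.Sx = np.zeros((p, p))     # mean of x x^T
        self.Sxy = np.zeros(p)         # mean of y x
        self.Sy = 0.0                  # mean of y^2

    def update(self, x, y):
        # Incremental mean update: mu <- mu + (new - mu) / (N + 1), cf. Eq. (3)
        self.N += 1
        a = 1.0 / self.N
        self.mu_x += a * (x - self.mu_x)
        self.mu_y += a * (y - self.mu_y)
        self.Sx += a * (np.outer(x, x) - self.Sx)
        self.Sxy += a * (y * x - self.Sxy)
        self.Sy += a * (y * y - self.Sy)
```

After streaming through a dataset, each attribute equals the corresponding batch average computed on the full data.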
The running averages have the following advantages: a) they contain all the sample information necessary for model estimation; b) their size does not increase with the sample size $N$; c) they can be used in the online learning setting because they can be updated one example at a time.
2.2 Data Standardization
Data standardization is an important procedure in real data analysis, especially for feature selection, because a feature could have an arbitrary scale (unit of measure) and the scale should not influence its importance in the model. For this purpose, the data matrix and the response vector are usually standardized by removing the mean, and is further standardized by bringing all columns to the same scale. However, because we discard the data and only use the running averages, we will need to standardize the running averages.
Denote by $\sigma_j$ the sample standard deviation of feature $j$. From the running averages we can estimate it as

$$\sigma_j = \sqrt{(\mathbf{S}_x)_{jj} - \mu_{xj}^2},$$

in which $(\mathbf{S}_x)_{jj}$ is the $j$-th diagonal entry of the matrix $\mathbf{S}_x$. Then denote by $\mathbf{D} = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_p^{-1})$ the diagonal matrix containing the inverses of the standard deviations on the diagonal. Denoting by $\tilde{\mathbf{X}} = (\mathbf{X} - \mathbf{1}\boldsymbol{\mu}_x^T)\mathbf{D}$ the standardized data matrix and by $\tilde{\mathbf{y}} = \mathbf{y} - \mu_y\mathbf{1}$ the centered response, we obtain the running averages of the standardized dataset:

$$\tilde{\mathbf{S}}_x = \mathbf{D}(\mathbf{S}_x - \boldsymbol{\mu}_x\boldsymbol{\mu}_x^T)\mathbf{D}, \qquad (4)$$

$$\tilde{\mathbf{S}}_{xy} = \mathbf{D}(\mathbf{S}_{xy} - \mu_y\boldsymbol{\mu}_x). \qquad (5)$$

For convenience, hereinafter we will still write $\mathbf{S}_x$ and $\mathbf{S}_{xy}$ for the running averages after standardization.
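The standardization of Equations (4) and (5) can be applied directly to stored running averages, without revisiting the data. A sketch (function and variable names are ours):

```python
import numpy as np

def standardize_averages(mu_x, mu_y, Sx, Sxy):
    """Standardize the running averages directly, cf. Eqs. (4)-(5).

    Returns the running averages of the centered, unit-variance data matrix
    and the centered response.
    """
    sigma = np.sqrt(np.diag(Sx) - mu_x ** 2)   # per-feature standard deviation
    d = 1.0 / sigma                            # diagonal of D
    Sx_t = (Sx - np.outer(mu_x, mu_x)) * np.outer(d, d)   # D (Sx - mu mu^T) D
    Sxy_t = d * (Sxy - mu_y * mu_x)                       # D (Sxy - mu_y mu_x)
    return Sx_t, Sxy_t
```

The result matches what one would get by explicitly centering and scaling the data and recomputing the averages.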
3 Algorithms
In this section, we propose several running averages-based online algorithms. First, we design online least squares based on running averages, which can be used for feature selection by thresholding. We also propose online feature selection with annealing (OFSA) to solve the constrained least squares problem. Then we consider some regularized models, such as the Lasso, Elastic Net, and Minimax Concave Penalty. To simplify notation, we write OLS for online least squares, OLSth for online least squares with thresholding, OLasso for online Lasso, OElnet for online Elastic Net, and OMCP for online minimax concave penalty.
3.1 Preliminaries
Before introducing the running averages-based algorithms, we prove that these online algorithms are equivalent to their offline counterparts. In our running averages framework, we share the same objective loss function with offline learning, which is the key point in proving the equivalence.

Proposition. Consider the following penalized regression problem:

$$\min_{\mathbf{w}} \frac{1}{2N}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + P(\mathbf{w}; \lambda), \qquad (6)$$

in which $\mathbf{w}$ is the coefficient vector and $P(\mathbf{w}; \lambda)$ is a penalty function. It is equivalent to the following online optimization problem based on running averages:

$$\min_{\mathbf{w}} \frac{1}{2}\mathbf{w}^T\mathbf{S}_x\mathbf{w} - \mathbf{S}_{xy}^T\mathbf{w} + \frac{1}{2}S_y + P(\mathbf{w}; \lambda). \qquad (7)$$

Indeed, the loss function in (6) can be rewritten as

$$\frac{1}{2N}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = \frac{1}{2}\mathbf{w}^T\mathbf{S}_x\mathbf{w} - \mathbf{S}_{xy}^T\mathbf{w} + \frac{1}{2}S_y,$$

in which $\mathbf{S}_x = \mathbf{X}^T\mathbf{X}/N$, $\mathbf{S}_{xy} = \mathbf{X}^T\mathbf{y}/N$ and $S_y = \mathbf{y}^T\mathbf{y}/N$ are running averages. Thus, the offline learning problem is equivalent to the running averages-based optimization.
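The identity behind the proposition can be checked numerically; the snippet below compares the offline squared loss with the quadratic form in the running averages (the penalty is omitted, since it is identical on both sides; data and constants are our own toy example):

```python
import numpy as np

# Numerical check of the offline/online loss equivalence.
rng = np.random.default_rng(0)
N, p = 100, 5
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
w = rng.normal(size=p)

Sx, Sxy, Sy = X.T @ X / N, X.T @ y / N, y @ y / N

offline = 0.5 * np.mean((y - X @ w) ** 2)        # (1/2N) ||y - X w||^2
online = 0.5 * w @ Sx @ w - Sxy @ w + 0.5 * Sy   # quadratic form in the averages

assert np.isclose(offline, online)
```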
3.2 Online Least Squares
In OLS, we need to solve the normal equations $\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$. Since $\mathbf{X}^T\mathbf{X} = N\mathbf{S}_x$ and $\mathbf{X}^T\mathbf{y} = N\mathbf{S}_{xy}$ can be computed from the running averages, we obtain:

$$\mathbf{w} = \mathbf{S}_x^{-1}\mathbf{S}_{xy}. \qquad (8)$$

Thus, online least squares is equivalent to offline least squares.
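In practice this amounts to solving a $p \times p$ linear system built from the running averages; a sketch, which agrees with the offline least squares solution:

```python
import numpy as np

def online_least_squares(Sx, Sxy):
    """Solve the normal equations Sx w = Sxy from the running averages."""
    return np.linalg.solve(Sx, Sxy)
```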
3.3 Online Least Squares with Thresholding
The OLSth is aimed at solving the following constrained minimization problem:

$$\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 \quad \text{s.t. } \|\mathbf{w}\|_0 \le k. \qquad (9)$$

It is a nonconvex and NP-hard problem because of the sparsity constraint. Here, we propose a three-step procedure to solve it approximately: first, we use online least squares to estimate the coefficient vector; then we keep the $k$ variables with largest coefficient magnitudes $|w_j|$ and remove the rest; finally, we use least squares to refit the model on the subset of selected features. The prototype algorithm is described in Algorithm 1. In the high dimensional case $p > N$, we can use the ridge regression estimator in the first step.
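The three-step procedure above can be sketched as follows (the `ridge` argument covering the high dimensional first step is our own naming):

```python
import numpy as np

def olsth(Sx, Sxy, k, ridge=0.0):
    """OLSth sketch: OLS (or ridge) fit, keep the k largest |w_j|, refit."""
    p = Sx.shape[0]
    # Step 1: full least squares (ridge-regularized if ridge > 0)
    w = np.linalg.solve(Sx + ridge * np.eye(p), Sxy)
    # Step 2: indices of the k largest coefficient magnitudes
    S = np.sort(np.argsort(np.abs(w))[-k:])
    # Step 3: refit by least squares on the selected features only
    w_k = np.zeros(p)
    w_k[S] = np.linalg.solve(Sx[np.ix_(S, S)], Sxy[S])
    return w_k
```

On well-conditioned data with a strong sparse signal, the top-$k$ coefficients of the full fit coincide with the true support.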
3.4 Online Feature Selection with Annealing
Unlike OLSth, OFSA is an iterative thresholding algorithm that simultaneously solves the coefficient estimation problem and the feature selection problem. The main ideas in OFSA are: 1) use an annealing plan to lessen the greediness in reducing the dimensionality from $p$ to $k$; 2) remove irrelevant variables along the way to facilitate computation. The algorithm starts with an initial parameter vector, generally $\mathbf{w} = 0$, and then alternates two basic steps. One step updates the parameters to minimize the loss $L(\mathbf{w})$ by gradient descent,

$$\mathbf{w} \leftarrow \mathbf{w} - \eta\nabla L(\mathbf{w}),$$

and the other is a feature selection step that keeps only the variables with largest $|w_j|$ and removes the rest, where an annealing schedule decides the number $M_e$ of features kept at iteration $e$.
More details about the offline FSA algorithm, such as applications and theoretical analysis, are given in Barbu et al. (2017). For the square loss, the computation of the gradient

$$\nabla L(\mathbf{w}) = \mathbf{S}_x\mathbf{w} - \mathbf{S}_{xy} \qquad (10)$$

falls into our running averages framework. Thus, we derive OFSA, which is equivalent to the offline FSA from Barbu et al. (2017). The algorithm is summarized in Algorithm 2.
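A sketch of the two alternating steps, using the gradient (10). The annealing schedule below, with speed parameter `mu`, is one plausible choice in the spirit of FSA, not necessarily the exact schedule of Barbu et al. (2017):

```python
import numpy as np

def ofsa(Sx, Sxy, k, eta=0.9, n_iter=100, mu=10.0):
    """OFSA sketch: gradient steps on the running-average loss, while an
    annealing schedule shrinks the number of kept features from p to k."""
    p = Sx.shape[0]
    w = np.zeros(p)
    keep = np.arange(p)                       # currently kept features
    for e in range(1, n_iter + 1):
        A, b = Sx[np.ix_(keep, keep)], Sxy[keep]
        w[keep] -= eta * (A @ w[keep] - b)    # gradient step, cf. Eq. (10)
        # Annealing: number of features kept at iteration e (assumed schedule)
        Me = max(k, int(k + (p - k) *
                        max(0.0, (n_iter - 2.0 * e) / (2.0 * e * mu + n_iter))))
        order = np.argsort(np.abs(w[keep]))[::-1][:Me]
        new_keep = np.sort(keep[order])
        dropped = np.setdiff1d(keep, new_keep)
        w[dropped] = 0.0                      # remove irrelevant variables
        keep = new_keep
    return w
```

Because the gradient only involves $\mathbf{S}_x$ and $\mathbf{S}_{xy}$, each iteration costs $O(p^2)$ at most, independent of $N$.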
3.5 Online Regularization Methods
Penalized methods can also be used to select features, and we can map them into our running averages framework. A popular one is the Lasso estimator (Tibshirani, 1996), which solves the convex optimization problem

$$\min_{\mathbf{w}} \frac{1}{2N}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|_1, \qquad (11)$$

in which $\lambda > 0$ is a tuning parameter.
Besides the Lasso, SCAD (Fan and Li, 2001), the Elastic Net (Zou and Hastie, 2005) and MCP (Zhang, 2010) were proposed to deal with the variable selection and estimation problem. Here, we use a gradient-based method with a thresholding operator $\Theta(\cdot; \lambda)$ to solve the regularized loss minimization problems (She et al., 2009). For Lasso and Elastic Net, $\Theta$ is the soft thresholding operator, and for MCP,

$$\Theta(z; \lambda) = \begin{cases} 0 & \text{if } |z| \le \lambda, \\ \mathrm{sign}(z)\,\dfrac{|z| - \lambda}{1 - 1/b} & \text{if } \lambda < |z| \le b\lambda, \\ z & \text{if } |z| > b\lambda, \end{cases} \qquad (12)$$

in which $b > 1$ is a constant. The general algorithm is given in Algorithm 3.
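The MCP operator and the generic thresholded gradient iteration of Algorithm 3 can be sketched as follows (soft thresholding would be swapped in for Lasso/Elastic Net; step size and iteration count are our own choices):

```python
import numpy as np

def mcp_threshold(z, lam, b=3.0):
    """MCP thresholding operator, cf. Eq. (12); b > 1 is a constant."""
    z = np.asarray(z, dtype=float)
    mid = np.sign(z) * (np.abs(z) - lam) / (1.0 - 1.0 / b)
    out = np.where(np.abs(z) <= lam, 0.0, mid)
    return np.where(np.abs(z) > b * lam, z, out)

def thresholded_gradient(Sx, Sxy, lam, theta=mcp_threshold,
                         eta=0.5, n_iter=500):
    """Algorithm 3 sketch: a gradient step on the running-average loss
    followed by the thresholding operator theta."""
    w = np.zeros(Sx.shape[0])
    for _ in range(n_iter):
        w = theta(w - eta * (Sx @ w - Sxy), eta * lam)
    return w
```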
3.6 Online Classification Methods
The aforementioned algorithms can not only select features for regression, but can also be used for classification, even though they are based on the squared loss. In fact, for the two-class problem with labels $+1$ and $-1$, the coefficient vector for classification from linear least squares is proportional to the coefficient vector given by linear discriminant analysis without intercept (Friedman et al., 2001). Moreover, one can use the Lasso method to select variables for classification under some assumptions (Neykov et al., 2016). We will give the theoretical guarantees in Section 4.
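The proportionality to the LDA direction quoted above from Friedman et al. (2001) can be verified numerically; the data and constants below are our own illustration:

```python
import numpy as np

# Two balanced classes coded +1/-1; least squares on centered data
# (i.e. intercept handled) vs. the LDA direction S_W^{-1}(mu1 - mu0).
rng = np.random.default_rng(4)
n, p = 400, 5
y = np.repeat([-1.0, 1.0], n // 2)
X = rng.normal(size=(n, p)) + 0.5 * y[:, None]   # class-dependent mean shift

Xc, yc = X - X.mean(axis=0), y - y.mean()
w_ls, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

mu0, mu1 = X[y < 0].mean(axis=0), X[y > 0].mean(axis=0)
Sw = ((X[y < 0] - mu0).T @ (X[y < 0] - mu0) +
      (X[y > 0] - mu1).T @ (X[y > 0] - mu1))     # pooled within-class scatter
w_lda = np.linalg.solve(Sw, mu1 - mu0)

# The two vectors point in the same direction (cosine similarity near 1)
cos = w_ls @ w_lda / (np.linalg.norm(w_ls) * np.linalg.norm(w_lda))
assert cos > 0.999
```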
3.7 Memory and Computational Complexity
In general, the memory complexity of the running averages is $O(p^2)$ because $\mathbf{S}_x$ is a $p \times p$ matrix. The computational complexity of maintaining the running averages is $O(p^2)$ per example. Except for OLSth, the computational complexity of obtaining the model from the running averages is $O(p^2)$ per iteration, with a limited number of iterations. As for OLSth, it is $O(p^3)$ if done by Gaussian elimination, or less if done using an iterative method that takes many fewer than $p$ iterations. We conclude that the running averages storage does not depend on the sample size $N$, and the computation is linear in $N$. Hence, when $N \gg p$, compared to batch learning algorithms, the running averages based methods need less memory and have lower computational complexity. At the same time, they can achieve the same convergence rate as the batch learning algorithms.
3.8 Model Adaptation
Detecting changes in the underlying model and rapidly adapting to them are common problems in online learning, and some applications are based on varying-coefficient models (Javanmard, 2017). Our running averages online methods can adapt to coefficient changes in large scale data streams. For that, the update equation (3) can be regarded in the more general form

$$\boldsymbol{\mu}_x \leftarrow (1 - \alpha)\boldsymbol{\mu}_x + \alpha\mathbf{x}_{N+1}, \qquad (13)$$

where we only show one of the running averages for illustration, but the same type of update is used for all of them.

The original running averages use $\alpha = 1/(N+1)$, which gives all observations equal weight in the running average. For coefficient-varying models, we use a larger value of $\alpha$ that gives more weight to the recent observations. However, too much adaptation is also not desirable, because then the model will not be able to recover weak coefficients that can only be recovered given sufficiently many observations. More details about simulations and applications are covered in Section 5.
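A minimal illustration of (13) on a single scalar average: a constant rate $\alpha$ (exponential forgetting) tracks a mean shift, while the $1/N$ rate averages over both regimes (the stream values are our own toy example):

```python
# The mean of the stream shifts from 0 to 5 halfway through.
alpha = 0.05
ema, avg, N = 0.0, 0.0, 0
stream = [0.0] * 200 + [5.0] * 200
for x in stream:
    N += 1
    avg += (x - avg) / N        # original running average, alpha = 1/N
    ema += alpha * (x - ema)    # adaptive update with constant alpha

# Equal weighting lands between the two regimes; forgetting tracks the shift.
assert abs(avg - 2.5) < 1e-6
assert abs(ema - 5.0) < 0.01
```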
4 Theoretical Analysis
In this section we give the theoretical analysis of our methods. First, because of Proposition 3.1, we have the equivalence of the online penalized models, including Lasso, Elastic Net, SCAD and MCP, with their offline counterparts, and thus all their theoretical guarantees of convergence, consistency, oracle inequalities, etc., carry over to their online counterparts.

We will then show that OLSth and OFSA can recover the support of the true features with high probability, and provide a regret bound analysis for OLSth. The main ideas of our proofs come from Yuan et al. (2014). Finally, we give a theoretical justification for the support recovery of our method in classification. The proofs are given in the Appendix.
The first result we present is: when the sample size is large enough, OLSth can recover the support of the true features with high probability. In our theorem the data is not normalized and the features do not have the same scale; thus, we consider data normalization in our theoretical analysis. Although the intercept is necessary in applications, we do not cover it here.

Proposition. Suppose we have the linear model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$

where $\mathbf{X}$ is the data matrix, whose rows and the entries of the noise $\boldsymbol{\varepsilon}$ are independently drawn from normal distributions. Suppose the sample size satisfies
(14) 
Then, with high probability, the index set of the top $k$ values of $|\hat{\beta}_j|$ is exactly the true support, where $\hat{\boldsymbol{\beta}}$ is the OLS estimate.

Proposition 4 gives a theoretical guarantee of true feature recovery for OLSth. We can observe that the probability of true feature recovery does not depend on the true sparsity level. We will verify this by numerical experiments in the next section. Here, we also give a theoretical guarantee for the data standardization case.

Remark. Under analogous conditions on the standardized quantities, with high probability the index set of the top $k$ values of $|\hat{\beta}_j|$ is exactly the true support, where $\hat{\boldsymbol{\beta}}$ is the OLS estimate on the standardized data.

Theorem (True feature recovery for OLSth). With the same notation as Proposition 4, if
(15) 
where the constant involved is the largest diagonal value of $\mathbf{S}_x$, then with high probability the index set of the top $k$ values of $|\hat{\beta}_j|$ is exactly the true support.

Next, we consider the theoretical guarantees of true feature recovery for the OFSA algorithm. First, we give the definition of restricted strong convexity/smoothness.

Definition (Restricted Strong Convexity/Smoothness). For an integer $s > 0$, a differentiable function $f$ is restricted strongly convex (RSC) with parameter $\rho_s^-$ and restricted strongly smooth (RSS) with parameter $\rho_s^+$ if there exist $\rho_s^+ \ge \rho_s^- > 0$ such that, for all $\mathbf{w}, \mathbf{w}'$ with $\|\mathbf{w} - \mathbf{w}'\|_0 \le s$,
$$\frac{\rho_s^-}{2}\|\mathbf{w}' - \mathbf{w}\|^2 \le f(\mathbf{w}') - f(\mathbf{w}) - \nabla f(\mathbf{w})^T(\mathbf{w}' - \mathbf{w}) \le \frac{\rho_s^+}{2}\|\mathbf{w}' - \mathbf{w}\|^2. \qquad (16)$$
In the linear regression case, the RSC/RSS conditions are equivalent to a restricted isometry property (RIP):

$$\rho_s^-\|\mathbf{w}\|^2 \le \mathbf{w}^T\mathbf{S}_x\mathbf{w} \le \rho_s^+\|\mathbf{w}\|^2 \quad \text{for all } \mathbf{w} \text{ with } \|\mathbf{w}\|_0 \le s. \qquad (17)$$

And in the low dimensional case, the RIP condition degenerates to eigenvalue bounds on $\mathbf{S}_x$:

$$\rho^-\|\mathbf{w}\|^2 \le \mathbf{w}^T\mathbf{S}_x\mathbf{w} \le \rho^+\|\mathbf{w}\|^2 \quad \text{for all } \mathbf{w}.$$
With the same conditions as Proposition 4, let be an arbitrary sparse vector, so . Let be the OFSA coefficient vector at iteration , be its support, and . If is a differentiable function which is convex and smooth, then for any learning rate , we have
where the constant depends on the learning rate and the RSC/RSS parameters.

Theorem (Convergence of OFSA). With the same assumptions as Proposition 4, and under the stated conditions on the learning rate and the RSC/RSS parameters, let $\mathbf{D}$ be the diagonal matrix containing the true standard deviations of the features. Then, with high probability, the OFSA coefficient vector satisfies
Because for any and , we have . Thus, we can get that and . Then, by using Proposition 4 recursively, we get the upper bound of the when the dimension of decreases from to . At the time period , we have
and at the time period , we also have
in which the two quantities are the numbers of selected features at the respective time periods. Thus, we have
Because we have , and , we get
Applying the same idea repeatedly all the way to we get
Since , where , we have
For the first term we have
and because the data is standardized, from the Corollary in Appendix A we get
Therefore, with probability , we have
We can also use the Lemma in Appendix A to get
So, with probability we have
and therefore
Note that the dimension of the coefficient vector reduces from $p$ to $k$; thus we apply Proposition 4 recursively with varying dimension. Now we show that the OFSA algorithm can recover the support of the true features with high probability.

Corollary (True feature recovery for OFSA). Under the conditions of Theorem 4, let
Then, after the stated number of iterations, the OFSA algorithm outputs a coefficient vector with the correct support, with high probability. Finally, we consider the regret bound for the OLS and OLSth algorithms. In fact, all the feature selection algorithms we mentioned degenerate to OLS once the true features are selected. First, we define the regret for a sparse model with sparsity level $k$:

$$R(T) = \sum_{t=1}^{T} f(\mathbf{w}_t; \mathbf{x}_t, y_t) - \min_{\|\mathbf{w}\|_0 \le k} \sum_{t=1}^{T} f(\mathbf{w}; \mathbf{x}_t, y_t), \qquad (18)$$

in which $\mathbf{w}_t$ is the coefficient vector at step $t$, with $\|\mathbf{w}_t\|_0 \le k$.
Observe that for the square loss, the loss functions from (18) are twice continuously differentiable. We will need the following assumptions:
Assumption 1
Given , then we have satisfy
Assumption 2
Given , there exist constants and such that and .
Theorem (Regret of OLS). Under Assumptions 1 and 2, the regret of OLS satisfies:
Theorem (Regret of OLSth). With Assumptions 1 and 2 holding, there exists a constant such that if the sample size satisfies
(19) 
where , then with probability at least the regret of OLSth satisfies:
Theoretical guarantees for feature selection in classification. Proposition 2.3.6 and Remark 2.3.7 from Neykov et al. (2016) show that the least squares Lasso algorithm (and therefore the online Lasso) can recover the support of the true variables for a discrete response under some assumptions.

Theorem (True support recovery). Consider the special case of a single index model in which the design satisfies the irrepresentable condition. If the link functions are known, strictly increasing and continuous, then under the assumptions from Neykov et al. (2016) the least squares Lasso algorithm recovers the support of the true features correctly for a discrete response. The proof and more mathematical details can be found in Neykov et al. (2016). Based on this theorem, we have theoretical guarantees of support recovery for some of our running averages-based online methods in classification.
5 Experiments
In this section we evaluate the performance of the proposed algorithms and compare them with offline learning methods and some standard stochastic algorithms. First, we present the results of numerical experiments on synthetic data, comparing the performance in feature selection and prediction. We also provide regret plots for the running averages based algorithms and compare them with classical online algorithms. Finally, we present an evaluation on real data. All simulation experiments are run on a desktop computer with a Core i5-4460S CPU and 16GB memory.
5.1 Experiments for Simulated Data
Here, we generate the simulated data with uniformly correlated predictors: given a scalar $\alpha$, we generate $z_i \sim N(0, 1)$ and $\mathbf{u}_i \sim N(0, \mathbf{I}_p)$ independently, then we set

$$\mathbf{x}_i = \alpha z_i\mathbf{1} + \mathbf{u}_i.$$

Finally we obtain the data matrix $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_n)^T$. It is easy to verify that the correlation between any pair of predictors is $\alpha^2/(1 + \alpha^2)$. We set $\alpha = 1$ in our experiments, thus the correlation between any two variables is 0.5. Given $\mathbf{X}$, the dependent response is generated from the following linear models, for regression and classification respectively,
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad (20)$$

$$\mathbf{y} = \mathrm{sign}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}), \qquad (21)$$

where $\boldsymbol{\beta}$ is a $p$-dimensional sparse parameter vector whose entries are zero except for the true features, which are equal to the signal strength $\beta^*$. Observe that the classification data cannot be perfectly separated by a linear model.
Table 2: Regression results, averaged over 100 runs. Within each block, rows correspond to increasing sample sizes; "-" marks settings where the method could not be run.

Variable Detection Rate (%)
                Lasso   TSGD    SADMM   OLSth   OFSA    OMCP    OElnet
strong signal   32.14   11.22   18.10   77.40   99.81   73.71   32.12
                46.05   11.22   41.23   100     100     98.02   45.19
                72.40   11.22   65.78   100     100     100     72.42
weak signal     14.09   10.89   13.53   10.11   12.40   15.55   14.08
                31.58   10.89   19.80   22.48   32.47   32.32   31.54
                81.93   10.89   11.30   80.55   85.14   84.86   81.80
                98.66   10.89   10.80   98.94   99.27   99.26   98.71
                -       10.89   -       100     100     100     100

RMSE
                Lasso   TSGD    SADMM   OLSth   OFSA    OMCP    OElnet
strong signal   11.63   23.15   95.05   5.592   1.136   6.282   11.61
                9.464   13.45   93.50   1.017   1.017   1.745   9.557
                6.07    13.34   94.92   1.003   1.003   1.003   6.042
weak signal     1.128   1.027   1.363   1.069   1.169   1.049   1.124
                1.009   1.007   1.370   1.025   1.006   1.005   1.006
                1.001   1.010   1.382   1.003   1.003   1.003   1.003
                0.999   1.008   1.383   0.998   0.998   0.998   0.998
                -       1.005   -       0.996   0.996   0.996   0.996

Time (s)
                Lasso   TSGD    SADMM   OLSth   OFSA    OMCP    OElnet  RAVE
strong signal   4.332   0.007   5.326   0.052   0.289   15.49   9.648   0.026
                26.91   0.019   15.73   0.051   0.288   13.86   7.113   0.076
                47.32   0.065   51.80   0.051   0.288   6.508   5.885   0.246
weak signal     5.353   0.006   6.703   0.052   0.288   13.20   9.741   0.026
                48.13   0.067   67.82   0.051   0.287   14.98   4.961   0.249
                452.2   0.672   679.7   0.051   0.287   15.93   5.120   2.458
                1172    2.001   2044    0.051   0.287   13.96   3.749   7.326
                -       6.651   -       0.051   0.288   7.352   1.726   24.36
The simulation is based on the following data parameter settings for the dimension $p$ and the number of true features $k$. We consider both weak and strong signal strengths. The sample size $n$ varies from 1000 upward for both the regression and classification settings. For regression, we compare our algorithms with SADMM (Ouyang et al., 2013) and the offline Lasso (Tibshirani, 1996). We also implemented the following truncated stochastic gradient descent (TSGD) (Fan et al., 2018; Wu et al., 2017):
where the truncation operator keeps only the $k$ largest entries of the coefficient vector in absolute value and sets the rest to zero.
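The TSGD baseline can be sketched as follows (learning rate and data are our own choices):

```python
import numpy as np

def tsgd(stream, p, k, lr=0.05):
    """Truncated SGD sketch: after each gradient step on the squared loss,
    keep only the k largest-magnitude coefficients and zero out the rest."""
    w = np.zeros(p)
    for x, y in stream:
        w += lr * (y - x @ w) * x            # SGD step on one example
        small = np.argsort(np.abs(w))[:-k]   # all but the top-k magnitudes
        w[small] = 0.0                       # truncation
    return w
```

The iterate is always $k$-sparse, but the truncation is re-evaluated at every step, so a feature zeroed early can re-enter later.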
For classification, we cover four methods for comparison: the OPG (Duchi and Singer, 2009) and RDA(Xiao, 2010) frameworks for elastic net, the first order online feature selection (FOFS) method(Wu et al., 2017) and the second order online feature selection (SOFS) method(Wu et al., 2017).
For each method, the sparsity controlling parameter is tuned to obtain the desired number of variables. This can be done directly for OFSA and OLSth, and indirectly through the penalty parameter $\lambda$ for the other methods. In RDA, OPG and SADMM, we used 200 values of $\lambda$ on an exponential grid and chose the $\lambda$ that induces $k'$ nonzero features, where $k'$ is the largest number of nonzero features smaller than or equal to $k$, the number of true features.
The following criteria are evaluated in the numerical experiments: the true variable detection rate (DR), the root of mean square error (RMSE) on the test data for regression, the area under ROC curve (AUC) on the test data in classification setting, and the running time (Time) of the algorithms.
The variable detection rate DR is defined as the average number of true variables that are correctly detected by an algorithm, divided by the number of true variables. So if $\hat{S}$ is the set of detected variables and $S^*$ is the set of true variables, then

$$DR = \frac{|\hat{S} \cap S^*|}{|S^*|}.$$
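The detection rate (reported as a percentage in the tables) can be computed as:

```python
def detection_rate(detected, true_vars):
    """Percentage of true variables present in the detected set."""
    detected, true_vars = set(detected), set(true_vars)
    return 100.0 * len(detected & true_vars) / len(true_vars)
```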
The results are presented in Tables 2 and 3. We replicated the experiments 100 times and present the average results. Compared to the batch learning method Lasso, in regression, the running averages online methods enjoy low memory complexity. Moreover, the largest datasets cannot fit in memory, hence we could not obtain experimental results for Lasso on them. Our methods take as input the running averages rather than the data matrix. The memory complexity of the running averages is $O(p^2)$, which is better than the $O(np)$ of batch learning in the setting $n \gg p$.
From the numerical experiments, we can draw the conclusion that none of the online methods we tested (RDA, OPG, SADMM, FOFS and SOFS) performs well in true feature recovery. Only the offline Lasso and the proposed running averages based online methods can recover the true signal with high probability. When the signal is weak, the running averages methods need a large sample size to recover the true signal, but they still outperform the batch learning methods and the other online methods in our experiments.
In prediction, most methods do well, except that in regression the existing methods (Lasso, TSGD and SADMM) do not work well when the signal is strong. In contrast, the proposed running averages methods perform very well in prediction regardless of whether the signal is weak or strong, in both regression and classification.
Finally, the computational complexity of obtaining the model from the running averages does not depend on the sample size $n$, but the time to update the running averages, shown as RAVE in Tables 2 and 3, does increase linearly with $n$. Indeed, we observe in Tables 2 and 3 that the running times of OFSA and OLSth do not change significantly with $n$. However, because of the need to tune the penalty parameter, OLasso, OElnet and OMCP take more time to run. The computational complexity of traditional online algorithms increases with the sample size $n$. This is especially true for OPG, RDA and SADMM, which take a large amount of time to tune the parameters for feature selection. When the sample size is very large, running these algorithms takes more than a day.
Table 3: Classification results, averaged over 100 runs. Within each block, rows correspond to increasing sample sizes n; "-" marks settings where the method could not be run.

Variable Detection Rate (%)
                FOFS    SOFS    OPG     RDA     OFSA    OLSth   OLasso  OMCP
strong signal   10.64   10.19   10.46   10.97   38.89   30.30   34.70   41.54
                10.64   9.95    10.42   10.34   67.67   59.32   56.18   67.52
                10.64   9.95    10.43   11.08   94.95   93.21   86.90   94.77
strong signal   13.40   10.19   10.00   10.37   19.41   15.93   22.55   23.81
                15.86   9.95    10.23   10.34   34.46   27.35   35.14   37.70
                17.36   9.95    10.32   10.91   64.84   56.42   61.07   64.95
                17.13   9.23    10.32   10.37   91.55   88.91   88.69   91.58
                17.72   9.91    -       -       99.97   99.94   99.88   99.97

AUC
                FOFS    SOFS    OPG     RDA     OFSA    OLSth   OLasso  OMCP
strong signal   0.995   0.992   0.992   0.990   0.995   0.990   0.996   0.996
                0.994   0.992   0.992   0.989   0.998   0.996   0.997   0.998
                0.994   0.992   0.992   0.990   1.000   1.000   0.999   1.000
strong signal   0.827   0.829   0.828   0.828   0.824   0.815   0.829   0.830
                0.827   0.829   0.829   0.829   0.831   0.827   0.832   0.832
                0.830   0.831   0.831   0.830   0.834   0.833   0.834   0.834
                0.826   0.828   0.828   0.827   0.833   0.833   0.833   0.833
                0.828   0.829   -       -       0.834   0.834   0.834   0.834

Time (s)
                FOFS    SOFS    OPG     RDA     OFSA    OLSth   OLasso  OMCP    RAVE
strong signal   0.001   0.001   0.490   0.848   0.005   0.001   0.080   0.160   0.247
                0.003   0.004   1.471   2.210   0.005   0.001   0.083   0.158   0.742
                0.010   0.015   4.900   6.118   0.005   0.001   0.079   0.159   2.478
strong signal   0.001   0.001   0.494   0.815   0.005   0.001   0.073   0.148   0.249
                0.003   0.004   1.481   2.093   0.005   0.001   0.074   0.152   0.743
                0.010   0.015   4.935   5.827   0.005   0.001   0.078   0.161   2.472
                0.030   0.044   14.81   17.31   0.005   0.001   0.073   0.164   7.446
                0.100   0.146   -       -       0.005   0.001   0.039   0.110   24.85
5.2 Evaluation of Theoretical Bounds
In this section we conduct a series of experiments to compare the theoretical bounds obtained in Section 4 with the reality obtained from simulations.
5.2.1 True Feature Recovery Analysis
In this section we experimentally evaluate the tightness of the bounds for OLSth from Proposition 4 and Theorem 4. For that, we use the regression data from Section 5.1, find the smallest experimental sample size $N$ such that all variables are correctly detected in at least a given fraction of the runs, and compare it with the corresponding bounds given by Equations (14) and (15). In most cases we used the same confidence parameter in Equations (14) and (15); however, in some cases we chose it as low as needed to obtain the desired theoretical probability in Proposition 4 and Theorem 4.