
# Adaptive Feature Selection: Computationally Efficient Online Sparse Linear Regression under RIP

Satyen Kale
Google Research, New York
This work was done while the author was at Yahoo Research, New York.
Zohar Karnin
Amazon, New York
zkarnin@gmail.com
Tengyuan Liang
University of Chicago
Booth School of Business
Tengyuan.Liang@chicagobooth.edu
Dávid Pál
Yahoo Research, New York
dpal@yahoo-inc.com
###### Abstract

Online sparse linear regression is an online problem where an algorithm repeatedly chooses a subset of coordinates to observe in an adversarially chosen feature vector, makes a real-valued prediction, receives the true label, and incurs the squared loss. The goal is to design an online learning algorithm with sublinear regret to the best sparse linear predictor in hindsight. Without any assumptions, this problem is known to be computationally intractable. In this paper, we make the assumption that the data matrix satisfies the restricted isometry property (RIP), and show that this assumption leads to computationally efficient algorithms with sublinear regret for two variants of the problem. In the first variant, the true label is generated according to a sparse linear model with additive Gaussian noise. In the second, the true label is chosen adversarially.

## 1 Introduction

In modern real-world sequential prediction problems, samples are typically high dimensional, and constructing the features may itself be a computationally intensive task. Therefore, due to computation and resource constraints, it is preferable to design sequential prediction algorithms that compute only a limited number of features for each new data example. One example of this situation, from Cesa-Bianchi et al. (2011), is medical diagnosis of a disease, in which each feature is the result of a medical test on the patient. Since it is undesirable to subject a patient to a battery of medical tests, we would like to adaptively design diagnostic procedures that rely on only a few, highly informative tests.

Online sparse linear regression (OSLR) is a sequential prediction problem in which an algorithm is allowed to see only a small subset of coordinates of each feature vector. The problem is parameterized by 3 positive integers: $d$, the dimension of the feature vectors; $k$, the sparsity of the linear regressors we compare the algorithm's performance to; and $k_0$, a budget on the number of features that can be queried in each round by the algorithm. Generally we have $k_0 \ll d$, and $k \le k_0$ but not significantly larger (our algorithms need $k_0 = \widetilde{\Omega}(k)$; throughout, the $\widetilde{O}(\cdot)$ and $\widetilde{\Omega}(\cdot)$ notation suppresses factors that are polylogarithmic in the natural parameters of the problem).

In the OSLR problem, the algorithm makes predictions over a sequence of $T$ rounds. In each round $t$, nature chooses a feature vector $x_t \in \mathbb{R}^d$, the algorithm chooses a subset $S_t \subseteq [d]$ of size at most $k_0$ and observes the corresponding coordinates of the feature vector. It then makes a prediction $\hat{y}_t \in \mathbb{R}$ based on the observed features, observes the true label $y_t$, and suffers loss $(y_t - \hat{y}_t)^2$. The goal of the learner is to make its cumulative loss comparable to that of the best $k$-sparse linear predictor in hindsight. The performance of the online learner is measured by the regret, which is defined as the difference between the two losses:

$$\mathrm{Regret}_T=\sum_{t=1}^T(y_t-\hat{y}_t)^2-\min_{w:\ \|w\|_0\le k}\ \sum_{t=1}^T(y_t-\langle x_t,w\rangle)^2.$$

The goal is to construct algorithms that enjoy regret that is sub-linear in $T$, the total number of rounds. A sub-linear regret implies that, in the asymptotic sense, the average per-round loss of the algorithm approaches the average per-round loss of the best $k$-sparse linear predictor.
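To make the benchmark concrete, the best $k$-sparse predictor in hindsight can be found by brute force for tiny $d$ by enumerating all size-$k$ supports; this sketch (our own illustration, with hypothetical helper names) is exponential in $d$, which is precisely why the hardness results discussed next matter.

```python
import itertools
import numpy as np

def best_k_sparse_loss(X, y, k):
    """min over ||w||_0 <= k of sum_t (y_t - <x_t, w>)^2, by enumerating
    all size-k supports and solving least squares on each support
    (exponential in d; illustration only)."""
    best = float(np.sum(y ** 2))  # loss of the all-zeros predictor
    for S in itertools.combinations(range(X.shape[1]), k):
        XS = X[:, S]
        w, *_ = np.linalg.lstsq(XS, y, rcond=None)
        best = min(best, float(np.sum((y - XS @ w) ** 2)))
    return best

def regret(X, y, y_hat, k):
    """Regret_T of a sequence of predictions y_hat against the best
    k-sparse linear predictor in hindsight."""
    return float(np.sum((y - y_hat) ** 2)) - best_k_sparse_loss(X, y, k)
```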

Sparse regression is in general a computationally hard problem. In particular, given a data matrix $X$ and a label vector $y$ as inputs, the offline problem of finding a $k$-sparse $w$ minimizing the error $\|Xw - y\|_2^2$ does not admit a polynomial time algorithm under standard complexity assumptions (Foster et al., 2015). This hardness persists even under the assumption that there exists a $k$-sparse $w^*$ with zero prediction error on all examples. Furthermore, the computational hardness is present even when the solution is allowed to have sparsity somewhat larger than $k$ and is required to minimize the error only approximately; see Foster et al. (2015) for details. The hardness result was extended to online sparse regression by Foster et al. (2016). They showed that for all $\delta > 0$ there exists no polynomial-time algorithm with regret $O(T^{1-\delta})$ unless $\mathsf{NP} \subseteq \mathsf{BPP}$.

Foster et al. (2016) posed the open question of what additional assumptions can be made on the data to make the problem tractable. In this paper, we answer this open question by providing efficient algorithms with sublinear regret under the assumption that the matrix of feature vectors satisfies the restricted isometry property (RIP) (Candes and Tao, 2005). It has been shown that if RIP holds and there exists a sparse linear predictor $w^*$ such that $y = \langle x, w^*\rangle + \eta$, where $\eta$ is independent noise, the offline sparse linear regression problem admits computationally efficient algorithms, e.g., Candes and Tao (2007). RIP and the related Restricted Eigenvalue Condition (Bickel et al., 2009) have been widely used as standard assumptions for theoretical analysis in the compressive sensing and sparse regression literature, in the offline case. In the online setting, it is natural to ask whether sparse regression avoids the computational difficulty under an appropriate form of the RIP condition. In this paper, we answer this question in the positive, both in the realizable setting and in the agnostic setting. As a by-product, we resolve the adaptive feature selection problem, since the efficient algorithms we propose adaptively choose a different "sparse" subset of features to query at each round. This is closely related to attribute-efficient learning (see the discussion in Section 1.2) and online model selection.

### 1.1 Summary of Results

We design polynomial-time algorithms for online sparse linear regression for two models of the sequence $(x_t, y_t)$. The first model is called the realizable model and the second the agnostic model. In both models, we assume that, after proper normalization, for all large enough $t$, the matrix $X_t$ formed from the first $t$ feature vectors satisfies the restricted isometry property. The two models differ in the assumptions on the labels $y_t$. The realizable model assumes that $y_t = \langle x_t, w^*\rangle + \eta_t$, where $w^*$ is $k$-sparse and $\eta_t$ is independent noise. In the agnostic model, $y_t$ can be arbitrary, and therefore the regret bounds we obtain are worse than in the realizable setting. The models and corresponding algorithms are presented in Sections 2 and 3 respectively. Interestingly enough, the algorithms and their corresponding analyses are completely different in the realizable and agnostic cases.

Our algorithms allow for somewhat more flexibility than the problem definition: they are designed to work with a budget $k_0$ on the number of features that can be queried that may be larger than the sparsity parameter $k$ of the comparator. The regret bounds we derive improve with increasing values of $k_0$. In the case when $k_0 = \widetilde{O}(k)$, the dependence on the dimension $d$ in the regret bounds is polynomial, as can be expected in limited feedback settings (this is analogous to the polynomial dependence on the number of arms in bandit settings). In the extreme case when $k_0 = d$, i.e., we have access to all the features, the dependence on the dimension $d$ in the regret bounds we prove is only logarithmic. The interpretation is that if we have full access to the features, but the goal is to compete with just sparse linear regressors, then the number of data points that need to be seen to achieve good predictive accuracy has only logarithmic dependence on $d$. This is analogous to the (offline) compressed sensing setting where the sample complexity bounds, under RIP, depend only logarithmically on $d$.

A major building block in the solution for the realizable setting (Section 2) is identifying, at any round, the best $k$-sparse linear predictor for the past data. This is done by solving a sparse regression problem on the observed data. The solution of this problem cannot be obtained by a simple application of, say, the Dantzig selector (Candes and Tao, 2007), since we do not observe the data matrix $X_t$, but only a subsample of its entries. Our algorithm is a variant of the Dantzig selector that incorporates random sampling into the optimization, and computes a near-optimal solution by solving a linear program. The resulting algorithm has a regret bound of $O(\log T)$, with polynomial dependence on the remaining problem parameters. This bound has optimal dependence on $T$, since even in the full information setting, where all features are observed, there is an $\Omega(\log T)$ lower bound (Hazan and Kale, 2014).

The algorithm for the agnostic setting relies on the theory of submodular optimization. The analysis in Boutsidis et al. (2015) shows that the RIP assumption implies that the set function defined as the minimum loss achievable by a linear regressor restricted to the set in question satisfies a property called weak supermodularity. Weak supermodularity is a relaxation of standard supermodularity that is still strong enough to yield performance bounds for the standard greedy feature selection algorithm for solving the sparse regression problem. We then employ a technique developed by Streeter and Golovin (2008) to construct an online learning algorithm that mimics the greedy feature selection algorithm. The resulting algorithm has a regret bound of $\widetilde{O}(T^{2/3})$. It is unclear if this bound has the optimal dependence on $T$: it is easy to prove a lower bound of $\Omega(\sqrt{T})$ on the regret using standard arguments for the multiarmed bandit problem.

### 1.2 Related work

A related setting is attribute-efficient learning (Cesa-Bianchi et al., 2011; Hazan and Koren, 2012; Kukliansky and Shamir, 2015). This is a batch learning problem in which the examples are generated i.i.d., and the goal is to simply output a linear regressor using only a limited number of features per example with bounded excess risk compared to the optimal linear regressor, when given full access to the features at test time. Since the goal is not prediction but simply computing the optimal linear regressor, efficient algorithms exist and have been developed by the aforementioned papers.

Without any assumptions, only inefficient algorithms for the online sparse linear regression problem are known (Zolghadr et al., 2013; Foster et al., 2016). Kale (2014) posed the open question of whether it is possible to design an efficient algorithm for the problem with a sublinear regret bound. This question was answered in the negative by Foster et al. (2016), who showed that efficiency can be obtained only under additional assumptions on the data. This paper shows that the RIP assumption yields tractability in the online setting just as it does in the batch setting.

In the realizable setting, the linear program at the heart of our algorithm is motivated by the Dantzig selector (Candes and Tao, 2007) and errors-in-variables regression (Rosenbaum and Tsybakov, 2010; Belloni et al., 2016). The problem of finding the best sparse linear predictor when only a sample of the entries of the data matrix is available is also discussed by Belloni et al. (2016) (see also the references therein). In fact, these papers solve a more general problem where we observe a matrix $Z$, rather than $X$, that is an unbiased estimator of $X$. While we could use their results in a black-box manner, they are tailored to the setting where the variance of each entry of $Z$ is constant, and it is difficult to obtain the exact dependence on this variance in their bounds. In our setting, this variance can be linear in the dimension of the feature vectors, and hence we wish to control the dependence on the variance in the bounds. Thus, we use an algorithm that is similar to the one in Belloni et al. (2016), and provide an analysis for it (in the appendix). As an added bonus, our algorithm requires solving a linear program rather than a conic or general convex program, and hence admits a more computationally efficient solution.

In the agnostic setting, the computationally efficient algorithm we propose is motivated by (online) supermodular optimization (Natarajan, 1995; Boutsidis et al., 2015; Streeter and Golovin, 2008). The algorithm is computationally efficient and enjoys sublinear regret under an RIP-like condition, as we show in Section 3. This result can be contrasted with the known computationally prohibitive algorithms for online sparse linear regression (Zolghadr et al., 2013; Foster et al., 2016), and the hardness result without RIP (Foster et al., 2015, 2016).

### 1.3 Notation and Preliminaries

For $n \in \mathbb{N}$, we denote by $[n]$ the set $\{1, 2, \dots, n\}$. For a vector $x$ in $\mathbb{R}^d$, denote by $x(i)$ its $i$-th coordinate. For a subset $S \subseteq [d]$, we use the notation $\mathbb{R}^S$ to indicate the vector space spanned by the coordinate axes indexed by $S$ (i.e., the set of all vectors supported on the set $S$). For a vector $x \in \mathbb{R}^d$, denote by $x(S)$ the projection of $x$ on $\mathbb{R}^S$. That is, the coordinates of $x(S)$ are

$$x(S)(i)=\begin{cases}x(i)&\text{if }i\in S,\\ 0&\text{if }i\notin S,\end{cases}\qquad\text{for }i=1,2,\dots,d.$$

Let $\langle x, y\rangle = \sum_{i=1}^d x(i)\,y(i)$ be the inner product of vectors $x$ and $y$.

For $p \in [1, \infty)$, the $\ell_p$-norm of a vector $x$ is denoted by $\|x\|_p = \left(\sum_{i=1}^d |x(i)|^p\right)^{1/p}$. Also, $\|x\|_\infty = \max_{i \in [d]} |x(i)|$, and $\|x\|_0$ is the number of non-zero coordinates of $x$.

The following definition will play a key role:

###### Definition 1 (Restricted Isometry Property, Candes and Tao (2007)).

Let $\epsilon \in (0,1)$ and $k \in [d]$. We say that a matrix $X \in \mathbb{R}^{n \times d}$ satisfies the restricted isometry property (RIP) with parameters $(\epsilon, k)$ if for any $w$ with $\|w\|_0 \le k$ we have

$$(1-\epsilon)\|w\|_2\le\frac{1}{\sqrt{n}}\|Xw\|_2\le(1+\epsilon)\|w\|_2.$$

One can show that RIP holds with overwhelming probability if $n = \Omega(\epsilon^{-2}k\log d)$ and each row of the matrix $X$ is sampled independently from an isotropic sub-Gaussian distribution. In the realizable setting, the sub-Gaussian assumption can be relaxed to incorporate heavy-tailed distributions via the "small ball" analysis introduced in Mendelson (2014), since we only require a one-sided lower isometry property.
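For intuition, the RIP constant of a small matrix can be checked directly from Definition 1: on each size-$k$ support, the ratio $\|Xw\|_2/(\sqrt{n}\|w\|_2)$ ranges exactly over the singular values of the scaled column submatrix. The following brute-force sketch (our own illustration, exponential in $d$, with a hypothetical helper name) computes the smallest admissible $\epsilon$:

```python
import itertools
import numpy as np

def rip_epsilon(X, k):
    """Smallest eps such that X satisfies RIP with parameters (eps, k).

    For each size-k support S, ||Xw||_2 / (sqrt(n) ||w||_2) over w
    supported on S ranges over [sigma_min, sigma_max] of the scaled
    submatrix X[:, S] / sqrt(n), so extreme singular values suffice.
    """
    n = X.shape[0]
    s_lo, s_hi = np.inf, 0.0
    for S in itertools.combinations(range(X.shape[1]), k):
        svals = np.linalg.svd(X[:, S] / np.sqrt(n), compute_uv=False)
        s_lo = min(s_lo, svals[-1])
        s_hi = max(s_hi, svals[0])
    return max(1.0 - s_lo, s_hi - 1.0)
```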

### 1.4 Proper Online Sparse Linear Regression

We introduce a variant of online sparse regression (OSLR), which we call proper online sparse linear regression (POSLR). The adjective “proper” is to indicate that the algorithm is required to output a weight vector in each round and its prediction is computed by taking an inner product with the feature vector.

We assume that there is an underlying sequence of labeled examples $(x_t, y_t)$, $t = 1, 2, \dots$, in $\mathbb{R}^d \times \mathbb{R}$. In each round $t$, the algorithm behaves according to the following protocol:

1. Choose a vector $w_t \in \mathbb{R}^d$ such that $\|w_t\|_0 \le k_0$.

2. Choose a set $S_t \subseteq [d]$ of size at most $k_0$.

3. Observe $x_t(S_t)$ and $y_t$, and incur loss $(y_t - \langle x_t, w_t\rangle)^2$.

Essentially, the algorithm makes the prediction $\hat{y}_t = \langle x_t, w_t\rangle$ in round $t$. The regret after $T$ rounds of an algorithm with respect to a vector $w \in \mathbb{R}^d$ is

$$\mathrm{Regret}_T(w)=\sum_{t=1}^T(y_t-\langle x_t,w_t\rangle)^2-\sum_{t=1}^T(y_t-\langle x_t,w\rangle)^2.$$

The regret after $T$ rounds of an algorithm with respect to the best $k$-sparse linear regressor is defined as

$$\mathrm{Regret}_T=\max_{w:\ \|w\|_0\le k}\mathrm{Regret}_T(w).$$

Note that any algorithm for POSLR gives rise to an algorithm for OSLR. Namely, if an algorithm for POSLR chooses $w_t$ and $S_t$, the corresponding algorithm for OSLR queries the coordinates $S_t \cup \mathrm{supp}(w_t)$. The algorithm for OSLR queries at most $2k_0$ coordinates and has the same regret as the algorithm for POSLR.

Additionally, POSLR allows parameter settings which do not have corresponding counterparts in OSLR. Namely, we can consider the sparse "full information" setting where $k_0 = d$ and $k \ll d$.

We denote by $X_t$ the $t \times d$ matrix of the first $t$ unlabeled samples, i.e., the rows of $X_t$ are $x_1^T, x_2^T, \dots, x_t^T$. Similarly, we denote by $Y_t$ the vector of the first $t$ labels $(y_1, y_2, \dots, y_t)$. We use the shorthand notation $X$, $Y$ for $X_T$ and $Y_T$ respectively.

In order to get computationally efficient algorithms, we assume that for all $t \ge t_0$, the matrix $X_t$ satisfies the restricted isometry condition. The parameter $t_0$ and the RIP parameters will be specified later.

## 2 Realizable Model

In this section we design an algorithm for POSLR in the realizable model. In this setting we assume that there is a vector $w^*$ with $\|w^*\|_0 \le k$ such that the sequence of labels is generated according to the linear model

$$y_t=\langle x_t,w^*\rangle+\eta_t, \qquad (1)$$

where the $\eta_t$ are independent random variables with distribution $N(0, \sigma^2)$. We assume that the standard deviation $\sigma$, or an upper bound on it, is given to the algorithm as input. We also assume that $\|x_t\|_\infty \le 1$ for all $t$ and that $w^*$ is suitably bounded.

For convenience, we use $\eta = (\eta_1, \eta_2, \dots, \eta_T)$ to denote the vector of noise variables.

### 2.1 Algorithm

The algorithm maintains an unbiased estimate $\hat{X}_t$ of the matrix $X_t$. The rows of $\hat{X}_t$ are vectors $\hat{x}_1^T, \hat{x}_2^T, \dots, \hat{x}_t^T$, where $\hat{x}_s$ is an unbiased estimate of $x_s$. To construct the estimates, in each round $t$, the set $S_t$ is chosen uniformly at random from the collection of all subsets of $[d]$ of size $k_0$. The estimate is

$$\hat{x}_t=\frac{d}{k_0}\cdot x_t(S_t). \qquad (2)$$
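The estimate (2) is unbiased because each coordinate survives the uniform size-$k_0$ sampling with probability exactly $k_0/d$. For small $d$ this can be verified exactly by enumerating all supports; the sketch below is our own sanity check, not part of the algorithm.

```python
import itertools
import numpy as np

def expected_estimate(x, k0):
    """Exact E[(d/k0) * x(S)] over S drawn uniformly from the size-k0
    subsets of [d], computed by full enumeration (small d only)."""
    d = len(x)
    subsets = list(itertools.combinations(range(d), k0))
    acc = np.zeros(d)
    for S in subsets:
        xh = np.zeros(d)
        xh[list(S)] = x[list(S)]   # x(S): zero outside S
        acc += (d / k0) * xh       # the rescaled estimate (2)
    return acc / len(subsets)      # uniform average over subsets
```

Each coordinate $i$ appears in a $k_0/d$ fraction of the subsets, so the rescaling by $d/k_0$ makes the average come out to $x$ exactly.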

To compute the predictions of the algorithm, we consider the linear program

$$\text{minimize } \|w\|_1 \quad \text{s.t.}\quad \left\|\frac{1}{t}\hat{X}_t^T(Y_t-\hat{X}_t w)+\frac{1}{t}\hat{D}_t w\right\|_\infty\le C\sqrt{\frac{d\log(td/\delta)}{t\,k_0}}\left(\sigma+\frac{d}{k_0}\right). \qquad (3)$$

Here, $C$ is a universal constant, and $\delta \in (0,1)$ is the allowed failure probability. The matrix $\hat{D}_t$, defined in equation (5), is a diagonal matrix that offsets the bias on the diagonal of $\hat{X}_t^T\hat{X}_t$.

The linear program (3) is called the Dantzig selector. We denote its optimal solution by $\hat{w}_{t+1}$. (We define $\hat{w}_1 = 0$.)

Based on $\hat{w}_{t+1}$, we construct a $k$-sparse vector $\tilde{w}_{t+1}$. Let $i_1, i_2, \dots, i_d$ be the coordinates sorted in decreasing order of $|\hat{w}_{t+1}(i)|$, breaking ties according to their index. Let $\tilde{S}_{t+1} = \{i_1, i_2, \dots, i_k\}$ be the top $k$ coordinates. We define $\tilde{w}_{t+1}$ as

$$\tilde{w}_{t+1}=\hat{w}_{t+1}(\tilde{S}_{t+1}). \qquad (4)$$

The actual prediction vector $w_t$ is either zero, if $t = 1$, or $\tilde{w}_{t'}$ for some earlier round $t' \le t$, and it gets updated whenever $t$ is a power of $2$.

The algorithm queries at most $k_0$ features each round, and the linear program can be solved in polynomial time using the simplex method or an interior point method. The algorithm solves the linear program only $O(\log T)$ times, by using the same vector $w_t$ throughout the rounds $t = 2^j, 2^j+1, \dots, 2^{j+1}-1$. This lazy update improves both the computational aspects of the algorithm and the regret bound.
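Program (3) has the generic Dantzig-selector form $\min\|w\|_1$ subject to $\|b - Mw\|_\infty \le \lambda$, where under our reading of the constraint $M = \frac{1}{t}(\hat{X}_t^T\hat{X}_t - \hat{D}_t)$ and $b = \frac{1}{t}\hat{X}_t^T Y_t$. A minimal sketch of the standard LP reformulation via the split $w = u - v$ with $u, v \ge 0$ (variable names are ours):

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_lp(M, b, lam):
    """Solve min ||w||_1 s.t. ||b - M w||_inf <= lam as a linear program.

    With w = u - v, u, v >= 0, the objective ||w||_1 becomes 1^T u + 1^T v
    at the optimum, and the infinity-norm constraint becomes 2d linear
    inequalities.
    """
    d = M.shape[1]
    c = np.ones(2 * d)                         # objective: sum(u) + sum(v)
    A = np.hstack([M, -M])                     # A @ [u; v] = M (u - v)
    A_ub = np.vstack([A, -A])
    b_ub = np.concatenate([b + lam, lam - b])  # -lam <= M w - b <= lam
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * d), method="highs")
    assert res.success
    return res.x[:d] - res.x[d:]
```

For example, with $M = I$ the program shrinks each coordinate of $b$ toward zero by $\lambda$, the soft-thresholding behavior one expects from a Dantzig selector.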

### 2.2 Main Result

The main result in this section provides a logarithmic regret bound under the following assumptions (a more precise statement, with the exact dependence on the problem parameters, can be found in the appendix):

• The feature vectors have the property that for any $t \ge t_0$, the matrix $X_t$ satisfies the RIP condition with sparsity parameter $O(k)$ and a sufficiently small constant $\epsilon$.

• The underlying POSLR online prediction problem has a sparsity budget of $k$ and an observation budget of $k_0$.

• The model is realizable as defined in equation (1), with i.i.d. zero-mean Gaussian noise with standard deviation $\sigma$.

###### Theorem 2.

For any $\delta > 0$, with probability at least $1 - \delta$, Algorithm 1 satisfies

$$\mathrm{Regret}_T=O\left(k^2\log(d/\delta)\,(d/k_0)^3\log(T)\right).$$

The theorem asserts that an $O(\log T)$ regret bound is efficiently achievable in the realizable setting. Furthermore, when $k_0 = \Omega(d)$ the regret scales as $O(k^2\log(d/\delta)\log T)$, meaning that we do not necessarily require $k_0 = d$ to obtain a meaningful result. We note that the complete expression for arbitrary parameter values is given in (13) in the appendix.

The algorithm can be easily understood via the errors-in-variables equations

$$y_t=\langle x_t,w^*\rangle+\eta_t,\qquad \hat{x}_t=x_t+\xi_t,$$

with $\mathbb{E}[\xi_t] = 0$, where the expectation is taken over the random sampling introduced by the algorithm when performing feature exploration. The learner observes $y_t$ as well as the "noisy" feature vector $\hat{x}_t$, and aims to recover $w^*$.

As mentioned above, we (implicitly) need an unbiased estimator of $X_t^T X_t$. If we take $\hat{X}_t^T\hat{X}_t$, it is easy to verify that the off-diagonal entries are indeed unbiased; however, this is not the case for the diagonal. To this end we define $D_t$ as the diagonal matrix compensating for the sampling bias on the diagonal elements of $X_t^T X_t$,

$$D_t=\left(\frac{d}{k_0}-1\right)\cdot\mathrm{diag}(X_t^TX_t),$$

and the estimated bias from the observed data is

$$\hat{D}_t=\left(1-\frac{k_0}{d}\right)\cdot\mathrm{diag}(\hat{X}_t^T\hat{X}_t). \qquad (5)$$

Therefore, program (3) can be viewed as a Dantzig selector with plug-in unbiased estimates for $X_t^TX_t$ and $X_t^TY_t$, using only the limited observed features.
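The need for the diagonal correction can be checked exactly on a single row. The sketch below is our own illustration and uses a Bernoulli simplification of the sampling (each coordinate kept independently with probability $k_0/d$, rather than a uniform size-$k_0$ subset) so that the full expectation can be enumerated; under this variant the corrected estimator is exactly unbiased for $x x^T$.

```python
import itertools
import numpy as np

def expected_corrected_gram(x, k0):
    """Exact E[ xh xh^T - (1 - k0/d) * diag(xh xh^T) ] for one row x,
    where each coordinate is kept independently with prob p = k0/d and
    xh = (d/k0) * x(S).  Enumerates all 2^d inclusion patterns."""
    d = len(x)
    p = k0 / d
    acc = np.zeros((d, d))
    for mask in itertools.product([0, 1], repeat=d):
        m = np.array(mask, dtype=float)
        prob = float(np.prod(np.where(m == 1.0, p, 1.0 - p)))
        xh = (d / k0) * m * x
        G = np.outer(xh, xh)
        # subtract the (1 - k0/d) fraction of the inflated diagonal
        acc += prob * (G - (1.0 - p) * np.diag(np.diag(G)))
    return acc
```

Without the subtraction, the diagonal of $\mathbb{E}[\hat{x}\hat{x}^T]$ is inflated by a factor $d/k_0$, which is exactly what the correction removes.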

### 2.3 Sketch of Proof

The main building block in proving Theorem 2 is Lemma 3. It shows that the sequence of solutions $\hat{w}_t$ converges to the optimal predictor $w^*$ based on which the signal is created. More accurately, ignoring all second order terms, it shows that $\|\hat{w}_t - w^*\|_1 = \widetilde{O}(1/\sqrt{t})$. In Lemma 4 we show that the same applies to the sparse approximation $\tilde{w}_t$. It follows that the difference between our prediction vector and the (almost) optimal one is bounded by $\widetilde{O}(1/\sqrt{t})$. Given this, a careful calculation of the difference of losses leads to a regret bound with respect to $w^*$. Specifically, an elementary analysis of the loss expression leads to the equality

$$\mathrm{Regret}_T(w^*)=\sum_{t=1}^T 2\eta_t\langle x_t,w^*-w_t\rangle+\left(\langle x_t,w^*-w_t\rangle\right)^2.$$

A bound on both summands can clearly be expressed in terms of the estimation errors $\|w^* - w_t\|_1$. The summand involving the noise terms $\eta_t$ requires a martingale concentration bound, while the squared summand is bounded directly. For both we obtain a bound of $O(\log T)$.

We are now left with two technicalities. The first is that $w^*$ is not necessarily the empirically optimal predictor. To this end we provide, in Lemma 16 in the appendix, a constant (independent of $T$) bound on the regret of $w^*$ compared to the empirical optimum. The second technicality is the fact that we do not solve for $\hat{w}_t$ in every round, but only at exponentially spaced rounds. This translates to a constant multiplicative factor in the bound, affecting only the constants in the $O(\cdot)$ terms.

###### Lemma 3 (Estimation Rates).

Assume that the matrix $X_t$ satisfies the RIP condition as stated above. Let $\hat{w}_{t+1}$ be the optimal solution of program (3). With probability at least $1 - \delta$,

$$\|\hat{w}_{t+1}-w^*\|_1\le C\cdot\sqrt{\frac{d}{k_0}\cdot\frac{k^2\log(d/\delta)}{t}}\left(\sigma+\frac{d}{k_0}\right).$$

Here $C$ is some universal constant and $\sigma$ is the standard deviation of the noise.

Note that $\hat{w}_{t+1}$ may not be sparse; it can have many non-zero coordinates that are small in absolute value. However, we take only the top $k$ coordinates of $\hat{w}_{t+1}$ in absolute value. Thanks to Lemma 4 below, we lose only a constant factor of $\sqrt{3}$ in the estimation error.

###### Lemma 4.

Let $\hat{w}$ be an arbitrary vector and let $w^*$ be a $k$-sparse vector. Let $\tilde{S}$ be the set of the top $k$ coordinates of $\hat{w}$ in absolute value. Then,

$$\left\|\hat{w}(\tilde{S})-w^*\right\|_2\le\sqrt{3}\,\|\hat{w}-w^*\|_2.$$
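The truncation step (4) and the guarantee of Lemma 4 are easy to exercise numerically; the following is our own check, not part of the paper's algorithm.

```python
import numpy as np

def top_k_truncate(w_hat, k):
    """Keep the k largest-magnitude coordinates of w_hat; zero the rest."""
    keep = np.argsort(-np.abs(w_hat))[:k]   # stable sort breaks ties by index
    out = np.zeros_like(w_hat)
    out[keep] = w_hat[keep]
    return out
```

On random instances with a planted $k$-sparse $w^*$ and a perturbed estimate $\hat{w}$, the $\sqrt{3}$ factor of Lemma 4 can be confirmed empirically.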

## 3 Agnostic Setting

In this section we focus on the agnostic setting, where we do not impose any distributional assumptions on the sequence. In this setting, there is no "true" sparse model; rather, the learner, with limited access to features, is competing with the best $k$-sparse model chosen in hindsight with full information.

As before, we do assume that $x_t$ and $y_t$ are bounded; without loss of generality, $\|x_t\|_\infty \le 1$ and $|y_t| \le 1$ for all $t$. Once again, without any regularity condition on the design matrix, Foster et al. (2016) have shown that achieving sub-linear regret is in general computationally hard for any constant sparsity, unless $\mathsf{NP} \subseteq \mathsf{BPP}$.

We give an efficient algorithm that achieves sub-linear regret under the assumption that the design matrix of any (sufficiently long) block of consecutive data points has bounded restricted condition number, which we define below:

###### Definition 5 (Restricted Condition Number).

Let $k$ be a sparsity parameter. The restricted condition number for sparsity $k$ of a matrix $X$ is defined as

$$\sup_{v,w:\ \|v\|=\|w\|=1,\ \|v\|_0\le k,\ \|w\|_0\le k}\frac{\|Xv\|}{\|Xw\|}.$$

It is easy to see that if a matrix satisfies RIP with parameters $(\epsilon, k)$, then its restricted condition number for sparsity $k$ is at most $\frac{1+\epsilon}{1-\epsilon}$. Thus, having a bounded restricted condition number is a weaker requirement than RIP.
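As with RIP, the restricted condition number of Definition 5 can be computed exactly by brute force for small matrices, since on a fixed size-$k$ support the ratio $\|Xv\|/\|v\|$ ranges over the singular values of the column submatrix (an illustrative sketch of ours, exponential in $d$):

```python
import itertools
import numpy as np

def restricted_condition_number(X, k):
    """sup ||Xv|| / ||Xw|| over unit-norm k-sparse v, w: the ratio of the
    largest sigma_max to the smallest sigma_min over size-k supports."""
    s_hi, s_lo = 0.0, np.inf
    for S in itertools.combinations(range(X.shape[1]), k):
        svals = np.linalg.svd(X[:, S], compute_uv=False)
        s_hi = max(s_hi, svals[0])
        s_lo = min(s_lo, svals[-1])
    return s_hi / s_lo
```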

We now define the Block Bounded Restricted Condition Number Property (BBRCNP):

###### Definition 6 (Block Bounded Restricted Condition Number Property).

Let $\kappa > 0$ and $k \in \mathbb{N}$. A sequence of feature vectors $x_1, x_2, \dots$ satisfies BBRCNP with parameters $(\kappa, k)$ if there is a constant $B_0$ such that for any sequence of consecutive time steps $\mathcal{T}$ with $|\mathcal{T}| \ge B_0$, the restricted condition number for sparsity $k$ of $X_{\mathcal{T}}$, the design matrix of the feature vectors $x_t$ for $t \in \mathcal{T}$, is at most $\kappa$.

Note that in the random design setting where the $x_t$, for $t \in \mathcal{T}$, are isotropic sub-Gaussian vectors, $|\mathcal{T}| = \widetilde{\Omega}(k)$ suffices to satisfy BBRCNP with high probability, where the $\widetilde{\Omega}$ notation hides a constant depending on $\kappa$.

We assume in this section that the sequence of feature vectors satisfies BBRCNP with parameters $(\kappa, k')$ for a sparsity level $k'$ to be specified in the course of the analysis.

### 3.1 Algorithm

The algorithm for the agnostic setting is of a distinct nature from the one for the realizable setting. It is motivated by the literature on maximization of submodular set functions (Natarajan, 1995; Streeter and Golovin, 2008; Boutsidis et al., 2015). Though the problem is NP-hard, the greedy algorithm for submodular maximization provides a provably good approximation ratio. Specifically, Streeter and Golovin (2008) considered online optimization of super/submodular set functions using an expert algorithm as a subroutine. Natarajan (1995) and Boutsidis et al. (2015) cast sparse linear regression as maximization of a weakly supermodular function. We introduce an algorithm that blends ideas from the aforementioned works to attack online sparse regression with limited features.

First, let us introduce the notion of a weakly supermodular function.

###### Definition 7.

For parameters $\alpha \ge 1$ and $k \in \mathbb{N}$, a set function $g: 2^{[d]} \to \mathbb{R}$ is $(\alpha, k)$-weakly supermodular if for any two sets $S \subseteq T \subseteq [d]$ with $|T| \le k$, the following two inequalities hold:

1. (monotonicity) $g(T) \le g(S)$, and

2. (approximately decreasing marginal gain)

$$g(S)-g(T)\le\alpha\sum_{i\in T\setminus S}\left[g(S)-g(S\cup\{i\})\right].$$

This definition is slightly stronger than the one in Boutsidis et al. (2015). We will show that sparse linear regression can be viewed as weakly supermodular minimization in the sense of Definition 7 once the design matrix has a bounded restricted condition number.

Now we outline the algorithm (see Algorithm 2). We divide the $T$ rounds into mini-batches of size $B$ each (so there are $T/B$ such batches). The $b$-th batch thus consists of the examples $(x_t, y_t)$ for $t \in T_b := \{(b-1)B+1, \dots, bB\}$. Within the $b$-th batch, our algorithm queries the same subset of features, of size at most $k_0$.

The algorithm consists of a few key steps. First, one can show that under BBRCNP, as long as $B$ is large enough, the loss within batch $b$ defines a weakly supermodular set function

$$g_b(S)=\frac{1}{B}\inf_{w\in\mathbb{R}^S}\sum_{t\in T_b}(y_t-\langle x_t,w\rangle)^2.$$
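The batch set function $g_b$ is just the in-batch least squares loss restricted to a support; a minimal sketch (our helper, assuming one batch given as arrays `X`, `y`):

```python
import numpy as np

def batch_loss(X, y, S):
    """g_b(S): average squared loss of the best linear predictor
    supported on the coordinate set S, over one mini-batch (X, y)."""
    if len(S) == 0:
        return float(np.mean(y ** 2))  # best predictor on no features is 0
    XS = X[:, sorted(S)]
    w, *_ = np.linalg.lstsq(XS, y, rcond=None)
    return float(np.mean((y - XS @ w) ** 2))
```

Note that adding features can never increase this loss, which is the monotonicity half of Definition 7.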

Therefore, we can reformulate the original online sparse regression problem as an online weakly supermodular minimization problem. For the latter problem, we develop an online greedy algorithm along the lines of Streeter and Golovin (2008). We employ $k_1$ budgeted experts algorithms (Amin et al., 2015), denoted BEXP, each with budget parameter $k_0/k_1$ (we assume, for convenience, that $k_0$ is divisible by $k_1$). The precise characteristics of BEXP are given in Theorem 8 (adapted from Theorem 2 in Amin et al. (2015)).

###### Theorem 8.

For the problem of prediction with expert advice, let there be $d$ experts, and let $m \le d$ be a budget parameter. In each prediction round $t$, the BEXP algorithm chooses an expert $j_t$ and a set of experts $U_t$ containing $j_t$ of size at most $m$, obtains as feedback the losses of all the experts in $U_t$, suffers the loss of expert $j_t$, and guarantees an expected regret bound of $O(\sqrt{dT\log(d)/m})$ over $T$ prediction rounds.

At the beginning of each mini-batch, the $k_1$ BEXP algorithms are run. Each BEXP algorithm outputs a set of coordinates of size at most $k_0/k_1$, as well as a special coordinate in that set. The union of all of these sets is then used as the set of features to query throughout the subsequent mini-batch. Within the mini-batch, the algorithm runs the standard Vovk-Azoury-Warmuth (VAW) algorithm for linear prediction with square loss, restricted to the set of special coordinates output by all the BEXP algorithms.

At the end of the mini-batch, every BEXP algorithm is provided, as feedback, carefully constructed losses for the coordinates it output. These losses ensure that the set of special coordinates chosen by the BEXP algorithms mimics the greedy algorithm for weakly supermodular minimization.

### 3.2 Main Result

In this section, we will show that Algorithm 2 achieves sublinear regret under BBRCNP.

###### Theorem 9.

Suppose the sequence of feature vectors satisfies BBRCNP with parameters $(\kappa, k')$ for a suitable sparsity level $k'$, and assume that $T$ is large enough. Then if Algorithm 2 is run with parameters $B$ and $k_1$ as specified above, its expected regret is at most $\widetilde{O}(T^{2/3})$, suppressing the dependence on the remaining problem parameters.

###### Proof.

The proof relies on a number of lemmas whose proofs can be found in the appendix. We begin with the connection between sparse linear regression, weakly supermodular functions, and RIP, formally stated in Lemma 10. This lemma is a direct consequence of Lemma 5 in Boutsidis et al. (2015).

###### Lemma 10.

Consider a sequence of examples $(x_t, y_t)$ for $t = 1, 2, \dots, B$, and let $X$ be the design matrix for the sequence. Consider the set function associated with least squares optimization:

$$g(S)=\inf_{w\in\mathbb{R}^S}\ \frac{1}{B}\sum_{t=1}^B(y_t-\langle x_t,w\rangle)^2.$$

Suppose the restricted condition number of $X$ for sparsity $k$ is bounded by $\kappa$. Then $g$ is $(\kappa^2, k)$-weakly supermodular.

Even though minimization of weakly supermodular functions is NP-hard, the greedy algorithm provides a good approximation, as shown in the next lemma.

###### Lemma 11.

Consider an $(\alpha, k)$-weakly supermodular set function $g$. Let $j^* \in \arg\min_{j \in [d]} g(\{j\})$ be a single coordinate with minimal loss. Then, for any subset $V$ of size at most $k$, we have

$$g(\{j^*\})-g(V)\le\left(1-\frac{1}{\alpha|V|}\right)\left[g(\emptyset)-g(V)\right].$$
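Lemma 11 analyzes the single step of a greedy procedure: repeatedly add the one coordinate that yields the smallest resulting loss. A sketch of that offline greedy loop for a generic set function (our illustration; the online algorithm replaces each greedy step with a BEXP instance):

```python
def greedy_minimize(g, d, k1):
    """Greedily build a set of k1 coordinates from [d]: at each step add
    the coordinate j minimizing g(S | {j}), the step Lemma 11 analyzes."""
    S = set()
    for _ in range(k1):
        j = min((j for j in range(d) if j not in S),
                key=lambda j: g(S | {j}))
        S.add(j)
    return S
```

Applying the lemma once per step shows that after $k_1$ steps the gap to the best size-$k$ set shrinks by a factor $(1 - \frac{1}{\alpha k})^{k_1}$, which is how the exponential term in the final regret bound arises.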

The BEXP algorithms essentially implement the greedy algorithm in an online fashion. Using the properties of the BEXP algorithm, we have the following regret guarantee:

###### Lemma 12.

Suppose the sequence of feature vectors satisfies BBRCNP. Then for any set $V$ of coordinates of size at most $k$, we have (with $V_b^{(k_1)}$ denoting the set of special coordinates chosen by the BEXP algorithms in batch $b$)

$$\mathbb{E}\left[\sum_{b=1}^{T/B}g_b\big(V_b^{(k_1)}\big)-g_b(V)\right]\le\sum_{b=1}^{T/B}\left(1-\frac{1}{\kappa^2|V|}\right)^{k_1}\left[g_b(\emptyset)-g_b(V)\right]+2\kappa^2 k\sqrt{\frac{d\,k_1\log(d)\,T}{k_0\,B}}.$$

Finally, within every mini-batch, the VAW algorithm guarantees the following regret bound, an immediate consequence of Theorem 11.8 in Cesa-Bianchi and Lugosi (2006):

###### Lemma 13.

Within every batch $b$, the VAW algorithm generates weight vectors $w_t$, $t \in T_b$, such that

$$\sum_{t\in T_b}(y_t-\langle x_t,w_t\rangle)^2-B\,g_b\big(V_b^{(k_1)}\big)\le O(k_1\log(B)).$$
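The VAW forecaster used inside each batch is a classical algorithm for online least squares (Cesa-Bianchi and Lugosi, 2006, Chapter 11). A compact sketch, stated in full dimension for simplicity (in Algorithm 2 it would be run on the special coordinates only); the defining twist is that the current feature vector is folded into the covariance before predicting:

```python
import numpy as np

class VAW:
    """Vovk-Azoury-Warmuth forecaster for online linear regression
    with square loss."""

    def __init__(self, d, reg=1.0):
        self.A = reg * np.eye(d)  # regularized covariance of seen features
        self.b = np.zeros(d)      # sum of y_s * x_s over past rounds

    def predict(self, x):
        # Fold the current x into A *before* predicting: this is what
        # distinguishes VAW from plain online ridge regression.
        self.A += np.outer(x, x)
        return float(x @ np.linalg.solve(self.A, self.b))

    def update(self, x, y):
        self.b += y * x
```

On a noiseless linear stream the cumulative squared loss stays bounded, consistent with the logarithmic regret guarantee cited in Lemma 13.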

We can now prove Theorem 9. Combining the bounds of Lemmas 12 and 13, we conclude that for any subset $V$ of coordinates of size at most $k$, we have

$$\mathbb{E}\left[\sum_{t=1}^T(y_t-\langle x_t,w_t\rangle)^2\right]\le\sum_{b=1}^{T/B}B\,g_b(V)+B\left(1-\frac{1}{\kappa^2|V|}\right)^{k_1}\left[g_b(\emptyset)-g_b(V)\right]+O\left(\kappa^2 k\sqrt{\frac{d\,k_1\log(d)\,BT}{k_0}}+\frac{T}{B}k_1\log(B)\right). \qquad (9)$$

Finally, note that

$$\sum_{b=1}^{T/B}B\,g_b(V)\le\inf_{w\in\mathbb{R}^V}\sum_{t=1}^T(y_t-\langle x_t,w\rangle)^2,$$

and

$$\sum_{b=1}^{T/B}B\left(1-\frac{1}{\kappa^2|V|}\right)^{k_1}\left[g_b(\emptyset)-g_b(V)\right]\le T\cdot\exp\left(-\frac{k_1}{\kappa^2 k}\right),$$

because $|V| \le k$ and $g_b(\emptyset) - g_b(V) \le 1$. Using these bounds in (9), and plugging in the specified values of $k_1$ and $B$, we get the stated regret bound. ∎

## 4 Conclusions and Future Work

In this paper, we gave computationally efficient algorithms for the online sparse linear regression problem under the assumption that the design matrices of the feature vectors satisfy RIP-type properties. Since the problem is hard without any assumptions, our work is the first to show that assumptions similar to the ones used for sparse recovery in the batch setting yield tractability in the online setting as well.

Several open questions remain in this line of work and will be the basis for future work. Is it possible to improve the regret bound in the agnostic setting? Can we give matching lower bounds on the regret in various settings? Is it possible to relax the RIP assumption on the design matrices and still have efficient algorithms? Some obvious weakenings of the RIP assumption we have made do not yield tractability. For example, simply assuming that the final matrix $X_T$ satisfies RIP, rather than every intermediate matrix $X_t$ for large enough $t$, is not sufficient; a simple tweak to the lower bound construction of Foster et al. (2016) shows this. The tweak consists of padding the construction with enough dummy examples which are well-conditioned enough to overcome the ill-conditioning of the original construction, so that RIP is satisfied by $X_T$. We note however that in the realizable setting, our analysis can be easily adapted to work under weaker conditions such as irrepresentability (Zhao and Yu, 2006; Javanmard and Montanari, 2013).

## References

• Amin et al. (2015) Kareem Amin, Satyen Kale, Gerald Tesauro, and Deepak S. Turaga. Budgeted prediction with expert advice. In AAAI, pages 2490–2496, 2015.
• Belloni et al. (2016) Alexandre Belloni, Mathieu Rosenbaum, and Alexandre B. Tsybakov. Linear and conic programming estimators in high dimensional errors-in-variables models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016. ISSN 1467-9868.
• Bickel et al. (2009) Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
• Boutsidis et al. (2015) Christos Boutsidis, Edo Liberty, and Maxim Sviridenko. Greedy minimization of weakly supermodular set functions. arXiv preprint arXiv:1502.06528, 2015.
• Candes and Tao (2007) Emmanuel Candes and Terence Tao. The Dantzig selector: statistical estimation when $p$ is much larger than $n$. The Annals of Statistics, pages 2313–2351, 2007.
• Candes and Tao (2005) Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203–4215, 2005.
• Cesa-Bianchi and Lugosi (2006) Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
• Cesa-Bianchi et al. (2011) Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, and Ohad Shamir. Efficient learning with partially observed attributes. Journal of Machine Learning Research, 12(Oct):2857–2878, 2011.
• Foster et al. (2015) Dean Foster, Howard Karloff, and Justin Thaler. Variable selection is hard. In COLT, pages 696–709, 2015.
• Foster et al. (2016) Dean Foster, Satyen Kale, and Howard Karloff. Online sparse linear regression. In COLT, 2016.
• Hazan and Kale (2014) Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15(1):2489–2512, 2014.
• Hazan and Koren (2012) Elad Hazan and Tomer Koren. Linear regression with limited observation. In ICML, 2012.
• Javanmard and Montanari (2013) Adel Javanmard and Andrea Montanari. Model selection for high-dimensional regression under the generalized irrepresentability condition. In NIPS, pages 3012–3020, 2013.
• Kale (2014) Satyen Kale. Open problem: Efficient online sparse regression. In COLT, pages 1299–1301, 2014.
• Kukliansky and Shamir (2015) Doron Kukliansky and Ohad Shamir. Attribute efficient linear regression with distribution-dependent sampling. In ICML, pages 153–161, 2015.
• Mendelson (2014) Shahar Mendelson. Learning without concentration. In COLT, pages 25–39, 2014.
• Natarajan (1995) Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM journal on computing, 24(2):227–234, 1995.
• Rosenbaum and Tsybakov (2010) Mathieu Rosenbaum and Alexandre B. Tsybakov. Sparse recovery under matrix uncertainty. The Annals of Statistics, 38(5):2620–2651, 2010.
• Streeter and Golovin (2008) Matthew J. Streeter and Daniel Golovin. An online algorithm for maximizing submodular functions. In NIPS, pages 1577–1584, 2008.
• Zhao and Yu (2006) Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine learning research, 7(Nov):2541–2563, 2006.
• Zolghadr et al. (2013) Navid Zolghadr, Gábor Bartók, Russell Greiner, András György, and Csaba Szepesvári. Online learning with costly features and labels. In NIPS, pages 1241–1249, 2013.

## Appendix A Proofs for Realizable Setting

###### Proof of Lemma 3.

Let $\Delta = \hat{w} - w^*$ be the difference between the true answer $w^*$ and the solution $\hat{w}$ to the optimization problem. Let $S$ be the support of $w^*$ and let $S^c$ be the complement of $S$; for a set of coordinates $A$, we write $\Delta^{(A)}$ for the vector that agrees with $\Delta$ on $A$ and is zero elsewhere. Consider the permutation $i_1, i_2, \ldots$ of $S^c$ for which $|\Delta_{i_1}| \ge |\Delta_{i_2}| \ge \cdots$. That is, the permutation dictated by the magnitude of the entries of $\Delta$ outside of $S$. We split $S^c$ into subsets of size $k$ according to this permutation: define $S_j$, for $j \ge 1$, as $S_j = \{i_{(j-1)k+1}, \ldots, i_{jk}\}$. For convenience we also denote by $S_{01}$ the set $S \cup S_1$.
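As a concrete illustration of this block construction, the following minimal sketch sorts the off-support coordinates by magnitude and chops them into blocks of size $k$ (the vector `delta`, support `S`, and block size `k` are made-up toy values, not from the paper):

```python
def block_partition(delta, support, k):
    """Split the coordinates outside `support` into blocks S_1, S_2, ...
    of size k, ordered by decreasing magnitude of the entries of delta."""
    outside = [i for i in range(len(delta)) if i not in support]
    outside.sort(key=lambda i: abs(delta[i]), reverse=True)
    return [outside[j:j + k] for j in range(0, len(outside), k)]

delta = [0.0, 0.9, -0.1, 0.5, 0.05, -0.7, 0.2]   # stands in for Delta
S = {0, 1}                                        # support of w*
blocks = block_partition(delta, S, k=2)
# blocks[0] is S_1: the two largest-magnitude coordinates outside S
```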

Now, consider the matrix $X_{S_{01}}$ whose columns are those of $X$ with indices in $S_{01}$. The Restricted Isometry Property of $X$ dictates that for any vector $z$,

$$(1-\epsilon)\|z\| \le \frac{1}{\sqrt{t}}\left\|X_{S_{01}} z\right\| \le (1+\epsilon)\|z\|.$$

Let $V$ be the subspace of dimension $|S_{01}|$ that is the image of the linear operator $X_{S_{01}}$, and let $P_V$ be the projection matrix onto that subspace. We have, for any vector $z$, that

$$(1-\epsilon)\|P_V z\| \le \frac{1}{\sqrt{t}}\left\|X_{S_{01}}^\top z\right\| \le (1+\epsilon)\|P_V z\|.$$

We apply this to $z = X\Delta$ and conclude that

$$\|P_V X\Delta\| \le \frac{1}{\sqrt{t}\,(1-\epsilon)}\left\|X_{S_{01}}^\top X\Delta\right\| \qquad (10)$$
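The projection form of the RIP used in (10) can be sanity-checked numerically. The sketch below (toy sizes, Gaussian design; all variable names are illustrative assumptions) computes $\epsilon$ from the singular values of $X_{S_{01}}/\sqrt{t}$ and verifies the sandwich inequality $(1-\epsilon)\|P_V z\| \le \|X_{S_{01}}^\top z\|/\sqrt{t} \le (1+\epsilon)\|P_V z\|$:

```python
import numpy as np

rng = np.random.default_rng(0)
t, m = 200, 6                        # t rows, m = |S_01| columns (toy sizes)
X_S01 = rng.standard_normal((t, m))

# epsilon for this submatrix: deviation of singular values of X_S01/sqrt(t) from 1
sv = np.linalg.svd(X_S01 / np.sqrt(t), compute_uv=False)
eps = max(1.0 - sv.min(), sv.max() - 1.0)

# P_V: orthogonal projection onto V, the column span of X_S01
Q, _ = np.linalg.qr(X_S01)
P_V = Q @ Q.T

# Check (1 - eps)||P_V z|| <= ||X_S01^T z|| / sqrt(t) <= (1 + eps)||P_V z||
z = rng.standard_normal(t)
lo = (1 - eps) * np.linalg.norm(P_V @ z)
mid = np.linalg.norm(X_S01.T @ z) / np.sqrt(t)
hi = (1 + eps) * np.linalg.norm(P_V @ z)
assert lo <= mid + 1e-9 and mid <= hi + 1e-9
```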

We continue to lower bound the quantity $\|P_V X\Delta\|$. We decompose $X\Delta$ as

$$P_V X\Delta = P_V X\Delta^{(S_{01})} + \sum_{j\ge 2} P_V X\Delta^{(S_j)} \qquad (11)$$

Now, according to the definition of $V$ we have that there exist vectors $c_j \in \mathbb{R}^{|S_{01}|}$ for which

$$P_V X\Delta^{(S_j)} = X_{S_{01}} c_j.$$

We now invoke Lemma 1.1 from Candes and Tao (2005), stating that for any two disjoint sets of columns $S', S''$ small enough that the Restricted Isometry Property applies to their union, it holds that

$$\forall\, c, c' \qquad \frac{1}{t}\left\langle X_{S'} c,\, X_{S''} c' \right\rangle \le (2\epsilon - \epsilon^2)\,\|c\|_2\,\|c'\|_2.$$

We apply this for $S' = S_{01}$, $S'' = S_j$, $c = c_j$, and $c'$ the restriction of $\Delta$ to $S_j$, and conclude that

$$\frac{1}{t}\left\|X_{S_{01}} c_j\right\|^2 = \frac{1}{t}\left\langle X_{S_{01}} c_j,\, X\Delta^{(S_j)}\right\rangle \le (2\epsilon - \epsilon^2)\,\|c_j\|_2 \left\|\Delta^{(S_j)}\right\|_2 \le \frac{2\epsilon - \epsilon^2}{(1-\epsilon)\sqrt{t}}\left\|X_{S_{01}} c_j\right\| \left\|\Delta^{(S_j)}\right\|_2,$$

where the last step uses the RIP lower bound on $X_{S_{01}}$. Dividing through by $\frac{1}{\sqrt{t}}\|X_{S_{01}} c_j\|$ and using $2\epsilon - \epsilon^2 \le 2\epsilon$, we get

$$\left\|P_V X\Delta^{(S_j)}\right\| \le \frac{2\epsilon\sqrt{t}}{1-\epsilon}\left\|\Delta^{(S_j)}\right\|. \qquad (12)$$
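The near-orthogonality of disjoint column blocks that drives (12) can itself be checked numerically. The sketch below (toy sizes; all names illustrative) verifies the slightly weaker bound $\frac{1}{t}\langle X_{S'}c, X_{S''}c'\rangle \le 2\epsilon\|c\|_2\|c'\|_2$, which follows from the RIP via the parallelogram identity and is the form the derivation ultimately uses:

```python
import numpy as np

rng = np.random.default_rng(1)
t = 300
X = rng.standard_normal((t, 8))
S1, S2 = list(range(4)), list(range(4, 8))     # two disjoint column blocks

# eps: RIP-style constant of the combined submatrix X_{S1 ∪ S2}
sv = np.linalg.svd(X / np.sqrt(t), compute_uv=False)
eps = max(1.0 - sv.min(), sv.max() - 1.0)

# theta = max over c, c' of <X_S1 c, X_S2 c'> / (t ||c|| ||c'||)
#       = top singular value of X_S1^T X_S2 / t
theta = np.linalg.svd(X[:, S1].T @ X[:, S2] / t, compute_uv=False).max()
assert theta <= 2 * eps + 1e-9                 # near-orthogonality holds
```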

Let us now bound the sum $\sum_{j\ge 2}\|\Delta^{(S_j)}\|$. By the definition of the sets $S_j$, any element $i \in S_{j+1}$ has the property $|\Delta_i| \le \frac{1}{k}\|\Delta^{(S_j)}\|_1$. Hence

$$\sum_{j\ge 2}\left\|\Delta^{(S_j)}\right\|_2 \le \frac{1}{\sqrt{k}}\sum_{j\ge 1}\left\|\Delta^{(S_j)}\right\|_1 = \frac{1}{\sqrt{k}}\left\|\Delta^{(S^c)}\right\|_1.$$

We now combine this inequality with Equations (10), (11) and (12):

$$\frac{1}{t}\left\|X_{S_{01}}^\top X\Delta\right\| \ge \frac{1-\epsilon}{\sqrt{t}}\left\|P_V X\Delta\right\| \ge \frac{1-\epsilon}{\sqrt{t}}\left\|X\Delta^{(S_{01})}\right\| - 2\epsilon\sum_{j\ge 2}\left\|\Delta^{(S_j)}\right\| \ge \frac{1-\epsilon}{\sqrt{t}}\left\|X\Delta^{(S_{01})}\right\| - \frac{2\epsilon}{\sqrt{k}}\left\|\Delta^{(S^c)}\right\|_1.$$

The third inequality holds since $|\Delta_i| \le \frac{1}{k}\|\Delta^{(S_j)}\|_1$ for any $i \in S_{j+1}$, hence $\sum_{j\ge 2}\|\Delta^{(S_j)}\| \le \frac{1}{\sqrt{k}}\|\Delta^{(S^c)}\|_1$. We continue to bound the expression by claiming that $\|\Delta^{(S^c)}\|_1 \le \|\Delta^{(S)}\|_1$. This holds since $w^* = 0$ in $S^c$, hence

$$\|w^*\|_1 = \left\|\hat{w} - \Delta^{(S^c)} - \Delta^{(S)}\right\|_1 \le \|\hat{w}\|_1 + \left(\left\|\Delta^{(S)}\right\|_1 - \left\|\Delta^{(S^c)}\right\|_1\right).$$

Now, the optimality of $\hat{w}$ implies $\|\hat{w}\|_1 \le \|w^*\|_1$, hence indeed $\|\Delta^{(S^c)}\|_1 \le \|\Delta^{(S)}\|_1$. Therefore

$$\left\|\Delta^{(S^c)}\right\|_1 \le \left\|\Delta^{(S)}\right\|_1 \le \sqrt{k}\left\|\Delta^{(S)}\right\|_2 \le \sqrt{k}\left\|\Delta^{(S_{01})}\right\|_2 \le \frac{\sqrt{k}}{(1-\epsilon)\sqrt{t}}\left\|X\Delta^{(S_{01})}\right\|.$$
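The norm comparisons in this chain are standard Cauchy–Schwarz steps: $\|v\|_1 \le \sqrt{k}\|v\|_2$ for a vector supported on $k$ coordinates, and later $\|z\|_2 \le \sqrt{2k}\|z\|_\infty$. A minimal pure-Python check with made-up values:

```python
import math
import random

random.seed(0)
k = 5
v = [random.uniform(-1, 1) for _ in range(k)]   # stands in for Delta^{(S)}

l1 = sum(abs(x) for x in v)
l2 = math.sqrt(sum(x * x for x in v))
linf = max(abs(x) for x in v)

assert l1 <= math.sqrt(k) * l2 + 1e-12    # ||v||_1 <= sqrt(k) ||v||_2
assert l2 <= math.sqrt(k) * linf + 1e-12  # ||v||_2 <= sqrt(k) ||v||_inf
```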

We continue the chain of inequalities:

$$\frac{1}{t}\left\|X_{S_{01}}^\top X\Delta\right\| \ge \frac{1-\epsilon}{\sqrt{t}}\left\|X\Delta^{(S_{01})}\right\| - \frac{2\epsilon}{\sqrt{k}}\left\|\Delta^{(S^c)}\right\|_1 \ge \left\|X\Delta^{(S_{01})}\right\|\left(\frac{1-\epsilon}{\sqrt{t}} - \frac{2\epsilon}{\sqrt{k}}\cdot\frac{\sqrt{k}}{(1-\epsilon)\sqrt{t}}\right) = \frac{(1-\epsilon)^2 - 2\epsilon}{(1-\epsilon)\sqrt{t}}\left\|X\Delta^{(S_{01})}\right\|.$$

Rearranging, we conclude that

$$\begin{aligned}
\left\|\Delta^{(S_{01})}\right\| &\le \frac{1}{(1-\epsilon)\sqrt{t}}\left\|X\Delta^{(S_{01})}\right\| && \text{(RIP of $X$)}\\
&\le \frac{1}{\left((1-\epsilon)^2 - 2\epsilon\right)t}\left\|X_{S_{01}}^\top X\Delta\right\| \\
&\le \frac{\sqrt{2k}}{(1-4\epsilon)\,t}\left\|X^\top X\Delta\right\|_\infty && \text{(since for any $z \in \mathbb{R}^{2k}$, $\|z\|_2 \le \sqrt{2k}\|z\|_\infty$)}\\
&\le C\sqrt{\frac{dk\log(d/\delta)}{t k_0}}\left(\sigma + \frac{d}{k_0}\|w^*\|_1\right) && \text{(Lemma 14 and $\epsilon < 1/5$)}
\end{aligned}$$

for some constant $C$. We continue our bound on $\|\Delta\|$ by showing that

$$\left\|\Delta^{(S_{01}^c)}\right\|_2^2 \stackrel{(i)}{\le} \left\|\Delta^{(S^c)}\right\|_1^2 \cdot \sum_{j\ge k+1}\frac{1}{j^2} \le \frac{1}{k}\left\|\Delta^{(S^c)}\right\|_1^2,$$

where $(i)$ holds because the $j$-th largest coordinate of $\Delta$ outside $S$ is at most $\frac{1}{j}\|\Delta^{(S^c)}\|_1$ in magnitude.
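The tail-sum bound $\sum_{j\ge k+1} 1/j^2 \le 1/k$ used here follows from comparing the sum with $\int_k^\infty x^{-2}\,dx = 1/k$; a quick numeric check (the choice $k = 7$ is arbitrary):

```python
# Truncated tail of sum_{j >= k+1} 1/j^2; the full tail is bounded by 1/k.
k = 7
tail = sum(1.0 / (j * j) for j in range(k + 1, 100000))
assert tail <= 1.0 / k
```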