# Active Learning for Accurate Estimation of Linear Models

###### Abstract

We explore the sequential decision-making problem where the goal is to estimate a number of linear models uniformly well, given a shared budget of random contexts independently sampled from a known distribution. For each incoming context, the decision-maker selects one of the linear models and receives an observation that is corrupted by the unknown noise level of that model. We present Trace-UCB, an adaptive allocation algorithm that learns the models’ noise levels while balancing contexts accordingly across them, and prove bounds for its simple regret both in expectation and with high probability. We extend the algorithm and its bounds to the high-dimensional setting, where the number of linear models times the dimension of the contexts exceeds the total budget of samples. Simulations with real data suggest that Trace-UCB is remarkably robust, outperforming a number of baselines even when its assumptions are violated.


## 1 Introduction

We study the problem faced by a decision-maker whose goal is to estimate a number of regression problems equally well (i.e., with a small prediction error for each of them), and who has to adaptively allocate a limited budget of samples to the problems in order to gather information and improve its estimates. Two aspects of the problem formulation are key and drive the algorithm design: 1) the observations $Y$ collected from each regression problem depend on side information (i.e., contexts $X \in \mathbb{R}^d$), and we model the relationship between $X$ and $Y$ in each problem $i$ as a linear function with unknown parameters $\beta_i \in \mathbb{R}^d$; and 2) the “hardness” of learning each parameter $\beta_i$ is unknown in advance and may vary across the problems. In particular, we assume that the observations are corrupted by noise levels that are problem-dependent and must be learned as well.

This scenario may arise in a number of different domains where a fixed experimentation budget (number of samples) should be allocated to different problems. Imagine a drug company that has developed several treatments for a particular form of disease. Now it is interested in having an accurate estimate of the performance of each of these treatments for a specific population of patients (e.g., at a particular geographical location). Given the budget allocated to this experiment, a number $n$ of patients can participate in the clinical trial. Volunteer patients arrive sequentially over time and they are represented by a context $X \in \mathbb{R}^d$ summarizing their profile. We model the health status of patient $X$ after being assigned to treatment $i$ by a scalar $Y$, which depends on the specific drug through a linear function with parameter $\beta_i$ (i.e., $Y \approx X^\top \beta_i$). The goal is to assign each incoming patient to a treatment in such a way that at the end of the trial, we have an accurate estimate for all $\beta_i$’s. This will allow us to reliably predict the expected health status of each new patient $X$ for any treatment $i$. Since the parameters $\beta_i$ and the noise levels are initially unknown, achieving this goal requires an adaptive allocation strategy for the $n$ patients. Note that while $n$ may be relatively small, as the ethical and financial costs of treating a patient are high, the distribution of the contexts $X$ (e.g., the biomarkers of cancer patients) can be precisely estimated in advance.

This setting is clearly related to the problem of pure exploration and active learning in multi-armed bandits (Antos et al., 2008), where the learner wants to estimate the mean of a finite set of arms by allocating a finite budget of $n$ pulls. Antos et al. (2008) first introduced this setting, where the objective is to minimize the largest mean square error (MSE) in estimating the value of each arm. While the optimal solution is trivially to allocate the pulls proportionally to the variance of the arms, when the variances are unknown an exploration-exploitation dilemma arises: the variance and the value of the arms must be estimated at the same time in order to allocate pulls where they are most needed (i.e., to arms with high variance). Antos et al. (2008) proposed a forcing algorithm where all arms are pulled a prescribed minimum number of times before pulls are allocated proportionally to the estimated variances. They derived bounds on the regret, which measures the difference between the MSEs of the learning algorithm and of an optimal allocation, showing that the regret decreases as $O(n^{-3/2})$. A similar result is obtained by Carpentier et al. (2011), who proposed two algorithms that use upper confidence bounds on the variance to estimate the MSE of each arm and select the arm with the largest MSE at each step. When the arms are embedded in $\mathbb{R}^d$ and their mean is a linear combination with an unknown parameter, the problem becomes an optimal experimental design problem (Pukelsheim, 2006), where the objective is to estimate the linear parameter and minimize the prediction error over all arms (see e.g., Wiens & Li 2014; Sabato & Munos 2014). In this paper, we consider an orthogonal extension to the original problem where a finite number of linear regression problems is available (i.e., the arms) and random contexts are observed at each time step. Similarly to the setting of Antos et al. (2008), we assume each problem is characterized by a noise with different variance and the objective is to return regularized least-squares (RLS) estimates with small prediction error (i.e., MSE). While we leverage the solution proposed by Carpentier et al. (2011) to deal with the unknown variances, in our setting the presence of random contexts makes the estimation problem considerably more difficult. In fact, the MSE in one specific regression problem is determined not only by the variance of the noise and the number of samples used to compute the RLS estimate, but also by the contexts observed over time.

Contributions. We propose Trace-UCB, an algorithm that simultaneously learns the “hardness” of each problem, allocates observations proportionally to these estimates, and balances contexts across problems. We derive performance bounds for Trace-UCB in expectation and with high probability, and compare the algorithm with several baselines. Trace-UCB performs remarkably well in scenarios where the dimension of the contexts or the number of instances is large compared to the total budget, motivating the study of the high-dimensional setting, whose analysis and performance bounds are reported in App. F of Riquelme et al. (2017a). Finally, we provide simulations with synthetic data that support our theoretical results, and with real data that demonstrate the robustness of our approach even when some of the assumptions do not hold.

## 2 Preliminaries

The problem. We consider $m$ linear regression problems, where each instance $i \in [m]$ is characterized by a parameter $\beta_i \in \mathbb{R}^d$ such that for any context $X \in \mathbb{R}^d$, a random observation $Y \in \mathbb{R}$ is obtained as

$$Y = X^\top \beta_i + \epsilon_i, \tag{1}$$

where the noise $\epsilon_i$ is an i.i.d. realization of a Gaussian distribution $\mathcal{N}(0, \sigma_i^2)$. We denote by $\sigma^2_{\max} = \max_i \sigma_i^2$ and by $\bar\sigma^2 = \frac{1}{m}\sum_{i} \sigma_i^2$ the largest and the average variance, respectively. We define a sequential decision-making problem over $n$ rounds, where at each round $t \in [n]$, the learning algorithm $\mathcal{A}$ receives a context $X_t$ drawn i.i.d. from $\mathcal{N}(0, \Sigma)$, selects an instance $I_t \in [m]$, and observes a random sample $Y_t$ according to (1). By the end of the experiment, a training set $D_n$ has been collected and all the $m$ linear regression problems are solved, each problem $i \in [m]$ with its own training set $D_{i,n}$ (i.e., the subset of $D_n$ containing the samples with $I_t = i$), and estimates $\hat\beta_{i,n}$ of the parameters are returned. For each $\hat\beta_{i,n}$, we measure its accuracy by the mean-squared error (MSE)

$$L_{i,n}(\hat\beta_{i,n}) = \mathbb{E}_X\big[(X^\top \beta_i - X^\top \hat\beta_{i,n})^2\big] = \|\beta_i - \hat\beta_{i,n}\|_\Sigma^2. \tag{2}$$

We evaluate the overall accuracy of the estimates returned by the algorithm as

$$L_n(\mathcal{A}) = \max_{i \in [m]} \mathbb{E}_{D_n}\big[L_{i,n}(\hat\beta_{i,n})\big], \tag{3}$$

where the expectation is w.r.t. the randomness of the contexts and observations used to compute $\hat\beta_{i,n}$. The objective is to design an algorithm $\mathcal{A}$ that minimizes the loss (3). This requires defining an allocation rule to select the instance $I_t$ at each step $t$ and the algorithm to compute the estimates $\hat\beta_{i,n}$, e.g., ordinary least-squares (OLS), regularized least-squares (RLS), or Lasso. In designing a learning algorithm, we rely on the following assumption.
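The two quantities in (2) and (3) are simple to evaluate once $\Sigma$ is known. The following is a minimal sketch, not the authors' code; the function names and the numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigma_norm_error(beta, beta_hat, Sigma):
    """Per-instance MSE (2): (beta - beta_hat)^T Sigma (beta - beta_hat)."""
    diff = np.asarray(beta, dtype=float) - np.asarray(beta_hat, dtype=float)
    return float(diff @ np.asarray(Sigma, dtype=float) @ diff)

def global_loss(expected_errors):
    """Loss (3): the largest expected per-instance error."""
    return float(np.max(expected_errors))

# Hypothetical two-dimensional example.
beta = np.array([1.0, -2.0])
beta_hat = np.array([0.5, -1.0])
Sigma = np.array([[2.0, 0.0], [0.0, 0.5]])
err = sigma_norm_error(beta, beta_hat, Sigma)
```

In practice the expectation in (3) is not available in closed form for an adaptive algorithm, so `expected_errors` would be an average of `sigma_norm_error` over repeated runs.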

###### Assumption 1.

The covariance matrix $\Sigma$ of the Gaussian distribution generating the contexts is known.

This is a standard assumption in active learning, since in this setting the learner has access to the input distribution and the main question is for which context she should ask for a label (Sabato & Munos, 2014; Riquelme et al., 2017b). Companies, like the drug company considered in the introduction, often own enough data to have an accurate estimate of the distribution of their customers (patients).

While in the rest of the paper we focus on $L_n(\mathcal{A})$, our algorithm and analysis can be easily extended to similar objectives, such as replacing the maximum in (3) with the average across all instances, i.e., $\frac{1}{m}\sum_{i=1}^m \mathbb{E}_{D_n}[L_{i,n}(\hat\beta_{i,n})]$, or using weighted errors, i.e., $\max_{i \in [m]} w_i\, \mathbb{E}_{D_n}[L_{i,n}(\hat\beta_{i,n})]$, by updating the score to focus on the estimated standard deviation and by including the weights in the score, respectively. Later in the paper, we also consider the case where the expectation in (3) is replaced by the high-probability error (see Eq. 17).

Optimal static allocation with OLS estimates. While the distribution of the contexts is fixed and does not depend on the instance $i$, the errors $L_{i,n}(\hat\beta_{i,n})$ directly depend on the variances $\sigma_i^2$ of the noise $\epsilon_i$. We define an optimal baseline obtained when the noise variances are known. In particular, we focus on a static allocation algorithm $\mathcal{A}_{\text{stat}}$ that selects each instance $i$ exactly $k_{i,n}$ times, independently of the context (this strategy can be obtained by simply selecting the first instance $k_{1,n}$ times, the second one $k_{2,n}$ times, and so on), and returns an estimate $\hat\beta_{i,n}$ computed by OLS as

$$\hat\beta_{i,n} = \big(X_{i,n}^\top X_{i,n}\big)^{-1} X_{i,n}^\top Y_{i,n}, \tag{4}$$

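where $X_{i,n} \in \mathbb{R}^{k_{i,n} \times d}$ is the matrix of (random) samples obtained at the end of the experiment, and $Y_{i,n} \in \mathbb{R}^{k_{i,n}}$ is its corresponding vector of observations. It is simple to show that the global error corresponding to $\mathcal{A}_{\text{stat}}$ is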

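$$L_n(\mathcal{A}_{\text{stat}}) = \max_{i \in [m]} \frac{\sigma_i^2}{k_{i,n}} \operatorname{Tr}\Big(\Sigma\, \mathbb{E}_{D_n}\big[\hat\Sigma_{i,n}^{-1}\big]\Big), \tag{5}$$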

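where $\hat\Sigma_{i,n} = X_{i,n}^\top X_{i,n} / k_{i,n}$ is the empirical covariance matrix of the contexts assigned to instance $i$. Since the algorithm does not change the allocation depending on the contexts and $X_t \sim \mathcal{N}(0, \Sigma)$, $\hat\Sigma_{i,n}^{-1}$ is distributed as an inverse-Wishart and we may write (5) as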

$$L_n(\mathcal{A}_{\text{stat}}) = \max_{i \in [m]} \frac{d\, \sigma_i^2}{k_{i,n} - d - 1}. \tag{6}$$

Thus, we derive the following proposition for the optimal static allocation algorithm $\mathcal{A}^*_{\text{stat}}$.

###### Proposition 1.

Given $m$ linear regression problems, each characterized by a parameter $\beta_i$, Gaussian noise with variance $\sigma_i^2$, and Gaussian contexts with covariance $\Sigma$, let $n > m(d+1)$. Then the optimal OLS static allocation algorithm $\mathcal{A}^*_{\text{stat}}$ selects each instance

$$k^*_{i,n} = \frac{\sigma_i^2}{\sum_j \sigma_j^2}\, n + (d+1)\Big(1 - \frac{\sigma_i^2}{\bar\sigma^2}\Big), \tag{7}$$

times (up to rounding effects), and incurs the global error

$$L_n^* = L_n(\mathcal{A}^*_{\text{stat}}) = \bar\sigma^2\, \frac{md}{n} + O\bigg(\bar\sigma^2 \Big(\frac{md}{n}\Big)^2\bigg). \tag{8}$$

###### Proof.

See Appendix A.1. (All the proofs can be found in the appendices of the extended version of the paper (Riquelme et al., 2017a).) ∎

Proposition 1 divides the problems into two types: those for which $\sigma_i^2 \geq \bar\sigma^2$ (wild instances) and those for which $\sigma_i^2 < \bar\sigma^2$ (mild instances). We see that for the first type, the second term in (7) is negative and the instance should be selected less frequently than in the context-free case (where the optimal allocation is given just by the first term). On the other hand, instances whose variance is below the mean variance should be pulled more often. In any case, we see that the correction to the context-free allocation (i.e., the second term) is constant, as it does not depend on $n$. Nonetheless, it does depend on $d$ and this suggests that in high-dimensional problems, it may significantly skew the optimal allocation.
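As a numerical illustration of Proposition 1, the sketch below evaluates the allocation (7) and the static loss (6) for hypothetical variances, budget, and dimension (all values and function names are assumptions made for this example, and rounding to integers is ignored). It shows that the allocation sums to $n$ and equalizes the per-instance errors.

```python
import numpy as np

def optimal_static_allocation(sigma2, n, d):
    """Allocation (7): context-free proportion plus a constant correction."""
    sigma2 = np.asarray(sigma2, dtype=float)
    mean_var = sigma2.mean()
    return sigma2 / sigma2.sum() * n + (d + 1) * (1.0 - sigma2 / mean_var)

def static_loss(sigma2, k, d):
    """Loss (6) of a static allocation k (requires k > d + 1 entrywise)."""
    sigma2 = np.asarray(sigma2, dtype=float)
    k = np.asarray(k, dtype=float)
    return float(np.max(d * sigma2 / (k - d - 1)))

# Hypothetical problem: m = 3 instances, budget n = 600, dimension d = 10.
sigma2 = np.array([0.5, 1.0, 2.5])
n, d = 600, 10
k_star = optimal_static_allocation(sigma2, n, d)   # [81.875, 152.75, 365.375]
per_instance = d * sigma2 / (k_star - d - 1)       # identical entries
loss_opt = static_loss(sigma2, k_star, d)
loss_uniform = static_loss(sigma2, np.full(3, n / 3), d)
```

Here the mild instance (variance 0.5) receives about 6.9 more samples than its context-free share of 75, while the wild instance (variance 2.5) receives about 9.6 fewer than 375.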

While $\mathcal{A}^*_{\text{stat}}$ effectively minimizes the prediction loss $L_n$, it cannot be implemented in practice since the optimal allocation $k^*_{i,n}$ requires the variances $\sigma_i^2$ to be known at the beginning of the experiment. As a result, we need to devise a learning algorithm $\mathcal{A}$ whose performance approaches $L_n^*$ as $n$ increases. More formally, we define the regret of $\mathcal{A}$ as

$$R_n(\mathcal{A}) = L_n(\mathcal{A}) - L_n(\mathcal{A}^*_{\text{stat}}) = L_n(\mathcal{A}) - L_n^*, \tag{9}$$

and we expect $R_n(\mathcal{A}) = o(1/n)$. In fact, any allocation strategy that selects each instance a linear number of times (e.g., uniform sampling) achieves a loss $L_n = O(md/n)$, and thus, a regret of order $O(md/n)$. However, we expect that the loss of an effective learning algorithm decreases not just at the same rate as $L_n^*$ but also with the very same constant, thus implying a regret that decreases faster than $O(1/n)$.
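As a worked illustration of the uniform-sampling case, one can plug the allocation $k_{i,n} = n/m$ into (6), treating $n/m$ as an integer and assuming $n > m(d+1)$. The name $\mathcal{A}_{\text{unif}}$ is introduced here only for this example.

$$L_n(\mathcal{A}_{\text{unif}}) = \max_{i \in [m]} \frac{d\,\sigma_i^2}{n/m - d - 1} = \sigma^2_{\max}\,\frac{md}{n} \cdot \frac{1}{1 - m(d+1)/n} = \sigma^2_{\max}\,\frac{md}{n} + O\bigg(\sigma^2_{\max}\Big(\frac{md}{n}\Big)^2\bigg).$$

Subtracting (8) gives

$$R_n(\mathcal{A}_{\text{unif}}) = \big(\sigma^2_{\max} - \bar\sigma^2\big)\frac{md}{n} + O\bigg(\sigma^2_{\max}\Big(\frac{md}{n}\Big)^2\bigg),$$

so the leading term vanishes only when all variances are equal. Otherwise the regret of uniform sampling is of order $1/n$, which is the rate a learning algorithm should improve upon.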

## 3 The Trace-UCB Algorithm

In this section, we present and analyze an algorithm of the form discussed at the end of Section 2, which we call Trace-UCB, whose pseudocode is in Algorithm 1.

The regularization parameter $\lambda$ is provided to the algorithm as input, although in practice one could set $\lambda$ independently for each arm using cross-validation.

Intuition. Equation (6) suggests that while the parameters of the context distribution, particularly its covariance $\Sigma$, do not impact the prediction error, the noise variances play the most important role in the loss of each problem instance. This is in fact confirmed by the optimal allocation $k^*_{i,n}$ in (7), where only the variances $\sigma_i^2$ appear. This evidence suggests that an algorithm similar to GAFS-MAX (Antos et al., 2008) or CH-AS (Carpentier et al., 2011), which were designed for the context-free case (i.e., each instance is associated with an expected value and not a linear function), would be effective in this setting as well. Nonetheless, (6) holds only for static allocation algorithms that completely ignore the context and the history when deciding which instance $I_t$ to choose at time $t$. On the other hand, adaptive learning algorithms create a strong correlation between the dataset $D_{t-1}$ collected so far, the current context $X_t$, and the decision $I_t$. As a result, the sample matrix $X_{i,n}$ is no longer a random variable independent of $\mathcal{A}$, and using (6) to design a learning algorithm is not convenient, since the impact of the contexts on the error is completely overlooked. Unfortunately, in general, it is very difficult to study the potential correlation between the contexts $X_{i,t}$, the intermediate estimates $\hat\beta_{i,t}$, and the most suitable choice $I_t$. However, in the next lemma, we show that if at each step $t$, we select $I_t$ as a function of $D_{t-1}$, and not $X_t$, we may still recover an expression for the final loss that we can use as a basis for the construction of an effective learning algorithm.

###### Lemma 2.

Let $\mathcal{A}$ be a learning algorithm that selects the instances $I_t$ as a function of the previous history, i.e., $D_{t-1}$, and computes estimates $\hat\beta_{i,n}$ using OLS. Then, its loss after $n$ steps can be expressed as

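$$L_n(\mathcal{A}) = \max_{i \in [m]} \mathbb{E}_{D_n}\bigg[\frac{\sigma_i^2}{k_{i,n}} \operatorname{Tr}\big(\Sigma\, \hat\Sigma_{i,n}^{-1}\big)\bigg], \tag{10}$$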

where $k_{i,n} = \sum_{t=1}^n \mathbb{1}\{I_t = i\}$ and $\hat\Sigma_{i,n} = X_{i,n}^\top X_{i,n} / k_{i,n}$.

###### Proof.

See Appendix B. ∎

#### Remark 1 (assumptions).

We assume noise and contexts are Gaussian. The noise Gaussianity is crucial for the estimates of the parameter $\hat\beta_{i,t}$ and variance $\hat\sigma^2_{i,t}$ to be independent of each other, for each instance $i$ and time $t$ (we actually need and derive a stronger result in Lemma 9, see Appendix B). This is key in proving Lemma 2, as it allows us to derive a closed-form expression for the loss function which holds under our algorithm, and is written in terms of the number of pulls and the trace of the inverse empirical covariance matrix. Note that $\hat\beta_{i,t}$ drives our loss, while $\hat\sigma^2_{i,t}$ drives our decisions. One way to remove this assumption is by defining and directly optimizing a surrogate loss equal to (10) instead of (3). On the other hand, the Gaussianity of contexts leads to the whitened inverse covariance estimate being distributed as an inverse Wishart. As there is a convenient closed formula for its mean, we can find the exact optimal static allocation $k^*_{i,n}$ in Proposition 1, see (7). In general, for sub-Gaussian contexts, no such closed formula for the trace is available. However, as long as the optimal allocation has no second order terms for , it is possible to derive the same regret rate results that we prove later on for Trace-UCB.

Equation (10) makes it explicit that the prediction error comes from two different sources. The first one is the noise in the measurements $Y$, whose impact is controlled by the unknown variances $\sigma_i^2$. Clearly, the larger $\sigma_i^2$ is, the more observations are required to achieve the desired accuracy. At the same time, the diversity of contexts across instances also impacts the overall prediction error. This is very intuitive, since it would be a terrible idea for the drug company discussed in the introduction to estimate the parameters of a drug by providing the treatment only to a hundred almost identical patients. We say contexts are balanced when $\hat\Sigma_{i,n}$ is well conditioned. Therefore, a good algorithm should take care of both aspects.

There are two extreme scenarios regarding the contributions of the two sources of error. 1) If the number of contexts $n$ is relatively large, since the context distribution is fixed, one can expect that contexts allocated to each instance eventually become balanced (i.e., Trace-UCB does not bias the distribution of the contexts). In this case, it is the difference in the $\sigma_i^2$’s that drives the number of times each instance is selected. 2) When the dimension $d$ or the number of arms $m$ is large w.r.t. $n$, balancing contexts becomes critical, and can play an important role in the final prediction error, whereas the $\sigma_i^2$’s are less relevant in this scenario. While a learning algorithm cannot deliberately choose a specific context (i.e., $X_t$ is a random variable), we may need to favor instances in which the contexts are poorly balanced and their prediction error is large, despite the fact that they might have small noise variances.

Algorithm. Trace-UCB is designed as a combination of the upper-confidence-bound strategy used in CH-AS (Carpentier et al., 2011) and the loss in (10), so as to obtain a learning algorithm capable of allocating according to the estimated variances and at the same time balancing the error generated by context mismatch. We recall that all the quantities computed at every step $t$ of the algorithm are indexed at the beginning and end of the step by $t-1$ (e.g., $\hat\sigma^2_{i,t-1}$) and $t$ (e.g., $\hat\beta_{i,t}$), respectively. At the end of each step $t$, Trace-UCB first computes an OLS estimate $\hat\beta_{i,t}$, and then uses it to estimate the variance $\sigma_i^2$ as

$$\hat\sigma^2_{i,t} = \frac{1}{k_{i,t} - d}\, \big\| Y_{i,t} - X_{i,t}^\top \hat\beta_{i,t} \big\|^2,$$

which is the average squared deviation of the predictions based on $\hat\beta_{i,t}$. We rely on the following concentration inequality for the variance estimate of linear regression with Gaussian noise, whose proof is reported in Appendix C.1.
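The variance estimate above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name is hypothetical, and the code assumes the contexts are stored as the rows of a $k \times d$ array, so predictions are computed as `X @ beta_hat`. Dividing the residual sum of squares by $k - d$ rather than $k$ makes the estimate unbiased under the Gaussian linear model.

```python
import numpy as np

def ols_and_variance(X, Y):
    """OLS fit and noise-variance estimate from k samples in dimension d.

    X has shape (k, d) with one context per row (layout assumed here),
    Y has shape (k,). Requires k > d.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    k, d = X.shape
    if k <= d:
        raise ValueError("need more samples than dimensions (k > d)")
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    residual = Y - X @ beta_hat
    sigma2_hat = float(residual @ residual) / (k - d)
    return beta_hat, sigma2_hat

# Hypothetical synthetic check: Gaussian contexts, noise variance 4.
rng = np.random.default_rng(0)
k, d, sigma = 5000, 5, 2.0
beta = np.arange(1.0, d + 1.0)
X = rng.standard_normal((k, d))
Y = X @ beta + sigma * rng.standard_normal(k)
beta_hat, sigma2_hat = ols_and_variance(X, Y)
```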

###### Proposition 3.

Let the number of pulls and . If , then for any instance and step , with probability at least , we have

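$$\big|\hat\sigma^2_{i,t} - \sigma_i^2\big| \leq \Delta_{i,t} \triangleq R \sqrt{\frac{64}{k_{i,t} - d} \Big(\log \frac{2mn}{\delta}\Big)^2}. \tag{11}$$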

Given (11), we can construct an upper-bound on the prediction error of any instance $i$ and time step $t$ as

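$$s_{i,t-1} = \frac{\hat\sigma^2_{i,t-1} + \Delta_{i,t-1}}{k_{i,t-1}} \operatorname{Tr}\big(\Sigma\, \hat\Sigma_{i,t-1}^{-1}\big), \tag{12}$$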

and then simply select the instance which maximizes this score, i.e., $I_t = \arg\max_{i \in [m]} s_{i,t-1}$. Intuitively, Trace-UCB favors problems where the prediction error is potentially large, either because of a large noise variance or because of significant unbalance in the observed contexts w.r.t. the target distribution with covariance $\Sigma$. A subtle but critical aspect of Trace-UCB is that by ignoring the current context $X_t$ (but using all the past samples) when choosing $I_t$, the distribution of the contexts allocated to each instance stays untouched and the second term in the score $s_{i,t-1}$, i.e., $\operatorname{Tr}(\Sigma \hat\Sigma_{i,t-1}^{-1})$, naturally tends to $d$ as more and more (random) contexts are allocated to instance $i$. This is shown by Proposition 4 whose proof is in Appendix C.2.
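The selection rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the function names are hypothetical, the confidence widths $\Delta_{i,t-1}$ are passed in as inputs rather than computed from (11), and the empirical covariance is taken to be $X^\top X / k$ with one context per row. Note that the current context never enters the computation.

```python
import numpy as np

def trace_ucb_scores(sigma2_hat, delta, X_list, Sigma):
    """Score (variance UCB / k) * Tr(Sigma Sigma_hat^{-1}) for each instance.

    X_list[i] has shape (k_i, d) and holds the contexts already
    allocated to instance i (one per row).
    """
    scores = np.empty(len(X_list))
    for i, X in enumerate(X_list):
        k = X.shape[0]
        Sigma_hat = X.T @ X / k
        # Tr(Sigma_hat^{-1} Sigma) equals Tr(Sigma Sigma_hat^{-1}).
        trace_term = np.trace(np.linalg.solve(Sigma_hat, Sigma))
        scores[i] = (sigma2_hat[i] + delta[i]) / k * trace_term
    return scores

def select_instance(sigma2_hat, delta, X_list, Sigma):
    """Pick the instance with the largest score (ties go to the lowest index)."""
    return int(np.argmax(trace_ucb_scores(sigma2_hat, delta, X_list, Sigma)))

# Hypothetical example: two instances with identical, perfectly balanced contexts.
X = np.sqrt(2.0) * np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
scores = trace_ucb_scores([1.0, 3.0], [0.5, 0.5], [X, X], np.eye(2))
chosen = select_instance([1.0, 3.0], [0.5, 0.5], [X, X], np.eye(2))
```

With balanced contexts the trace term equals $d = 2$ for both instances, so the choice is driven by the variance upper bounds alone.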

###### Proposition 4.

Force the number of samples . If , for any and step with probability at least , we have

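$$\bigg(1 - C_{\mathrm{Tr}} \sqrt{\frac{d}{k_{i,t}}}\bigg)^2 \leq \frac{\operatorname{Tr}\big(\Sigma\, \hat\Sigma_{i,t}^{-1}\big)}{d} \leq \bigg(1 + 2\, C_{\mathrm{Tr}} \sqrt{\frac{d}{k_{i,t}}}\bigg)^2,$$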

with .

While Proposition 4 shows that the error term due to context mismatch tends to the constant $d$ for all instances $i$ as the number of samples tends to infinity, when $n$ is small w.r.t. $d$ and $m$, correcting for the context mismatch may significantly improve the accuracy of the estimates returned by the algorithm. Finally, note that while Trace-UCB uses OLS to compute the estimates $\hat\beta_{i,t}$, it computes its returned parameters $\hat\beta^\lambda_i$ by ridge regression (RLS) with regularization parameter $\lambda$ as

$$\hat\beta^\lambda_i = \big(X_{i,n}^\top X_{i,n} + \lambda I\big)^{-1} X_{i,n}^\top Y_{i,n}. \tag{13}$$

As we will discuss later, using RLS makes the algorithm more robust and is crucial in obtaining regret bounds both in expectation and high probability.
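A minimal sketch of the ridge estimate (13) is given below; the function name and the toy data are hypothetical. The example uses fewer samples than dimensions, a case where $X^\top X$ is singular and OLS is not defined, to illustrate one reason the regularized estimate is more robust.

```python
import numpy as np

def ridge_estimate(X, Y, lam):
    """Ridge (RLS) estimate: solve (X^T X + lam * I) beta = X^T Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# k = 2 samples in d = 3 dimensions: X^T X is singular,
# but the regularized system is still solvable for lam > 0.
X = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
Y = np.array([2.0, 4.0])
beta_ridge = ridge_estimate(X, Y, lam=1.0)
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which is the usual numerically safer choice.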

Performance Analysis. Before proving a regret bound for Trace-UCB, we report an intermediate result (proof in App. D.1) that shows that Trace-UCB behaves similarly to the optimal static allocation.

###### Theorem 5.

Let . With probability at least , the total number of contexts that Trace-UCB allocates to each problem instance after rounds satisfies

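$$k_{i,n} \geq k^*_{i,n} - \frac{C_\Delta + 8\, C_{\mathrm{Tr}}}{\sigma^2_{\min}} \sqrt{\frac{nd}{\lambda_{\min}}} - \Omega\big(n^{1/4}\big), \tag{14}$$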

where is known by the algorithm, and we defined and .

We now report our regret bound for the Trace-UCB algorithm. The proof of Theorem 6 is in Appendix D.2.

###### Theorem 6.

The regret of the Trace-UCB algorithm, i.e., the difference between its loss and the loss of optimal static allocation (see Eq. (8)), is upper-bounded by

$$L_n(\mathcal{A}) - L_n^* \leq O\bigg(\frac{1}{\sigma^2_{\min}} \Big(\frac{d}{\lambda_{\min}\, n}\Big)^{3/2}\bigg). \tag{15}$$

Eq. (15) shows that the regret decreases as $O(n^{-3/2})$, as expected. This is consistent with the context-free results (Antos et al., 2008; Carpentier et al., 2011), where the regret decreases as $n^{-3/2}$, which is conjectured to be optimal. However, it is important to note that in the contextual case, the numerator also includes the dimensionality $d$. Thus, when $n \gg d$, the regret will be small, and it will be larger when $n \approx d$. This motivates studying the high-dimensional setting (App. F). Eq. (15) also indicates that the regret depends on a problem-dependent constant $1/\lambda_{\min}$, which measures the complexity of the problem. Note that when , we have , but could be much larger when .

#### Remark 2.

We introduce a baseline motivated by the context-free problem. At round $t$, Var-UCB selects the instance that maximizes the score (note that Var-UCB is similar to both the CH-AS and B-AS algorithms in Carpentier et al. (2011))

$$s'_{i,t-1} = \frac{\hat\sigma^2_{i,t-1} + \Delta_{i,t-1}}{k_{i,t-1}}. \tag{16}$$

The only difference with the score used by Trace-UCB is the lack of the trace term in (12). Moreover, the regret of this algorithm has a similar rate in terms of $n$ and $d$ as that of Trace-UCB reported in Theorem 6. However, the simulations of Sect. 4 show that the regret of Var-UCB is actually much higher than that of Trace-UCB, especially when $dm$ is close to $n$. Intuitively, when $dm$ is close to $n$, balancing contexts becomes critical, and Var-UCB suffers because its score does not explicitly take them into account.
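The effect of the missing trace term can be seen on a toy example. In the sketch below (hypothetical names and values, with the empirical covariance taken as $X^\top X / k$ and the confidence width supplied as an input), two instances have the same variance estimate and the same number of samples, but one has nearly collinear contexts. The Var-UCB score cannot tell them apart, while the trace-weighted score strongly favors the poorly balanced instance.

```python
import numpy as np

def var_ucb_score(sigma2_hat, delta, k):
    """Context-free score: variance upper bound divided by the pull count."""
    return (sigma2_hat + delta) / k

def trace_ucb_score(sigma2_hat, delta, X, Sigma):
    """Same score, multiplied by Tr(Sigma Sigma_hat^{-1})."""
    k = X.shape[0]
    Sigma_hat = X.T @ X / k
    return (sigma2_hat + delta) / k * np.trace(np.linalg.solve(Sigma_hat, Sigma))

Sigma = np.eye(2)
# Balanced contexts: empirical covariance equals the identity.
X_balanced = np.sqrt(2.0) * np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
# Skewed contexts: almost no variation along the second coordinate.
X_skewed = np.array([[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [0.0, 0.2]])
sigma2_hat, delta = 1.0, 0.5  # hypothetical values shared by both instances

var_scores = [var_ucb_score(sigma2_hat, delta, X.shape[0])
              for X in (X_balanced, X_skewed)]
trace_scores = [trace_ucb_score(sigma2_hat, delta, X, Sigma)
                for X in (X_balanced, X_skewed)]
```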

Sketch of the proof of Theorem 6. The proof is divided into three parts. 1) We show that the behavior of the ridge loss of Trace-UCB is similar to that reported in Lemma 2 for algorithms that rely on OLS; see Lemma 19 in Appendix E. The independence of the $\hat\beta_{i,t}$ and $\hat\sigma^2_{i,t}$ estimates is again essential (see Remark 1). Although the loss of Trace-UCB depends on the ridge estimate of the parameters $\hat\beta^\lambda_i$, the decisions made by the algorithm at each round only depend on the variance estimates and observed contexts. 2) We follow the ideas in Carpentier et al. (2011) to lower-bound the total number of pulls $k_{i,n}$ for each $i \in [m]$ under a good event (see Theorem 5 and its proof in Appendix D.1). 3) We finally use the ridge regularization to bound the impact of those cases outside the good event, and combine everything in Appendix D.2.

The regret bound of Theorem 6 shows that the largest expected loss across the problem instances incurred by Trace-UCB quickly approaches the loss of the optimal static allocation algorithm (which knows the true noise variances). While $L_n(\mathcal{A})$ measures the worst expected loss, in any specific realization of the algorithm, one of the instances may be very poorly estimated. As a result, it would also be desirable to obtain guarantees for the (random) maximum loss

$$\tilde L_n(\mathcal{A}) = \max_{i \in [m]} \|\beta_i - \hat\beta_{i,n}\|_\Sigma^2. \tag{17}$$

In particular, we are able to prove the following high-probability bound on $\tilde L_n$ for Trace-UCB.

###### Theorem 7.

Let , and assume for all , for some . With probability at least ,

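$$\tilde L_n \leq \sum_{j=1}^m \frac{\sigma_j^2}{n} \Big(d + 2 \log \frac{3m}{\delta}\Big) + O\bigg(\frac{1}{\sigma^2_{\min}} \Big(\frac{d}{n\, \lambda_{\min}}\Big)^{3/2}\bigg). \tag{18}$$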

Note that the first term in (18) corresponds to the first term of the loss for the optimal static allocation, and the second term is, again, a deviation. However, in this case, the guarantees hold simultaneously for all the instances.

Sketch of the proof of Theorem 7. In the proof we slightly modify the confidence ellipsoids for the $\hat\beta_{i,t}$’s, which are based on self-normalized martingales and derived in Abbasi-Yadkori et al. (2011); see Thm. 13 in App. C. By means of the confidence ellipsoids we control the loss in (17). Their radii depend on the number of samples per instance, and we rely on a high-probability event to compute a lower bound on the number of samples. In addition, we need to make sure the mean norm of the contexts will not be too large (see Corollary 15 in App. C). Finally, we combine the lower bound on $k_{i,n}$ with the confidence ellipsoids to conclude the desired high-probability guarantees in Thm. 7.

High-Dimensional Setting. High-dimensional linear models are quite common in practice, motivating the study of the $n < dm$ case, where the algorithms discussed so far break down. We propose Sparse-Trace-UCB in Appendix F, an extension of Trace-UCB that assumes and takes advantage of joint sparsity across the linear functions. The algorithm has two stages: first, an approximate support is recovered, and then Trace-UCB is applied to the induced lower-dimensional space. We discuss and extend our high-probability guarantees to Sparse-Trace-UCB under suitable standard assumptions in Appendix F.

## 4 Simulations

In this section, we provide empirical evidence to support our theoretical results. We consider both synthetic and real-world problems, and compare the performance (in terms of normalized MSE) of Trace-UCB to uniform sampling, optimal static allocation (which requires the knowledge of noise variances), and the context-free algorithm Var-UCB (see Remark 2). We do not compare to GFSP-MAX and GAFS-MAX (Antos et al., 2008) since they are outperformed by CH-AS (Carpentier et al., 2011), and Var-UCB is the same as CH-AS except that we use the concentration inequality in Prop. 3, since we are estimating the variance from a regression problem using OLS.

First, we use synthetic data to ensure that all the assumptions of our model are satisfied, namely we deal with linear regression models with Gaussian context and noise. We set the number of problem instances to and consider two scenarios: one in which all the noise variances are equal to and one where they are not equal, and . In the latter case, . We study the impact of (independently) increasing dimension and horizon on the performance, while keeping all other parameters fixed. Second, we consider real-world datasets in which the underlying model is non-linear and the contexts are not Gaussian, to observe how Trace-UCB behaves (relative to the baselines) in settings where its main underlying assumptions are violated.

Synthetic Data. In Figures 1(a,b), we display the results for fixed horizon and increasing dimension . For each value of , we run simulations and report the median of the maximum error across the instances for each simulation. In Fig. 1(a), where ’s are equal, uniform sampling and optimal static allocation execute the same allocation since there is no difference in the expected losses of different instances. Nonetheless we notice that Var-UCB suffers from poor estimation as soon as increases, while Trace-UCB is competitive with the optimal performance. This difference in performance can be explained by the fact that Var-UCB does not control for contextual balance, which becomes a dominant factor in the loss of a learning strategy for problems of high dimensionality. In Fig. 1(b), in which ’s are different, uniform sampling is no longer optimal but even in this case Var-UCB performs better than uniform sampling only for small , where it is more important to control for the ’s. For larger dimensions, balancing uniformly the contexts eventually becomes a better strategy, and uniform sampling outperforms Var-UCB. In this case too, Trace-UCB is competitive with the optimal static allocation even for large , successfully balancing both noise variance and contextual error.

Next, we study the performance of the algorithms w.r.t. . We report two different losses, one in expectation (3) and one in high probability (17), corresponding to the results we proved in Theorems 6 and 7, respectively. In order to approximate the loss in (3) (Figures 1(c,d)) we run simulations, compute the average prediction error for each instance , and finally report the maximum mean error across the instances. On the other hand, we estimate the loss in (17) (Figures 1(e,f)) by running simulations, taking the maximum prediction error across the instances for each simulation, and finally reporting their median.
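The two aggregation procedures described above differ only in the order of the operations. As a minimal sketch (the function names and the toy numbers are hypothetical), given an array of prediction errors with one row per simulation and one column per instance:

```python
import numpy as np

def expected_loss_estimate(errors):
    """Approximate (3): average over simulations, then max over instances."""
    errors = np.asarray(errors, dtype=float)
    return float(errors.mean(axis=0).max())

def high_prob_loss_estimate(errors):
    """Approximate (17): max over instances per simulation, then the median."""
    errors = np.asarray(errors, dtype=float)
    return float(np.median(errors.max(axis=1)))

# Toy example: 3 simulations (rows), 2 instances (columns).
errors = np.array([[1.0, 4.0],
                   [3.0, 2.0],
                   [5.0, 0.0]])
loss_expectation = expected_loss_estimate(errors)   # column means [3, 2] -> 3.0
loss_high_prob = high_prob_loss_estimate(errors)    # row maxima [4, 3, 5] -> 4.0
```

The second quantity is never smaller than what a single instance contributes in a typical run, which is why it is sensitive to occasional catastrophic errors on any one instance.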

In Figures 1(c, d), we display the loss for fixed dimension and horizon from to . In Figure 1(c), Trace-UCB performs similarly to the optimal static allocation, whereas Var-UCB performs significantly worse, ranging from 25% to 50% higher errors than Trace-UCB, due to some catastrophic errors arising from unlucky contextual realizations for an instance. In Fig. 1(d), as the number of contexts grows, uniform sampling’s simple context balancing approach is enough to perform as well as Var-UCB that again heavily suffers from large mistakes. In both figures, Trace-UCB smoothly learns the ’s and outperforms uniform sampling and Var-UCB. Its performance is comparable to that of the optimal static allocation, especially in the case of equal variances in Fig. 1(c).

In Figure 1(e), Trace-UCB learns and properly balances observations extremely fast and obtains an almost optimal performance. Similarly to Figures 1(a,c), Var-UCB struggles when variances are almost equal, mainly because it gets confused by random deviations in the variance estimates $\hat\sigma_i^2$, while overlooking potential and harmful context imbalances. Note that even when (rightmost point), its median error is still higher than Trace-UCB’s. In Fig. 1(f), as expected, uniform sampling performs poorly, due to the mismatch in variances, and only outperforms Var-UCB for small horizons in which uniform allocation pays off. On the other hand, Trace-UCB is able to successfully handle the tradeoff between learning and allocating according to the variance estimates $\hat\sigma_i^2$, while accounting for the contextual trace $\operatorname{Tr}(\Sigma \hat\Sigma_i^{-1})$, even for very low $n$. We observe that for large $n$, Var-UCB eventually reaches the performance of the optimal static allocation and Trace-UCB.

In practice the loss in (17) (Figures 1(e,f)) is often more relevant than (3), since it holds with high probability and not only in expectation. Trace-UCB shows excellent performance and robustness under this loss, regardless of the underlying variances $\sigma_i^2$.

Real Data. Trace-UCB is based on assumptions such as linearity, and Gaussianity of noise and context that may not hold in practice, where data may show complex dependencies. Therefore, it is important to evaluate the algorithm with real-world data to see its robustness to the violation of its assumptions. We consider two collaborative filtering datasets in which users provide ratings for items. We choose a dense subset of users and items, where every user has rated every item. Thus, each user is represented by a -dimensional vector of ratings. We define the user context by out of her ratings, and learn to predict her remaining ratings (each one is a problem instance). All item ratings are first centered, so each item’s mean is zero. In each simulation, out of the users are selected at random to be fed to the algorithm, also in random order. Algorithms can select any instance as the dataset contains the ratings of every instance for all the users. At the end of each simulation, we compute the prediction error for each instance by using the users that did not participate in training for that simulation. Finally, we report the median error across all simulations.

Fig. 2(a) reports the results using the Jester Dataset by Goldberg et al. (2001) that consists of joke ratings in a continuous scale from to . We take joke ratings as context and learn the ratings for another jokes. In addition, we add another function that counts the total number of jokes originally rated by the user. The latter is also centered, bounded to the same scale, and has higher variance (without conditioning on ). The number of total users is , and . When the number of observations is limited, the advantage of Trace-UCB is quite significant (the improvement w.r.t. uniform allocation goes from 45% to almost 20% for large , while w.r.t. Var-UCB it goes from almost 30% to roughly 5%), even though the model and context distribution are far from linear and Gaussian, respectively.

Fig. 2(b) shows the results for the MovieLens dataset (Maxwell Harper & Konstan, 2016), which consists of movie ratings between and with increments. We select popular movies rated by users, and randomly choose of them to learn (so ). In this case, all problems have similar variance () so uniform allocation seems appropriate. Both Trace-UCB and Var-UCB improve modestly on uniform allocation, and their performance is similar.

## 5 Conclusions

We studied the problem of adaptive allocation of contextual samples of dimension to estimate linear functions equally well, under heterogeneous noise levels that depend on the linear instance and are unknown to the decision-maker. We proposed Trace-UCB, an optimistic algorithm that successfully solves the exploration-exploitation dilemma by simultaneously learning the ’s, allocating samples according to their estimates, and balancing the contextual information across the instances. We also provided strong theoretical guarantees for two losses of interest: in expectation and in high probability. Simulations were conducted in several settings, with both synthetic and real data. The favorable results suggest that Trace-UCB is reliable and remarkably robust even in settings that fall outside its assumptions, making it a useful and simple tool to implement in practice.

Acknowledgements. A. Lazaric is supported by French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council and French National Research Agency projects ExTra-Learn (n.ANR-14-CE24-0010-01).

## References

• Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, Cs. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.
• Antos et al. (2008) Antos, A., Grover, V., and Szepesvári, Cs. Active learning in multi-armed bandits. In International Conference on Algorithmic Learning Theory, pp. 287–302, 2008.
• Carpentier et al. (2011) Carpentier, A., Lazaric, A., Ghavamzadeh, M., Munos, R., and Auer, P. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In Algorithmic Learning Theory, pp. 189–203. Springer, 2011.
• Goldberg et al. (2001) Goldberg, K., Roeder, T., Gupta, D., and Perkins, C. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001.
• Hastie et al. (2015) Hastie, T., Tibshirani, R., and Wainwright, M. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015.
• Maxwell Harper & Konstan (2016) Maxwell Harper, F. and Konstan, J. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
• Negahban & Wainwright (2011) Negahban, S. and Wainwright, M. Simultaneous support recovery in high dimensions: Benefits and perils of block-regularization. IEEE Transactions on Information Theory, 57(6):3841–3863, 2011.
• Obozinski et al. (2011) Obozinski, G., Wainwright, M., and Jordan, M. Support union recovery in high-dimensional multivariate regression. The Annals of Statistics, pp. 1–47, 2011.
• Pukelsheim (2006) Pukelsheim, F. Optimal Design of Experiments. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, 2006.
• Raskutti et al. (2010) Raskutti, G., Wainwright, M. J., and Yu, B. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11(8):2241–2259, 2010.
• Riquelme et al. (2017a) Riquelme, C., Ghavamzadeh, M., and Lazaric, A. Active learning for accurate estimation of linear models. arXiv preprint arXiv:1703.00579, 2017a.
• Riquelme et al. (2017b) Riquelme, C., Johari, R., and Zhang, B. Online active linear regression via thresholding. In Thirty-First AAAI Conference on Artificial Intelligence, 2017b.
• Sabato & Munos (2014) Sabato, S. and Munos, R. Active regression by stratification. In Advances in Neural Information Processing Systems, pp. 469–477, 2014.
• Vershynin (2010) Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027, 2010.
• Wainwright (2015) Wainwright, M. High-dimensional statistics: A non-asymptotic viewpoint. Draft, 2015.
• Wang et al. (2013) Wang, W., Liang, Y., and Xing, E. Block regularized lasso for multivariate multi-response linear regression. In AISTATS, 2013.
• Wiens & Li (2014) Wiens, D. and Li, P. V-optimal designs for heteroscedastic regression. Journal of Statistical Planning and Inference, 145:125–138, 2014.

## Appendix A Optimal Static Allocation

### A.1 Proof of Proposition 1

###### Proposition.

Given linear regression problems, each characterized by a parameter , Gaussian noise with variance , and Gaussian contexts with covariance , let , then the optimal OLS static allocation algorithm selects each instance

$$
k^*_{i,n} = \frac{\sigma_i^2}{\sum_j \sigma_j^2}\, n + (d+1)\left(1 - \frac{\sigma_i^2}{\bar{\sigma}^2}\right) \tag{19}
$$

times (up to rounding effects), and incurs the global error

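$$
L_n^* = L_n(\mathcal{A}^*_{\text{stat}}) = \bar{\sigma}^2\,\frac{md}{n} + O\!\left(\bar{\sigma}^2 \left(\frac{md}{n}\right)^{2}\right). \tag{20}
$$
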
###### Proof.

For the sake of readability in the following we drop the dependency on .

We first derive the equality in Eq. 2

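$$
\begin{aligned}
L_i(\hat{\beta}_i) &= \mathbb{E}_X\big[(X^\top \beta_i - X^\top \hat{\beta}_i)^2\big] \\
&= \mathbb{E}_X\big[(\hat{\beta}_i - \beta_i)^\top X X^\top (\hat{\beta}_i - \beta_i)\big] \\
&= (\hat{\beta}_i - \beta_i)^\top\, \mathbb{E}[X X^\top]\, (\hat{\beta}_i - \beta_i) \\
&= (\hat{\beta}_i - \beta_i)^\top \Sigma\, (\hat{\beta}_i - \beta_i) \\
&= \|\beta_i - \hat{\beta}_i\|_\Sigma^2 .
\end{aligned}
$$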
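As a quick numerical sanity check of this identity, the sketch below compares a Monte Carlo estimate of the expected squared prediction gap with the quadratic form in the last line. The covariance matrix, the two parameter vectors, and the sample size are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# Hypothetical covariance and parameters, chosen only for illustration.
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)              # symmetric positive definite
beta = rng.standard_normal(d)
beta_hat = beta + 0.1 * rng.standard_normal(d)
delta = beta_hat - beta

# Draw contexts X ~ N(0, Sigma) and average the squared prediction gap.
X = rng.multivariate_normal(np.zeros(d), Sigma, size=200_000)
mc_loss = np.mean((X @ delta) ** 2)

# Closed form from the derivation: the Sigma-weighted squared norm of delta.
closed_form = delta @ Sigma @ delta

print(mc_loss, closed_form)
```

The two printed numbers should agree up to Monte Carlo error, which shrinks as the number of sampled contexts grows.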

As a result, we can write the global error as

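$$
\begin{aligned}
L_n(\mathcal{A}_{\text{stat}}) &= \max_{i \in [m]} \mathbb{E}_{\mathcal{D}_{i,n}}\big[\|\beta_i - \hat{\beta}_i\|_\Sigma^2\big] \\
&= \max_{i \in [m]} \mathbb{E}_{\mathcal{D}_{i,n}}\big[\operatorname{Tr}\big((\beta_i - \hat{\beta}_i)^\top \Sigma\, (\beta_i - \hat{\beta}_i)\big)\big] \\
&= \max_{i \in [m]} \mathbb{E}_{\mathcal{D}_{i,n}}\big[\operatorname{Tr}\big(\Sigma\, (\beta_i - \hat{\beta}_i)(\beta_i - \hat{\beta}_i)^\top\big)\big] \\
&= \max_{i \in [m]} \operatorname{Tr}\big(\mathbb{E}_{\mathcal{D}_{i,n}}\big[\Sigma\, (\beta_i - \hat{\beta}_i)(\beta_i - \hat{\beta}_i)^\top\big]\big),
\end{aligned}
$$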

where is the training set extracted from containing the samples for instance . Since contexts and noise are independent random variables, we can decompose into the randomness related to the context matrix and the noise vector . We recall that for any fixed realization of , the OLS estimate is distributed as

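$$
\hat{\beta}_i \mid \mathbf{X}_i \sim \mathcal{N}\big(\beta_i,\ \sigma_i^2 (\mathbf{X}_i^\top \mathbf{X}_i)^{-1}\big), \tag{21}
$$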

which means that conditioned on is unbiased with covariance matrix given by . Thus, we can further develop as

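$$
\begin{aligned}
L_n(\mathcal{A}_{\text{stat}}) &= \max_{i \in [m]} \operatorname{Tr}\Big(\Sigma\, \mathbb{E}_{\mathbf{X}_i}\Big[\mathbb{E}_{\epsilon_i}\big[(\hat{\beta}_i - \beta_i)(\hat{\beta}_i - \beta_i)^\top \,\big|\, \mathbf{X}_i\big]\Big]\Big) \\
&= \max_{i \in [m]} \sigma_i^2 \operatorname{Tr}\Big(\Sigma\, \mathbb{E}_{\mathbf{X}_i}\big[(\mathbf{X}_i^\top \mathbf{X}_i)^{-1}\big]\Big) \\
&= \max_{i \in [m]} \sigma_i^2 \operatorname{Tr}\Big(\mathbb{E}_{\bar{\mathbf{X}}_i}\big[(\bar{\mathbf{X}}_i^\top \bar{\mathbf{X}}_i)^{-1}\big]\Big),
\end{aligned} \tag{22}
$$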

where is a whitened context and is its corresponding whitened matrix. Since whitened contexts are distributed as , we know that is distributed as an inverse Wishart , whose expectation is , and thus,

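$$
L_n(\mathcal{A}_{\text{stat}}) = \max_{i \in [m]} \sigma_i^2 \operatorname{Tr}\!\left[\frac{1}{k_i - d - 1}\, I_d\right] = \max_{i \in [m]} \frac{\sigma_i^2\, d}{k_i - d - 1}. \tag{23}
$$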
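The inverse-Wishart step behind Eq. 23 can be checked by simulation. The sketch below draws many whitened context matrices with standard normal entries and averages the trace of the inverse Gram matrix; the function name, the sizes `k` and `d`, and the number of trials are assumptions made for illustration only.

```python
import numpy as np

def mean_trace_inverse_gram(k, d, trials=20_000, seed=0):
    """Monte Carlo estimate of E[Tr((X^T X)^{-1})] for a k x d standard normal X."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((trials, k, d))      # one k x d matrix per trial
    gram = X.transpose(0, 2, 1) @ X              # batched X^T X, shape (trials, d, d)
    inv = np.linalg.inv(gram)                    # batched matrix inverse
    return np.trace(inv, axis1=1, axis2=2).mean()

k, d = 20, 3                                     # illustrative sizes with k > d + 1
estimate = mean_trace_inverse_gram(k, d)
exact = d / (k - d - 1)                          # trace of I_d / (k - d - 1)

print(estimate, exact)
```

Multiplying either quantity by a noise variance gives the per-instance loss that appears inside the maximum in Eq. 23.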

Note that this final expression requires that , since it is not possible to compute an OLS estimate with fewer than samples. Therefore, we proceed by minimizing Eq. 23, subject to . We write for some . Thus, equivalently, we minimize

$$
L_n(\mathcal{A}_{\text{stat}}) = \max_i \frac{\sigma_i^2\, d}{k_i'}. \tag{24}
$$

Since , we may conclude that the optimal is given by

$$
k_i' = \frac{\sigma_i^2}{\sum_j \sigma_j^2}\,\big(n - m(d+1)\big),
$$

so that all the terms in the RHS of Eq. 24 are equal. This gives us the optimal static allocation

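$$
k_i^* = \frac{\sigma_i^2}{\sum_j \sigma_j^2}\,\big(n - m(d+1)\big) + d + 1 = \frac{\sigma_i^2}{\sum_j \sigma_j^2}\, n + (d+1)\left(1 - \frac{\sigma_i^2}{\bar{\sigma}^2}\right), \tag{25}
$$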

where is the mean variance across the problem instances.

Thus, for the optimal static allocation, the expected loss is given by

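$$
\begin{aligned}
L_n^* = L_n(\mathcal{A}^*_{\text{stat}}) &= d\, \max_i \frac{\sigma_i^2}{\dfrac{\sigma_i^2}{\sum_j \sigma_j^2}\, n - (d+1)\dfrac{\sigma_i^2}{\bar{\sigma}^2}} \\
&= \Big(\sum_j \sigma_j^2\Big) \frac{d}{n - m(d+1)} \\
&= \Big(\sum_j \sigma_j^2\Big) \frac{d}{n} + \Big(\sum_j \sigma_j^2\Big) \frac{m\, d\, (d+1)}{n\,\big(n - m(d+1)\big)} \\
&= \Big(\sum_j \sigma_j^2\Big) \frac{d}{n} + O\!\left(\Big(\sum_j \sigma_j^2\Big) \frac{m\, d^2}{n^2}\right),
\end{aligned}
$$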

which concludes the proof. Furthermore, the following bounds trivially hold for any

$$
\frac{m\, d\, \bar{\sigma}^2}{n} \le L_n^* \le \frac{2\, m\, d\, \bar{\sigma}^2}{n}.
$$
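To make the allocation in Eq. 25 concrete, the following minimal sketch computes it for a small hypothetical example and checks that it equalizes the per-instance losses of Eq. 23. The function names and the numbers (three instances, `n = 400`, `d = 4`) are illustrative assumptions, and rounding of the allocation to integers is ignored.

```python
def optimal_static_allocation(sigma2, n, d):
    """Real-valued allocation k_i^* of Eq. 25 (rounding effects ignored)."""
    m = len(sigma2)
    if n <= m * (d + 1):
        raise ValueError("budget must exceed m * (d + 1)")
    total = sum(sigma2)
    mean = total / m                     # mean variance across instances
    return [s / total * n + (d + 1) * (1 - s / mean) for s in sigma2]

def static_loss(sigma2, k, d):
    """Worst-case expected loss of Eq. 23 for a given allocation k."""
    return max(s * d / (k_i - d - 1) for s, k_i in zip(sigma2, k))

# Hypothetical noise variances for m = 3 instances.
sigma2 = [1.0, 2.0, 5.0]
n, d = 400, 4
k_star = optimal_static_allocation(sigma2, n, d)
loss = static_loss(sigma2, k_star, d)

# Closed form from the proof: (sum_j sigma_j^2) * d / (n - m (d + 1)).
closed_form = sum(sigma2) * d / (n - len(sigma2) * (d + 1))
print(k_star, loss, closed_form)
```

In this example the allocation sums to the budget `n`, and noisier instances receive proportionally more samples beyond the `d + 1` needed for the OLS loss to be finite.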

## Appendix B Loss of an OLS-based Learning Algorithm (Proof of Lemma 2)

Unlike in the proof of Proposition 1, when the number of pulls is random and depends on the value of the previous observations (through ), then in general, the OLS estimates are no longer distributed as Eq. 21 and the derivation for no longer holds. In fact, for a learning algorithm, the value itself provides some information about the observations that have been obtained up until time and were used by the algorithm to determine . In the following, we show that by ignoring the current context when choosing instance , we are still able to analyze the loss of Trace-UCB and obtain a result very similar to the static case.

We first need two auxiliary lemmas (Lemmas 8 and 9), one on the computation of an empirical estimate of the variance of the noise, and an independence result between the variance estimate and the linear regression estimate.

###### Lemma 8.

In any linear regression problem with noise , after samples, given an OLS estimator , the noise variance estimator can be computed in a recursive form as

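$$
\hat{\sigma}^2_{t+1} = \frac{t-d}{t-d+1}\, \hat{\sigma}^2_t + \frac{1}{t-d+1}\, \frac{\big(X_{t+1}^\top \hat{\beta}_t - Y_{t+1}\big)^2}{1 + X_{t+1}^\top (\mathbf{X}_t^\top \mathbf{X}_t)^{-1} X_{t+1}}, \tag{26}
$$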

where is the sample matrix.
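The recursion in Eq. 26 can be checked numerically against the batch estimator. The sketch below is an illustration on synthetic data, not the paper's implementation: the function names, the parameter vector, the noise scale, and the sizes are all assumed.

```python
import numpy as np

def batch_variance(X, Y):
    """Batch estimator: OLS residual sum of squares divided by (t - d)."""
    t, d = X.shape
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    resid = Y - X @ beta_hat
    return resid @ resid / (t - d), beta_hat

def recursive_variance(sigma2_t, beta_hat_t, X_t, x_new, y_new):
    """One step of Eq. 26, using the estimates computed from the first t samples."""
    t, d = X_t.shape
    gram_inv = np.linalg.inv(X_t.T @ X_t)
    pred_err = x_new @ beta_hat_t - y_new
    scale = 1.0 + x_new @ gram_inv @ x_new
    return ((t - d) * sigma2_t + pred_err**2 / scale) / (t - d + 1)

# Synthetic data; the true parameter and noise level are illustrative only.
rng = np.random.default_rng(0)
t, d = 30, 3
X = rng.standard_normal((t + 1, d))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.7 * rng.standard_normal(t + 1)

sigma2_t, beta_t = batch_variance(X[:t], Y[:t])
updated = recursive_variance(sigma2_t, beta_t, X[:t], X[t], Y[t])
direct, _ = batch_variance(X, Y)

print(updated, direct)
```

The update only needs the previous variance estimate, the previous OLS estimate, and the inverse Gram matrix, so the residuals of earlier samples never have to be recomputed.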

###### Proof.

We first recall the “batch” definition of the variance estimator

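$$
\hat{\sigma}^2_t = \frac{1}{t-d} \sum_{s=1}^{t} \big(Y_s - X_s^\top \hat{\beta}_t\big)^2 = \frac{1}{t-d}\, \big\|\mathbf{Y}_t - \mathbf{X}_t \hat{\beta}_t\big\|^2 .
$$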

Since and , we have

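$$
\hat{\sigma}^2_t = \frac{1}{t-d}\, \big\|\mathbf{X}_t(\mathbf{X}_t^\top \mathbf{X}_t)^{-1} \mathbf{X}_t^\top \boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_t\big\|^2 = \frac{1}{t-d}\, \Big(\boldsymbol{\epsilon}_t^\top \boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_t^\top \mathbf{X}_t (\mathbf{X}_t^\top \mathbf{X}_t)^{-1} \mathbf{X}_t^\top \boldsymbol{\epsilon}_t\Big) = \frac{1}{t-d}\, \big(E_t - V_t\big).
$$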

We now devise a recursive formulation for the two terms in the previous expression. We have

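$$
E_{t+1} = \boldsymbol{\epsilon}_{t+1}^\top \boldsymbol{\epsilon}_{t+1} = \boldsymbol{\epsilon}_t^\top \boldsymbol{\epsilon}_t + \epsilon_{t+1}^2 = E_t + \epsilon_{t+1}^2 .
$$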

In order to analyze the second term we first introduce the design matrix , which has the simple update rule . Then we have

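$$
\begin{aligned}
V_{t+1} &= \big(\boldsymbol{\epsilon}_t^\top \mathbf{X}_t + \epsilon_{t+1} X_{t+1}^\top\big) \big(S_t + X_{t+1} X_{t+1}^\top\big)^{-1} \big(\boldsymbol{\epsilon}_t^\top \mathbf{X}_t + \epsilon_{t+1} X_{t+1}^\top\big)^\top \\
&= \big(\boldsymbol{\epsilon}_t^\top \mathbf{X}_t + \epsilon_{t+1} X_{t+1}^\top\big) \left(S_t^{-1} - \frac{S_t^{-1} X_{t+1} X_{t+1}^\top S_t^{-1}}{1 + X_{t+1}^\top S_t^{-1} X_{t+1}}\right) \big(\boldsymbol{\epsilon}_t^\top \mathbf{X}_t + \epsilon_{t+1} X_{t+1}^\top\big)^\top,
\end{aligned}
$$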

where we used the Sherman-Morrison formula in the last equality. We further develop the previous expression as

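$$
\begin{aligned}
V_{t+1} ={}& \boldsymbol{\epsilon}_t^\top \mathbf{X}_t S_t^{-1} \mathbf{X}_t^\top \boldsymbol{\epsilon}_t + \epsilon_{t+1}^2\, X_{t+1}^\top S_t^{-1} X_{t+1} + 2\, \epsilon_{t+1}\, \boldsymbol{\epsilon}_t^\top \mathbf{X}_t S_t^{-1} X_{t+1} \\
& - \boldsymbol{\epsilon}_t^\top \mathbf{X}_t\, \frac{S_t^{-1} X_{t+1} X_{t+1}^\top S_t^{-1}}{1 + X_{t+1}^\top S_t^{-1} X_{t+1}}\, \mathbf{X}_t^\top \boldsymbol{\epsilon}_t - \epsilon_{t+1} X_{t+1}^\top\, \frac{S_t^{-1} X_{t+1} X_{t+1}^\top S_t^{-1}}{1 + X_{t+1}^\top S_t^{-1} X_{t+1}}\, X_{t+1} \epsilon_{t+1} \\
& - 2\, \boldsymbol{\epsilon}_t^\top \mathbf{X}_t\, \frac{S_t^{-1} X_{t+1} X_{t+1}^\top S_t^{-1}}{1 + X_{t+1}^\top S_t^{-1} X_{t+1}}\, X_{t+1} \epsilon_{t+1} .
\end{aligned}
$$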
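The Sherman-Morrison step and the resulting expansion can be verified numerically. The following sketch uses random data of assumed sizes; the variable names (`quad`, `cross`, and so on) are local shorthand introduced for this illustration and are not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
t, d = 10, 4                                   # illustrative sizes
X_t = rng.standard_normal((t, d))              # first t contexts (rows)
eps_t = rng.standard_normal(t)                 # first t noise terms
x_new = rng.standard_normal(d)                 # context at time t + 1
eps_new = rng.standard_normal()                # noise at time t + 1

S_t = X_t.T @ X_t
S_inv = np.linalg.inv(S_t)

# Sherman-Morrison: inverse of a rank-one update of S_t.
u = S_inv @ x_new
denom = 1.0 + x_new @ u
sm_inverse = S_inv - np.outer(u, u) / denom
direct_inverse = np.linalg.inv(S_t + np.outer(x_new, x_new))

# V_{t+1} computed directly from its definition.
a = X_t.T @ eps_t
row = a + eps_new * x_new
V_next = row @ direct_inverse @ row

# The same quantity from the expanded expression.
V_t = a @ S_inv @ a
quad = x_new @ u                               # x^T S^{-1} x
cross = a @ u                                  # eps_t^T X_t S^{-1} x
expanded = (V_t + eps_new**2 * quad + 2 * eps_new * cross
            - (cross**2 + eps_new**2 * quad**2 + 2 * eps_new * cross * quad) / denom)

print(np.allclose(sm_inverse, direct_inverse), V_next, expanded)
```

Because the inverse Gram matrix is symmetric, the three subtracted terms collapse into a single squared quantity divided by the common denominator, which is what makes the recursion cheap to evaluate.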

We define and , and then obtain

 Vt+1 =Vt+ϵ2t+1ψt+1+2αt+1ϵt+1−α2t+11+ψt+1−ϵ2t+1ψ2t+11−ψt+1−2ϵt+1αt+1ψt+