# Sequential Transfer in Multi-armed Bandit with Finite Set of Models

School of Computer Science
Carnegie Mellon University
mazar@cs.cmu.edu
Alessandro Lazaric
INRIA Lille - Nord Europe
Villeneuve d’Ascq, France
alessandro.lazaric@inria.fr
Emma Brunskill
School of Computer Science
Carnegie Mellon University
ebrun@cs.cmu.edu
###### Abstract

Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi-armed bandit framework, where the objective is to minimize the cumulative regret over a sequence of tasks by incrementally transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for the estimation of the possible tasks and derive regret bounds for it.

## 1 Introduction

Learning from prior tasks and transferring that experience to improve future performance is a key aspect of intelligence, and is critical for building lifelong learning agents. Recently, multi-task and transfer learning have received much attention in the supervised and reinforcement learning (RL) settings, with encouraging empirical and theoretical results (see recent surveys by Pan and Yang, 2010; Lazaric, 2011). Most of these works focus on scenarios where the tasks are batch learning problems, in which a training set is directly provided to the learner. On the other hand, the online learning setting (Cesa-Bianchi and Lugosi, 2006), where the learner is presented with samples in a sequential fashion, has rarely been considered (see Mann and Choe (2012) for an example in RL and Sec. E in the supplementary material for a discussion on related settings).

In this paper we rely on a new variant of the method of moments (Anandkumar et al., 2012a, c), the robust tensor power method (RTP) (Anandkumar et al., 2012b), to estimate the latent variable model (LVM) associated with the sequential-bandit problem. RTP estimates the model means from the eigenvalues and eigenvectors of certain tensors (Anandkumar et al., 2012b). We prove that RTP provides a consistent estimate of the means of all arms for every bandit problem as long as they are pulled at least three times per task (Sec. 4.2). This guarantees that once RTP is paired with an efficient bandit algorithm able to exploit the transferred knowledge about the models (Sec. 4.3), we obtain a bandit algorithm, called tUCB, guaranteed to perform as well as UCB in early episodes, thus avoiding any negative transfer effect, and then to approach the performance of the ideal case when the set of bandit problems is known in advance (Sec. 4.4). Finally, we report some preliminary results on synthetic data confirming the theoretical findings (Sec. 5).

## 2 Preliminaries

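We consider a stochastic MAB problem defined by a set of arms $\mathcal{A} = \{1, \ldots, K\}$, where each arm $i \in \mathcal{A}$ is characterized by a distribution $\nu_i$ and the samples observed from each arm are independent and identically distributed. We focus on the setting where there exists a set of models $\Theta$, $|\Theta| = m$, which contains all the possible bandit problems. We denote the mean of an arm $i$, the best arm, and the best value of a model $\theta \in \Theta$ respectively by $\mu_i(\theta)$, $i_*(\theta)$, $\mu_*(\theta)$. We define the arm gap of an arm $i$ for a model $\theta$ as $\Delta_i(\theta) = \mu_*(\theta) - \mu_i(\theta)$, while the model gap for an arm $i$ between two models $\theta$ and $\theta'$ is defined as $\Gamma_i(\theta, \theta') = |\mu_i(\theta) - \mu_i(\theta')|$.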

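We also introduce some tensor notation. Let $X \in \mathbb{R}^K$ be a random realization of all the arms from a random model. All the realizations are i.i.d. conditional on a model $\bar\theta$ and $\mathbb{E}[X \mid \theta = \bar\theta] = \mu(\bar\theta)$, where the $i$-th component of $\mu(\theta) \in \mathbb{R}^K$ is $[\mu(\theta)]_i = \mu_i(\theta)$. Given two realizations $X^1$ and $X^2$, we define the second moment matrix $M_2 = \mathbb{E}[X^1 \otimes X^2]$ such that $[M_2]_{i,j} = \mathbb{E}[X^1_i X^2_j]$ and, with a third realization $X^3$, the third moment tensor $M_3 = \mathbb{E}[X^1 \otimes X^2 \otimes X^3]$. Since the realizations are conditionally independent, we have that $\mathbb{E}[X^1 \otimes X^2 \mid \theta] = \mu(\theta) \otimes \mu(\theta)$ and this allows us to rewrite the second and third moments as $M_2 = \sum_{\theta} \rho(\theta)\, \mu(\theta)^{\otimes 2}$ and $M_3 = \sum_{\theta} \rho(\theta)\, \mu(\theta)^{\otimes 3}$ (Anandkumar et al., 2012c), where $v^{\otimes p} = v \otimes v \otimes \cdots \otimes v$ is the $p$-th tensor power. Let $A$ be a $3^{\text{rd}}$ order member of the tensor product of the Euclidean space $\mathbb{R}^K$ (as $M_3$), then we define the multilinear map as follows. For a set of three matrices $\{V_i \in \mathbb{R}^{K \times m}\}_{1 \le i \le 3}$, the $(i_1, i_2, i_3)$ entry in the $3$-way array representation of $A(V_1, V_2, V_3) \in \mathbb{R}^{m \times m \times m}$ is $[A(V_1, V_2, V_3)]_{i_1, i_2, i_3} = \sum_{1 \le j_1, j_2, j_3 \le K} A_{j_1, j_2, j_3} [V_1]_{j_1, i_1} [V_2]_{j_2, i_2} [V_3]_{j_3, i_3}$. We also use different norms: the Euclidean norm $\|\cdot\|$; the Frobenius norm $\|\cdot\|_F$; the matrix max-norm $\|A\|_{\max} = \max_{i,j} |[A]_{i,j}|$.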
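As an illustration of the multilinear map, the following minimal sketch evaluates $[A(V_1, V_2, V_3)]_{i_1, i_2, i_3} = \sum_{j_1, j_2, j_3} A_{j_1, j_2, j_3} [V_1]_{j_1, i_1} [V_2]_{j_2, i_2} [V_3]_{j_3, i_3}$ on nested Python lists. The function names `tensor_power3` and `multilinear_map` are ours and purely illustrative; the nested-list representation is an assumption made to keep the example dependency-free.

```python
def tensor_power3(v):
    """Third tensor power v (x) v (x) v of a vector, as a nested list."""
    return [[[a * b * c for c in v] for b in v] for a in v]


def multilinear_map(A, V1, V2, V3):
    """Return A(V1, V2, V3) for a K x K x K tensor A and K x m matrices V1, V2, V3."""
    K = len(A)
    m = len(V1[0])
    out = [[[0.0] * m for _ in range(m)] for _ in range(m)]
    for i1 in range(m):
        for i2 in range(m):
            for i3 in range(m):
                total = 0.0
                # Contract each mode of A with the matching column of V1, V2, V3.
                for j1 in range(K):
                    for j2 in range(K):
                        for j3 in range(K):
                            total += (A[j1][j2][j3]
                                      * V1[j1][i1] * V2[j2][i2] * V3[j3][i3])
                out[i1][i2][i3] = total
    return out


# Rank-one sanity check: (v (x) v (x) v)(V, V, V) equals (V^T v) (x) (V^T v) (x) (V^T v).
v = [1.0, 2.0]
V = [[1.0], [1.0]]  # K = 2, m = 1, so V^T v = 3
projected = multilinear_map(tensor_power3(v), V, V, V)
```

On the rank-one example the single entry of `projected` is $(V^\top v)^3 = 27$, which is the property exploited when whitening the third moment.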

We consider the sequential transfer setting where at each episode $j$ the learner interacts with a task $\bar\theta^j$, drawn from a distribution $\rho$ over $\Theta$, for $n$ steps. The objective is to minimize the (pseudo-)regret $R_J$ over $J$ episodes, measured as the difference between the rewards obtained by the optimal arms and the rewards achieved by the learner. More formally, the regret is defined as

$$R_J = \sum_{j=1}^{J} R^j_n = \sum_{j=1}^{J} \sum_{i \neq i_*} T^j_{i,n}\, \Delta_i(\bar\theta^j), \tag{1}$$

where $T^j_{i,n}$ is the number of pulls to arm $i$ after $n$ steps of episode $j$. The only information available to the learner is the number of models $m$, the number of episodes $J$, and the number of steps $n$ per task.
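To make Eq. 1 concrete, here is a minimal sketch that evaluates the cumulative pseudo-regret from per-episode pull counts and arm gaps. The function name and the list-of-lists layout are our own illustrative choices; the optimal arm has zero gap, so including it in the sum is harmless.

```python
def cumulative_regret(pulls, gaps):
    """Cumulative pseudo-regret over episodes.

    pulls[j][i] is the number of pulls of arm i in episode j,
    gaps[j][i] is the arm gap of arm i under the model of episode j.
    """
    return sum(t * d
               for pulls_j, gaps_j in zip(pulls, gaps)
               for t, d in zip(pulls_j, gaps_j))


# Two episodes with two arms each; the optimal arm differs across episodes.
example_pulls = [[90, 10], [5, 95]]
example_gaps = [[0.0, 0.2], [0.5, 0.0]]
example_regret = cumulative_regret(example_pulls, example_gaps)
```

In this example the regret is $10 \cdot 0.2 + 5 \cdot 0.5 = 4.5$.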

## 3 Multi-armed Bandit with Finite Models

Before considering the transfer problem, we show that a simple variation of UCB can effectively exploit the knowledge of $\Theta$ and obtain a significant reduction in the regret. The mUCB (model-UCB) algorithm in Fig. 1 takes as input a set of models $\Theta$ including the current (unknown) model $\bar\theta$. At each step $t$, the algorithm computes a subset $\Theta_t \subseteq \Theta$ containing only the models whose means $\mu_i(\theta)$ are compatible with the current estimates $\hat\mu_{i,t}$ of the means $\mu_i(\bar\theta)$ of the current model, obtained by averaging $T_{i,t-1}$ pulls, and their uncertainty $\varepsilon_{i,t}$ (see Eq. 2 for an explicit definition of this term). Notice that it is enough that one arm does not satisfy the compatibility condition to discard a model $\theta$. Among all the models in $\Theta_t$, mUCB first selects the model with the largest optimal value and then pulls its corresponding optimal arm. This choice is coherent with the optimism-in-the-face-of-uncertainty principle used in UCB-based algorithms, since mUCB always pulls the optimal arm corresponding to the most optimistic model compatible with the current estimates $\hat\mu_{i,t}$. We show that mUCB incurs a regret which is never worse than that of UCB and is often significantly smaller.
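The description above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the implementation of Fig. 1: models are given as lists of arm means, `pull` is a hypothetical callback returning a reward in $[0, 1]$, an arm that has not been pulled yet is treated as compatible with every model, and if the active set ever becomes empty (a low-probability event) the sketch falls back to the full set.

```python
import math


def mucb(models, pull, n, delta):
    """Run a model-UCB style loop for n steps and return the sequence of pulled arms."""
    m, K = len(models), len(models[0])
    counts = [0] * K   # number of pulls of each arm so far
    sums = [0.0] * K   # cumulative reward of each arm
    history = []
    for _ in range(n):
        active = []
        for mu in models:
            compatible = True
            for i in range(K):
                if counts[i] == 0:
                    continue  # no samples yet: any mean is compatible (assumption)
                eps = math.sqrt(math.log(m * n ** 2 / delta) / (2 * counts[i]))
                if abs(mu[i] - sums[i] / counts[i]) > eps:
                    compatible = False  # one incompatible arm discards the model
                    break
            if compatible:
                active.append(mu)
        if not active:
            active = models  # fallback for the failure event (assumption)
        optimistic = max(active, key=max)        # model with the largest optimal value
        arm = optimistic.index(max(optimistic))  # its optimal arm
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        history.append(arm)
    return history


# Noise-free toy run: the true model is the second one.
true_means = [0.2, 0.8]
toy_history = mucb([[0.9, 0.1], [0.2, 0.8]], lambda i: true_means[i], n=200, delta=1 / 200)
```

In the toy run the optimistic (wrong) model is pulled first, is discarded once its optimal arm has been sampled often enough, and the true optimal arm is pulled from then on.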

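We denote the set of arms which are optimal for at least one model in a set $\Theta'$ as $\mathcal{A}_*(\Theta') = \{i \in \mathcal{A} : \exists \theta \in \Theta', i_*(\theta) = i\}$. The set of models for which the arms in $\mathcal{A}'$ are optimal is $\Theta(\mathcal{A}') = \{\theta \in \Theta : \exists i \in \mathcal{A}', i_*(\theta) = i\}$. The set of optimistic models for a given model $\bar\theta$ is $\Theta_+ = \{\theta \in \Theta : \mu_*(\theta) \ge \mu_*(\bar\theta)\}$, and their corresponding optimal arms are $\mathcal{A}_+ = \mathcal{A}_*(\Theta_+)$. The following theorem bounds the expected regret (similar bounds hold in high probability). The lemmas and proofs (using standard tools from the bandit literature) are available in Sec. B of the supplementary material.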

###### Theorem 1.

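If mUCB is run with $\delta = 1/n$, a set of $m$ models $\Theta$ such that $\bar\theta \in \Theta$, and

$$\varepsilon_{i,t} = \sqrt{\log(mn^2/\delta) / (2 T_{i,t-1})}, \tag{2}$$

where $T_{i,t-1}$ is the number of pulls to arm $i$ at the beginning of step $t$, then its expected regret is

$$\mathbb{E}[R_n] \le K + \sum_{i \in \mathcal{A}_+} \frac{2 \Delta_i(\bar\theta) \log(mn^3)}{\min_{\theta \in \Theta_{+,i}} \Gamma_i(\theta, \bar\theta)^2} \le K + \sum_{i \in \mathcal{A}_+} \frac{2 \log(mn^3)}{\min_{\theta \in \Theta_{+,i}} \Gamma_i(\theta, \bar\theta)}, \tag{3}$$

where $\mathcal{A}_+ = \mathcal{A}_*(\Theta_+)$ is the set of arms which are optimal for at least one optimistic model and $\Theta_{+,i} = \{\theta \in \Theta_+ : i_*(\theta) = i\}$ is the set of optimistic models for which $i$ is the optimal arm.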

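Remark (comparison to UCB). The UCB algorithm incurs a regret

$$\mathbb{E}[R_n(\text{UCB})] \le O\Big(\sum_{i \in \mathcal{A}} \frac{\log n}{\Delta_i(\bar\theta)}\Big) \le O\Big(K \frac{\log n}{\min_i \Delta_i(\bar\theta)}\Big).$$

We see that mUCB displays two major improvements. The regret in Eq. 3 can be written as

$$\mathbb{E}[R_n(\text{mUCB})] \le O\Big(\sum_{i \in \mathcal{A}_+} \frac{\log n}{\min_{\theta \in \Theta_{+,i}} \Gamma_i(\theta, \bar\theta)}\Big) \le O\Big(|\mathcal{A}_+| \frac{\log n}{\min_i \min_{\theta \in \Theta_{+,i}} \Gamma_i(\theta, \bar\theta)}\Big).$$

This result suggests that mUCB tends to discard all the models in $\Theta_+$ from the most optimistic down to the actual model $\bar\theta$ which, with high probability, is never discarded. As a result, even if other models are still in $\Theta_t$, the optimal arm of $\bar\theta$ is pulled until the end. This significantly reduces the set of arms which are actually pulled by mUCB, and the previous bound only depends on the number of arms in $\mathcal{A}_+$, which is $|\mathcal{A}_+| \le |\mathcal{A}_*(\Theta)| \le K$. Furthermore, it is possible to show that for all arms $i \in \mathcal{A}_+$, the minimum gap $\min_{\theta \in \Theta_{+,i}} \Gamma_i(\theta, \bar\theta)$ is guaranteed to be larger than the arm gap $\Delta_i(\bar\theta)$ (see Lem. 4 in Sec. B), thus further improving the performance of mUCB w.r.t. UCB.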
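The two improvements can be checked numerically on a toy set of models. The sketch below compares the problem-dependent factors of the two bounds, namely $\sum_{i} 1/\Delta_i(\bar\theta)$ for UCB and $\sum_{i \in \mathcal{A}_+} 1/\min_{\theta \in \Theta_{+,i}} \Gamma_i(\theta, \bar\theta)$ for mUCB, ignoring constants and logarithmic terms. The function names and the three toy models are hypothetical, and the optimal arm of the true model is skipped because it contributes no regret.

```python
def optimistic_models_by_arm(models, true_idx):
    """Group the optimistic models (optimal value >= the true one) by their optimal arm."""
    best = max(models[true_idx])
    groups = {}
    for mu in models:
        if max(mu) >= best:
            groups.setdefault(mu.index(max(mu)), []).append(mu)
    return groups


def mucb_complexity(models, true_idx):
    """Sum over optimistic arms of 1 / (minimum model gap on that arm)."""
    mu_bar = models[true_idx]
    i_star = mu_bar.index(max(mu_bar))
    total = 0.0
    for i, thetas in optimistic_models_by_arm(models, true_idx).items():
        if i == i_star:
            continue  # pulling the true optimal arm incurs no regret
        total += 1.0 / min(abs(mu[i] - mu_bar[i]) for mu in thetas)
    return total


def ucb_complexity(models, true_idx):
    """Sum over suboptimal arms of 1 / (arm gap)."""
    mu_bar = models[true_idx]
    best = max(mu_bar)
    return sum(1.0 / (best - x) for x in mu_bar if x < best)


toy_models = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.5], [0.2, 0.3, 0.6]]
toy_mucb = mucb_complexity(toy_models, true_idx=2)
toy_ucb = ucb_complexity(toy_models, true_idx=2)
```

Here the model gaps on the two optimistic arms are $0.7$ and $0.5$, while the corresponding arm gaps are $0.4$ and $0.3$, so the mUCB factor (about $3.43$) is smaller than the UCB factor (about $5.83$).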

## 4 Online Transfer with Unknown Models

We now consider the case when the set of models $\Theta$ is unknown and the regret is cumulated over multiple tasks drawn from $\rho$ (Eq. 1). We introduce tUCB (transfer-UCB), which transfers estimates of $\Theta$ whose accuracy improves through episodes using a method-of-moments approach.

### 4.1 The transfer-UCB Bandit Algorithm

Fig. 3 outlines the structure of our online transfer bandit algorithm tUCB (transfer-UCB). The algorithm uses two sub-algorithms: the bandit algorithm umUCB (uncertain model-UCB), whose objective is to minimize the regret at each episode, and RTP (robust tensor power method), which at each episode $j$ computes an estimate $\{\hat\mu^j_i(\theta)\}$ of the arm means of all the models. The bandit algorithm umUCB in Fig. 3 is an extension of the mUCB algorithm. It first computes a set of models $\Theta^j_t$ whose means $\hat\mu^j_i(\theta)$ are compatible with the current estimates $\hat\mu_{i,t}$. However, unlike the case where the exact models are available, here the models themselves are estimated and the uncertainty $\varepsilon^j$ in their means (provided as input to umUCB) is taken into account in the definition of $\Theta^j_t$. Once the active set is computed, the algorithm computes an upper-confidence bound on the value of each arm $i$ for each model $\theta$ and returns the best arm for the most optimistic model. Unlike in mUCB, due to the uncertainty over the model estimates, a model $\theta$ might have more than one optimal arm, and an upper-confidence bound on the mean of the arms $\hat\mu^j_i(\theta) + \varepsilon^j$ is used together with the upper-confidence bound $\hat\mu_{i,t} + \varepsilon_{i,t}$, which is directly derived from the samples observed so far from arm $i$. This guarantees that the $B$-values are always consistent with the samples generated from the actual model $\bar\theta^j$.

Once umUCB terminates, RTP (Fig. 4) updates the estimates of the model means $\hat\mu^j(\theta)$ using the samples obtained from each arm $i$. At the beginning of each task umUCB pulls all the arms $3$ times, since RTP needs at least $3$ samples from each arm to accurately estimate the $2^{\text{nd}}$ and $3^{\text{rd}}$ moments (Anandkumar et al., 2012b). More precisely, RTP uses all the reward samples generated up to episode $j$ to estimate the $2^{\text{nd}}$ and $3^{\text{rd}}$ moments (see Sec. 2) as

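$$\hat M_2 = j^{-1} \sum_{l=1}^{j} \bar\mu_{1l} \otimes \bar\mu_{2l}, \quad \text{and} \quad \hat M_3 = j^{-1} \sum_{l=1}^{j} \bar\mu_{1l} \otimes \bar\mu_{2l} \otimes \bar\mu_{3l}, \tag{4}$$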

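where the vectors $\bar\mu_{1l}, \bar\mu_{2l}, \bar\mu_{3l} \in \mathbb{R}^K$ are obtained by dividing the $T^l_{i,n}$ samples observed from arm $i$ in episode $l$ in three batches and taking their average (e.g., $[\bar\mu_{1l}]_i$ is the average of the first $T^l_{i,n}/3$ samples). (Footnote 1: Notice that $([\bar\mu_{1l}]_i + [\bar\mu_{2l}]_i + [\bar\mu_{3l}]_i)/3 = \hat\mu^l_{i,n}$, the empirical mean of arm $i$ at the end of episode $l$.) Since $\bar\mu_{1l}, \bar\mu_{2l}, \bar\mu_{3l}$ are independent estimates of $\mu(\bar\theta^l)$, $\hat M_2$ and $\hat M_3$ are consistent estimates of the second and third moments $M_2$ and $M_3$.

RTP relies on the fact that the model means $\mu(\theta)$ can be recovered from the spectral decomposition of the symmetric tensor $T = M_3(W, W, W)$, where $W$ is a whitening matrix for $M_2$, i.e., $M_2(W, W) = I_{m \times m}$ (see Sec. 2 for the definition of the mapping $A(V_1, V_2, V_3)$). Anandkumar et al. (2012b) (Thm. 4.3) have shown that under some mild assumption (see later Assumption 1) the model means $\{\mu(\theta)\}_{\theta \in \Theta}$ can be obtained as $\mu(\theta) = \lambda(\theta) B v(\theta)$, where $(v(\theta), \lambda(\theta))$ is a pair of eigenvector/eigenvalue for the tensor $T$ and $B = (W^\top)^+$. Thus the RTP algorithm estimates the eigenvectors $\hat v(\theta)$ and the eigenvalues $\hat\lambda(\theta)$ of the $m \times m \times m$ tensor $\hat T = \hat M_3(\hat W, \hat W, \hat W)$. (Footnote 2: The matrix $\hat W \in \mathbb{R}^{K \times m}$ is such that $\hat M_2(\hat W, \hat W) = I_{m \times m}$, i.e., $\hat W$ is the whitening matrix of $\hat M_2$. In general $\hat W$ is not unique. Here, we choose $\hat W = \hat U \hat D^{-1/2}$, where $\hat D \in \mathbb{R}^{m \times m}$ is a diagonal matrix consisting of the $m$ largest eigenvalues of $\hat M_2$ and $\hat U \in \mathbb{R}^{K \times m}$ has the corresponding eigenvectors as its columns.) Once $\hat v(\theta)$ and $\hat\lambda(\theta)$ are computed, the estimated mean vector $\hat\mu^j(\theta)$ is obtained by the inverse transformation $\hat\mu^j(\theta) = \hat\lambda(\theta) \hat B \hat v(\theta)$, where $\hat B$ is the pseudo-inverse of $\hat W^\top$ (for a detailed description of the RTP algorithm see Anandkumar et al., 2012b).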
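The moment estimates of Eq. 4 can be sketched as follows. This is a minimal illustration and not the RTP implementation: the function names are ours, rewards are stored as plain lists, each arm is assumed to have at least three samples per episode, and the last batch absorbs any remainder when the number of samples is not a multiple of three (an assumption of this sketch). Whitening and the tensor power iterations are not shown.

```python
def batch_means(samples):
    """Split each arm's rewards into three batches and return the three vectors of batch means.

    samples[i] is the list of rewards observed from arm i in one episode (length >= 3).
    """
    means = ([], [], [])
    for rewards in samples:
        size = len(rewards) // 3
        batches = (rewards[:size], rewards[size:2 * size], rewards[2 * size:])
        for b in range(3):
            means[b].append(sum(batches[b]) / len(batches[b]))
    return means


def empirical_moments(episodes):
    """Average the outer products of the batch means over episodes (second and third moments)."""
    j = len(episodes)
    K = len(episodes[0])
    M2 = [[0.0] * K for _ in range(K)]
    M3 = [[[0.0] * K for _ in range(K)] for _ in range(K)]
    for samples in episodes:
        m1, m2, m3 = batch_means(samples)
        for a in range(K):
            for b in range(K):
                M2[a][b] += m1[a] * m2[b] / j
                for c in range(K):
                    M3[a][b][c] += m1[a] * m2[b] * m3[c] / j
    return M2, M3


# Two identical noise-free episodes with means (0.5, 1.0).
toy_episodes = [[[0.5] * 6, [1.0] * 6], [[0.5] * 6, [1.0] * 6]]
toy_M2, toy_M3 = empirical_moments(toy_episodes)
```

With noise-free rewards the estimates reduce to the tensor powers of the mean vector, e.g. `toy_M2[0][1]` is $0.5 \cdot 1.0$ and `toy_M3[0][0][0]` is $0.5^3$.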

### 4.2 Sample Complexity of the Robust Tensor Power Method

umUCB requires as input $\varepsilon^j$, i.e., the uncertainty of the model estimates. Therefore we need finite sample complexity bounds on the accuracy of $\hat\mu^j_i(\theta)$ computed by RTP. The performance of RTP is directly affected by the error of the estimates $\hat M_2$ and $\hat M_3$ w.r.t. the true moments. In Thm. 2 we prove that, as the number of tasks $j$ grows, this error rapidly decreases with the rate of $\sqrt{1/j}$. This result provides us with an upper-bound on the error $\varepsilon^j$ needed for building the confidence intervals in umUCB. The following definition and assumption are required for our result.

###### Definition 1.

Let be the set of largest eigenvalues of the matrix . Define , and . Define the minimum gap between the distinct eigenvalues of as .

###### Assumption 1.

The mean vectors $\{\mu(\theta)\}_{\theta \in \Theta}$ are linearly independent and $\rho(\theta) > 0$ for all $\theta \in \Theta$.

We now state our main result which is in the form of a high probability bound on the estimation error of mean reward vector of every model .

###### Theorem 2.

Pick . Let , where is a universal constant. Then under Assumption 1 there exist constants and a permutation on such that after tasks

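$$\max_{\theta} \|\mu(\theta) - \hat\mu^j(\pi(\theta))\| \le C(\Theta)\, K^{2.5} m^2 \sqrt{\frac{\log(K/\delta)}{j}},$$

w.p. $1 - \delta$, given that

$$j \ge \frac{C_4\, m^5 K^6 \log(K/\delta)}{\min(\sigma_{\min}, \Gamma_\sigma)^2\, \sigma_{\min}^3\, \lambda_{\min}^2}. \tag{5}$$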
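To visualize the $\sqrt{1/j}$ rate, the following sketch evaluates the right-hand side of the error bound $C(\Theta) K^{2.5} m^2 \sqrt{\log(K/\delta)/j}$. Since $C(\Theta)$ is not computable in general, it is passed as a free parameter `c_theta` (a hypothetical name); the values used below are placeholders and not constants from the analysis.

```python
import math


def rtp_error_bound(j, K, m, delta, c_theta):
    """Right-hand side of the estimation-error bound after j tasks (c_theta is a free constant)."""
    return c_theta * K ** 2.5 * m ** 2 * math.sqrt(math.log(K / delta) / j)


# Quadrupling the number of tasks halves the bound, whatever the constant.
bound_100 = rtp_error_bound(100, K=7, m=5, delta=0.05, c_theta=1.0)
bound_400 = rtp_error_bound(400, K=7, m=5, delta=0.05, c_theta=1.0)
```

The ratio `bound_100 / bound_400` equals $2$, which is the $\sqrt{1/j}$ decay stated above; the bound is only meaningful once $j$ also satisfies the condition in Eq. 5.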

Remark (comparison with the previous bounds). This bound improves on the previous bounds of Anandkumar et al. (2012c, a) moving from a dependency on the number of models of order to a milder quadratic dependency on $m$. (Footnote 3: Note that the improvement is mainly due to accuracy of the orthogonal tensor decomposition obtained via the tensor power method relative to the previously cited works. This is a direct consequence of the perturbation bound of Anandkumar et al. (2012b, Thm. 5.1), which is at the core of our sample complexity bound.) (Footnote 4: The result of Anandkumar et al. (2012a) has the explicit dependency of order on the number of models as well as implicit dependency of order through the parameter .) Although the dependency on is a bit worse in our bounds in comparison to those of Anandkumar et al. (2012c, a), here we have the advantage that there is no dependency on the smallest singular value of the matrix , whereas those results scale polynomially with this factor.

Remark (computation of $C(\Theta)$). As illustrated in Fig. 3, umUCB relies on the estimates $\hat\mu^j(\theta)$ and on their accuracy $\varepsilon^j$. Although the bound reported in Thm. 2 provides an upper confidence bound on the error of the estimates, it contains terms which are not computable in general (e.g., ). In practice, $C(\Theta)$ should be considered as a parameter of the algorithm. (Footnote 5: One may also estimate the constant $C(\Theta)$ in an online fashion using the doubling trick (Audibert et al., 2012).) This is not dissimilar from the parameter usually introduced in the definition of $\varepsilon_{i,t}$ in front of the square-root term in UCB.
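Treating the model uncertainty as a tunable parameter, one step of a umUCB-style arm selection can be sketched as follows. This is an illustration under explicit assumptions, not the algorithm of Fig. 3: we assume that an estimated model is kept when, on every arm, its estimated mean is within $\varepsilon_{i,t} + \varepsilon^j$ of the empirical mean, and that the value of arm $i$ under a model is the minimum of the model-based bound $\hat\mu^j_i(\theta) + \varepsilon^j$ and the sample-based bound $\hat\mu_{i,t} + \varepsilon_{i,t}$. The name `umucb_select_arm` and the fallback to all models when none is compatible are ours.

```python
import math


def umucb_select_arm(est_models, eps_j, counts, sums, n, delta):
    """Select an arm given estimated models, their uncertainty eps_j, and the samples so far."""
    m, K = len(est_models), len(est_models[0])

    def width(i):
        # Sample-based confidence width; infinite before the first pull (assumption).
        if counts[i] == 0:
            return math.inf
        return math.sqrt(math.log(m * n ** 2 / delta) / (2 * counts[i]))

    def emp(i):
        return sums[i] / counts[i] if counts[i] else 0.0

    active = [mu for mu in est_models
              if all(abs(mu[i] - emp(i)) <= width(i) + eps_j for i in range(K))]
    if not active:
        active = est_models  # fallback for the failure event (assumption)

    best_arm, best_value = 0, -math.inf
    for mu in active:
        for i in range(K):
            # Assumed B-value: the tighter of the model-based and sample-based bounds.
            b = min(mu[i] + eps_j, emp(i) + width(i))
            if b > best_value:
                best_arm, best_value = i, b
    return best_arm


toy_est = [[0.9, 0.1], [0.2, 0.8]]
first_arm = umucb_select_arm(toy_est, 0.05, [0, 0], [0.0, 0.0], 200, 1 / 200)
later_arm = umucb_select_arm(toy_est, 0.05, [100, 0], [20.0, 0.0], 200, 1 / 200)
```

Under these assumptions, a very large `eps_j` keeps every model active and makes the sample-based bound the binding one, which is one way to see why the selection degrades gracefully towards a UCB-like rule when the transferred estimates are unreliable.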

### 4.3 Regret Analysis of umUCB

We now analyze the regret of umUCB when an estimated set of models is provided as input. At episode , for each model we define the set of non-dominated arms (i.e., potentially optimal arms) as . Among the non-dominated arms, when the actual model is , the set of optimistic arms is . As a result, the set of optimistic models is . In some cases, because of the uncertainty in the model estimates, unlike in mUCB, not all the models can be discarded, not even at the end of a very long episode. Among the optimistic models, the set of models that cannot be discarded is defined as . Finally, when we want to apply the previous definitions to a set of models instead of single model we have, e.g., .

The proofs of the following results are available in Sec. D of the supplementary material; here we report only the number of pulls and the corresponding regret bound.

###### Corollary 1.

If at episode umUCB is run with as in Eq. 2 and as in Eq. 2 with a parameter , then for any arm , is pulled times such that

w.p. $1 - \delta$, where $\Theta^j_{i,+}$ is the set of models for which $i$ is among their optimistic non-dominated arms, $\mathcal{A}^j_1$ is the set of arms only proposed by models that can be discarded, and $\mathcal{A}^j_2$ is the set of arms only proposed by models that cannot be discarded.

The previous corollary states that arms which cannot be optimal for any optimistic model (i.e., the optimistic non-dominated arms) are never pulled by umUCB, which focuses only on arms in . Among these arms, those that may help to remove a model from the active set (i.e., ) are potentially pulled less than UCB, while the remaining arms, which are optimal for the models that cannot be discarded (i.e., ), are simply pulled according to a UCB strategy. Similar to mUCB, umUCB first pulls the arms that are more optimistic until either the active set changes or they are no longer optimistic (because of the evidence from the actual samples). We are now ready to derive the per-episode regret of umUCB.

###### Theorem 3.

If umUCB is run for $n$ steps on the set of models $\Theta^j$ estimated by RTP after $j$ episodes with $\delta = 1/n$, and the actual model is $\bar\theta^j$, then its expected regret (w.r.t. the random realization in episode $j$ and conditional on $\bar\theta^j$) is

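$$\mathbb{E}[R^j_n] \le K + \sum_{i \in \mathcal{A}^j_1} \min\left\{ \frac{2 \log(2mKn^3)}{\Delta_i(\bar\theta^j)^2}, \frac{\log(2mKn^3)}{2 \min_{\theta \in \Theta^j_{i,+}(\bar\theta^j)} \hat\Gamma_i(\theta; \bar\theta^j)^2} \right\} \Delta_i(\bar\theta^j) + \sum_{i \in \mathcal{A}^j_2} \frac{2 \log(2mKn^3)}{\Delta_i(\bar\theta^j)}.$$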
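The per-arm term of the first sum can be evaluated directly. The sketch below (function name and inputs are illustrative) returns $\min\{2L/\Delta_i^2,\; L/(2\hat\Gamma_{\min}^2)\}\,\Delta_i$ with $L = \log(2mKn^3)$, where `gamma_hat_min` stands for the minimum estimated model gap over $\Theta^j_{i,+}$. By simple algebra the model-based term is the smaller of the two exactly when $\hat\Gamma_{\min} > \Delta_i/2$; otherwise the term coincides with the UCB-like one.

```python
import math


def per_arm_regret_term(delta_i, gamma_hat_min, m, K, n):
    """Contribution of one arm of the first sum to the per-episode regret bound."""
    log_term = math.log(2 * m * K * n ** 3)
    ucb_like = 2 * log_term / delta_i ** 2             # pulls needed by a UCB-style argument
    model_based = log_term / (2 * gamma_hat_min ** 2)  # pulls needed to discard the models
    return min(ucb_like, model_based) * delta_i


# Large estimated model gap: the model-based term is active.
helpful = per_arm_regret_term(0.2, 0.5, m=5, K=7, n=1000)
# Small estimated model gap: the bound falls back to the UCB-like term.
fallback = per_arm_regret_term(0.2, 0.05, m=5, K=7, n=1000)
```

The numbers $m = 5$, $K = 7$, and $n = 1000$ are placeholders chosen only to make the example runnable.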

Remark (negative transfer). The transfer of knowledge introduces a bias in the learning process which is often beneficial. Nonetheless, in many cases transfer may result in a bias towards wrong solutions and a worse learning performance, a phenomenon often referred to as negative transfer. The first interesting aspect of the previous theorem is that umUCB is guaranteed to never perform worse than UCB itself. This implies that tUCB never suffers from negative transfer, even when the set contains highly uncertain models and might bias umUCB to pull suboptimal arms.

Remark (improvement over UCB). In Sec. 3 we showed that mUCB exploits the knowledge of $\Theta$ to focus on a restricted set of arms which are pulled less than UCB. In umUCB this improvement is not as clear, since the models in $\Theta$ are not known but are estimated online through episodes. Yet, similar to mUCB, umUCB has the two main sources of potential improvement w.r.t. UCB. As illustrated by the regret bound in Thm. 3, umUCB focuses on arms in $\mathcal{A}^j_1 \cup \mathcal{A}^j_2$, which is potentially a smaller set than $\mathcal{A}$. Furthermore, the number of pulls to arms in $\mathcal{A}^j_1$ is smaller than for UCB whenever the estimated model gap is bigger than . Eventually, umUCB reaches the same performance (and improvement over UCB) as mUCB when $j$ is big enough. In fact, the set of optimistic models reduces to the one used in mUCB (i.e., ) and all the optimistic models have only optimal arms (i.e., for any the set of non-dominated optimistic arms is ), which corresponds to and , which matches the condition of mUCB. For instance, for any model , to have we need for any arm that . As a result episodes are needed in order for all the optimistic models to have only one optimal arm independently from the actual identity of the model . Although this condition may seem restrictive, in practice umUCB starts improving over UCB much earlier, as illustrated in the numerical simulation in Sec. 5.

### 4.4 Regret Analysis of tUCB

Given the previous results, we derive the bound on the cumulative regret over episodes (Eq. 1).

###### Theorem 4.

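If tUCB is run over $J$ episodes of $n$ steps in which the tasks $\bar\theta^j$ are drawn from a fixed distribution $\rho$ over a set of models $\Theta$, then its cumulative regret is

$$R_J \le JK + \sum_{j=1}^{J} \sum_{i \in \mathcal{A}^j_1} \min\left\{ \frac{2 \log(2mKn^2/\delta)}{\Delta_i(\bar\theta^j)^2}, \frac{\log(2mKn^2/\delta)}{2 \min_{\theta \in \Theta^j_{i,+}(\bar\theta^j)} \hat\Gamma^j_i(\theta; \bar\theta^j)^2} \right\} \Delta_i(\bar\theta^j) + \sum_{j=1}^{J} \sum_{i \in \mathcal{A}^j_2} \frac{2 \log(2mKn^2/\delta)}{\Delta_i(\bar\theta^j)},$$

w.p. $1 - \delta$ w.r.t. the randomization over tasks and the realizations of the arms in each episode.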

This result immediately follows from Thm. 3 and it shows a linear dependency on the number of episodes $J$. This dependency is the price to pay for not knowing the identity of the current task $\bar\theta^j$. If the identity of the task were revealed at the beginning of each episode, a bandit algorithm could simply cluster all the samples coming from the same task and incur a much smaller cumulative regret with a logarithmic dependency on episodes and steps, i.e., $\log(nJ)$. Nonetheless, as discussed in the previous section, the cumulative regret of tUCB is never worse than that of UCB and, as the number of tasks increases, it approaches the performance of mUCB, which fully exploits the prior knowledge of $\Theta$.

## 5 Numerical Simulations

In this section we report preliminary results of tUCB on synthetic data. The objective is to illustrate and support the previous theoretical findings. We define a set $\Theta$ of MAB problems with $K = 7$ arms each, whose means are reported in Fig. 6 (see Sec. F in the supplementary material for the actual values), where each model has a different color and squares correspond to optimal arms (e.g., arm is optimal for model ). This set of models is chosen to be challenging and to illustrate some interesting cases useful to understand the functioning of the algorithm. (Footnote 6: Notice that although satisfies Assumption 1, the smallest singular value and , thus making the estimation of the models difficult.) Models and only differ in their optimal arms and this makes it difficult to distinguish them. For arm 3 (which is optimal for model and thus potentially selected by mUCB), all the models share exactly the same mean value. This implies that no model can be discarded by pulling it. Although this might suggest that mUCB gets stuck in pulling arm 3, we showed in Thm. 1 that this is not the case. Models and are challenging for UCB since they have small minimum gap. Only 5 out of the 7 arms are actually optimal for a model in $\Theta$. Thus, we also report the performance of UCB+ which, under the assumption that $\Theta$ is known, immediately discards all the arms which are not optimal ($i \notin \mathcal{A}_*(\Theta)$) and performs UCB on the remaining arms. The model distribution is uniform, i.e., $\rho(\theta) = 1/m$.
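The UCB+ baseline can be sketched as a UCB rule restricted to the arms that are optimal for at least one model. The sketch below is illustrative only: the function names are ours, and the confidence width $\sqrt{2 \log t / T_i}$ is the classical UCB1 choice, used here as an assumption since the exact width used in the experiments is not restated in this section.

```python
import math


def optimal_arms(models):
    """Arms that are optimal for at least one model in the set."""
    return sorted({mu.index(max(mu)) for mu in models})


def ucb_plus_select(models, counts, sums, t):
    """One UCB step at time t restricted to the arms returned by optimal_arms."""
    arms = optimal_arms(models)
    for i in arms:
        if counts[i] == 0:
            return i  # pull each admissible arm once before using the index
    return max(arms,
               key=lambda i: sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i]))


toy_set = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.5]]
admissible = optimal_arms(toy_set)
chosen = ucb_plus_select(toy_set, counts=[5, 5, 0], sums=[1.0, 4.0, 0.0], t=11)
```

In the toy set the third arm is never optimal, so it is excluded from the start and is never pulled, regardless of the observed rewards.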

Before discussing the transfer results, we compare UCB, UCB+, and mUCB, to illustrate the advantage of the prior knowledge of $\Theta$ w.r.t. UCB. Fig. 8 reports the per-episode regret of the three algorithms for episodes of different length $n$ (the performance of tUCB is discussed later). The results are averaged over all the models in $\Theta$ and over runs each. All the algorithms use the same confidence bound $\varepsilon_{i,t}$. The performance of mUCB is significantly better than both UCB and UCB+, thus showing that mUCB makes an efficient use of the prior knowledge of $\Theta$. Furthermore, in Fig. 6 the horizontal lines correspond to the value of the regret bounds up to the $n$-dependent terms and constants (Footnote 7: For instance, for UCB we compute .) for the different models in $\Theta$ averaged w.r.t. $\rho$ for the three algorithms (the actual values for the different models are in the supplementary material). These values show that the improvement observed in practice is accurately predicted by the upper-bounds derived in Thm. 1.

We now move to analyze the performance of tUCB. In Fig. 8 we show how the per-episode regret changes through episodes for a transfer problem with tasks of length . In tUCB we used as in Eq. 2 with . As discussed in Thm. 3, UCB and mUCB define the boundaries of the performance of tUCB. In fact, at the beginning tUCB selects arms according to a UCB strategy, since no prior information about the models is available. On the other hand, as more tasks are observed, tUCB is able to transfer the knowledge acquired through episodes and build an increasingly accurate estimate of the models, thus approaching the behavior of mUCB. This is also confirmed by Fig. 6 where we show how the complexity of tUCB changes through episodes. In both cases (regret and complexity) we see that tUCB does not reach the same performance as mUCB. This is due to the fact that some models have relatively small gaps and thus the number of episodes needed to obtain an estimate of the models accurate enough to reach the performance of mUCB is much larger than 5000 (see also the Remarks of Thm. 3). Since the final objective is to achieve a small global regret (Eq. 1), in Fig. 8 we report the cumulative regret averaged over the total number of tasks ($J$) for different values of $J$ and $n$. Again, this graph shows that tUCB outperforms UCB and that it tends to approach the performance of mUCB as $J$ increases, for any value of $n$.

## 6 Conclusions and Open Questions

In this paper we introduce the transfer problem in the multi-armed bandit framework when the tasks are drawn from a finite set of bandit problems. We first introduce the bandit algorithm mUCB and show that it is able to fully exploit the prior knowledge on the set of bandit problems and reduce the regret w.r.t. UCB. When the set of models is unknown, we define a method-of-moments variant (RTP) which consistently estimates the means of the models in $\Theta$ from the samples collected through episodes. This knowledge is then transferred to umUCB, which never performs worse than UCB and tends to approach the performance of mUCB. For these algorithms we derive regret and sample complexity bounds, and we show preliminary numerical simulations. To the best of our knowledge, this is the first work studying the problem of transfer in multi-armed bandit and it opens a series of interesting questions.

Optimality of mUCB. In some cases, mUCB may miss the opportunity to explore arms that could be useful in discarding models. For instance, an arm may correspond to very large gaps, and a few pulls to it, although leading to large regret, may be enough to discard many models, thus guaranteeing a very small regret in the following steps. This observation raises the question of whether the optimistic approach in this case still guarantees an optimal tradeoff between exploration and exploitation. Since the focus of this paper is on transfer and mUCB is already guaranteed to perform better than UCB, we leave this question for future work.

Optimality of tUCB. At each episode, tUCB transfers the knowledge about $\Theta$ acquired from previous tasks to achieve a small per-episode regret using umUCB. Although this strategy guarantees that the per-episode regret of tUCB is never worse than that of UCB, it may not be the optimal strategy in terms of the cumulative regret through episodes. In fact, if $J$ is large, it could be preferable to run a model identification algorithm instead of umUCB in earlier episodes so as to improve the quality of the estimates $\hat\mu^j(\theta)$. Although such an algorithm would incur a much larger regret in earlier tasks (up to linear), it could approach the performance of mUCB in later episodes much faster than tUCB does. This trade-off between identification of the models and transfer of knowledge resembles the exploration-exploitation trade-off in the single-task problem and it may suggest that algorithms different from tUCB are possible.

## References

• Agarwal et al. (2012) Agarwal, A., Dudík, M., Kale, S., Langford, J., and Schapire, R. E. (2012). Contextual bandit learning with predictable rewards. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS’12).
• Anandkumar et al. (2012a) Anandkumar, A., Foster, D. P., Hsu, D., Kakade, S., and Liu, Y.-K. (2012a). A spectral algorithm for latent dirichlet allocation. In Proceedings of Advances in Neural Information Processing Systems 25 (NIPS’12), pages 926–934.
• Anandkumar et al. (2012b) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. (2012b). Tensor decompositions for learning latent variable models. CoRR, abs/1210.7559.
• Anandkumar et al. (2012c) Anandkumar, A., Hsu, D., and Kakade, S. M. (2012c). A method of moments for mixture models and hidden Markov models. In Proceedings of the 25th Annual Conference on Learning Theory (COLT’12), volume 23, pages 33.1–33.34.
• Audibert et al. (2012) Audibert, J.-Y., Bubeck, S., and Munos, R. (2012). Optimization for Machine Learning, chapter Bandit View on Noisy Optimization. MIT Press.
• Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235–256.
• Cavallanti et al. (2010) Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. (2010). Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2901–2934.
• Cesa-Bianchi and Lugosi (2006) Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.
• Dekel et al. (2006) Dekel, O., Long, P. M., and Singer, Y. (2006). Online multitask learning. In Proceedings of the 19th Annual Conference on Learning Theory (COLT’06), pages 453–467.
• Garivier and Moulines (2011) Garivier, A. and Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. In Proceedings of the 22nd international conference on Algorithmic learning theory, ALT’11, pages 174–188, Berlin, Heidelberg. Springer-Verlag.
• Langford and Zhang (2007) Langford, J. and Zhang, T. (2007). The epoch-greedy algorithm for multi-armed bandits with side information. In Proceedings of Advances in Neural Information Processing Systems 20 (NIPS’07).
• Lazaric (2011) Lazaric, A. (2011). Transfer in reinforcement learning: a framework and a survey. In Wiering, M. and van Otterlo, M., editors, Reinforcement Learning: State of the Art. Springer.
• Lugosi et al. (2009) Lugosi, G., Papaspiliopoulos, O., and Stoltz, G. (2009). Online multi-task learning with hard constraints. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT’09).
• Mann and Choe (2012) Mann, T. A. and Choe, Y. (2012). Directed exploration in reinforcement learning with transferred knowledge. In Proceedings of the Tenth European Workshop on Reinforcement Learning (EWRL’12).
• Pan and Yang (2010) Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.
• Robbins (1952) Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the AMS, 58:527–535.
• Saha et al. (2011) Saha, A., Rai, P., Daumé III, H., and Venkatasubramanian, S. (2011). Online learning of multiple tasks and their relationships. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS’11), Ft. Lauderdale, Florida.
• Stewart and Sun (1990) Stewart, G. W. and Sun, J.-g. (1990). Matrix Perturbation Theory. Academic Press.
• Wedin (1972) Wedin, P. (1972). Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111.

## Appendix A Table of Notation

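| Symbol | Explanation |
| --- | --- |
| $\mathcal{A}$ | Set of arms |
| $\Theta$ | Set of models |
| $K$ | Number of arms |
| $m$ | Number of models |
| $J$ | Number of episodes |
| $n$ | Number of steps per episode |
| $t$ | Time step |
| $\bar\theta$ | Current model |
| $\Theta_t$ | Active set of models at time $t$ |
| $\nu_i$ | Distribution of arm $i$ |
| $\mu_i(\theta)$ | Mean of arm $i$ for model $\theta$ |
| $\mu(\theta)$ | Vector of means of model $\theta$ |
| $\hat\mu_{i,t}$ | Estimate of $\mu_i(\bar\theta)$ at time $t$ |
| $\hat\mu^j_i(\theta)$ | Estimate of $\mu_i(\theta)$ by RTP for model $\theta$ and arm $i$ at episode $j$ |
| $\hat\mu^j(\theta)$ | Estimate of $\mu(\theta)$ by RTP for model $\theta$ at episode $j$ |
| $\Theta^j$ | Estimated model of RTP after episode $j$ |
| $\varepsilon^j$ | Uncertainty of the estimated model by RTP at episode $j$ |
| $\varepsilon_{i,t}$ | Model uncertainty at time $t$ |
| $\delta$ | Probability of failure |
| $i_*(\theta)$ | Best arm of model $\theta$ |
| $\mu_*(\theta)$ | Optimal value of model $\theta$ |
| $\Delta_i(\theta)$ | Arm gap of an arm $i$ for a model $\theta$ |
| $\Gamma_i(\theta, \theta')$ | Model gap for an arm $i$ between two models $\theta$ and $\theta'$ |
| $M_2$ | $2^{\text{nd}}$-order moment |
| $M_3$ | $3^{\text{rd}}$-order moment |
| $\hat M_2$ | Empirical $2^{\text{nd}}$-order moment |
| $\hat M_3$ | Empirical $3^{\text{rd}}$-order moment |
| $\lVert\cdot\rVert$ | Euclidean norm |
| $\lVert\cdot\rVert_F$ | Frobenius norm |
| $\lVert\cdot\rVert_{\max}$ | Matrix max-norm |
| $R_n$ | Pseudo-regret |
| $T^j_{i,n}$ | The number of pulls to arm $i$ after $n$ steps of episode $j$ |
| $\mathcal{A}_*(\Theta')$ | Set of arms which are optimal for at least one model in a set $\Theta'$ |
| $\Theta(\mathcal{A}')$ | Set of models for which the arms in $\mathcal{A}'$ are optimal |
| $\Theta_+$ | Set of optimistic models for a given model $\bar\theta$ |
| $\mathcal{A}_+$ | Set of optimal arms corresponding to $\Theta_+$ |
| $W$ | Whitening matrix of $M_2$ |
| $\hat W$ | Empirical whitening matrix |
| $T$ | $M_3$ under the linear transformation $W$ |
| $\hat T$ | $\hat M_3$ under the linear transformation $\hat W$ |
| $D$ | Diagonal matrix consisting of the $m$ largest eigenvalues of $M_2$ |
| $\hat D$ | Diagonal matrix consisting of the $m$ largest eigenvalues of $\hat M_2$ |
| $U$ | $K \times m$ matrix with the corresponding eigenvectors of $M_2$ as its columns |
| $\hat U$ | $K \times m$ matrix with the corresponding eigenvectors of $\hat M_2$ as its columns |
| $\lambda(\theta)$ | Eigenvalue of $T$ associated with $\theta$ |
| $v(\theta)$ | Eigenvector of $T$ associated with $\theta$ |
| $\hat\lambda(\theta)$ | Eigenvalue of $\hat T$ associated with $\theta$ |
| $\hat v(\theta)$ | Eigenvector of $\hat T$ associated with $\theta$ |
| $\Sigma_{M_2}$ | Set of $m$ largest eigenvalues of the matrix $M_2$ |
| $\sigma_{\min}$ | Minimum eigenvalue of $M_2$ among the $m$-largest |
| $\sigma_{\max}$ | Maximum eigenvalue of $M_2$ |
| $\lambda_{\max}$ | Maximum eigenvalue of $T$ |
| $\Gamma_\sigma$ | Minimum gap between the eigenvalues of $M_2$ |
| $\pi(\theta)$ | Permutation on $\Theta$ |
| $\mathcal{A}^j_*(\theta)$ | Set of non-dominated arms for model $\theta$ at episode $j$ |
| $\tilde\Theta^j$ | Set of models that cannot be discarded at episode $j$ |
| $\Theta^j_{i,+}$ | Set of models for which $i$ is among the optimistic non-dominated arms at episode $j$ |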

## Appendix B Proofs of Section 3

###### Lemma 1.

mUCB never pulls arms which are not optimal for at least one model, that is, $T_{i,n} = 0$ for all $i \notin \mathcal{A}_*(\Theta)$ with probability 1. Notice also that $|\mathcal{A}_*(\Theta)| \le |\Theta|$.

###### Lemma 2.

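The actual model is never discarded with high probability. Formally, the event $\mathcal{E} = \{\forall t, \bar\theta \in \Theta_t\}$ holds with probability $1 - \delta$ if

$$\varepsilon_{i,t} = \sqrt{\frac{1}{2 T_{i,t-1}} \log\Big(\frac{mn^2}{\delta}\Big)},$$

where $T_{i,t-1}$ is the number of pulls to arm $i$ at the beginning of step $t$ and $m = |\Theta|$.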

In the previous lemma we implicitly assumed that . In general, the best choice in the definition of has a logarithmic factor with .

###### Lemma 3.

On event $\mathcal{E}$, all the arms $i \notin \mathcal{A}_+$, i.e., arms which are not optimal for any of the optimistic models, are never pulled, i.e., $T_{i,n} = 0$ with probability $1 - \delta$.

The previous lemma suggests that mUCB tends to discard all the models in $\Theta_+$ from the most optimistic down to the actual model $\bar\theta$ which, on event $\mathcal{E}$, is never discarded. As a result, even if other models are still in $\Theta_t$, the optimal arm of $\bar\theta$ is pulled until the end. Finally, we show that the model gaps of interest (see Thm. 1) are always bigger than the arm gaps.

###### Lemma 4.

For any model $\theta \in \Theta_{+,i}$, $\Gamma_i(\theta, \bar\theta) \ge \Delta_i(\bar\theta)$.
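The inequality of Lem. 4 can be checked in one line. The following is a sketch of the argument under the assumption that $\theta$ ranges over the optimistic models whose optimal arm is $i$, so that $\mu_i(\theta) = \mu_*(\theta)$ and $\mu_*(\theta) \ge \mu_*(\bar\theta)$; it is not meant to replace the proof given in the supplementary material.

```latex
\begin{align*}
\Gamma_i(\theta, \bar\theta)
  = |\mu_i(\theta) - \mu_i(\bar\theta)|
  \ge \mu_i(\theta) - \mu_i(\bar\theta)
  = \mu_*(\theta) - \mu_i(\bar\theta)
  \ge \mu_*(\bar\theta) - \mu_i(\bar\theta)
  = \Delta_i(\bar\theta).
\end{align*}
```

The first equality in the chain is the definition of the model gap, the second uses that $i$ is optimal for $\theta$, and the last inequality uses that $\theta$ is optimistic.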

###### Proof of Lem. 1.

From the definition of the algorithm we notice that the arm $I_t$ pulled at step $t$ can only correspond to the optimal arm of one model in the set $\Theta_t$. Since $\Theta_t$ can at most contain all the models in $\Theta$, all the arms which are not optimal for any model are never pulled. ∎

###### Proof of Lem. 2.

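We compute the probability of the complementary event $\mathcal{E}^C$, that is, the event on which there exists at least one step $t$ where the true model $\bar\theta$ is not in $\Theta_t$. By definition of $\Theta_t$, we have that

$$\mathcal{E} = \{\forall t, \bar\theta \in \Theta_t\} = \{\forall t, \forall i \in \mathcal{A}, |\mu_i - \hat\mu_{i,t}| \le \varepsilon_{i,t}\},$$

then

$$\mathbb{P}[\mathcal{E}^C] = \mathbb{P}[\exists t, i, |\mu_i - \hat\mu_{i,t}| \ge \varepsilon_{i,t}] \le \sum_{t=1}^{n} \sum_{i \in \mathcal{A}} \mathbb{P}[|\mu_i - \hat\mu_{i,t}| \ge \varepsilon_{i,t}] = \sum_{t=1}^{n} \sum_{i \in \mathcal{A}_*(\Theta)} \mathbb{P}[|\mu_i - \hat\mu_{i,t}| \ge \varepsilon_{i,t}],$$

where the upper bound is a simple union bound and the last step comes from the fact that the probability for the arms which are never pulled is always 0 according to Lem. 1. At time $t$, $\hat\mu_{i,t}$ is the empirical average of the samples observed from arm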