Theoretically-Grounded Policy Advice from Multiple Teachers in Reinforcement Learning Settings with Applications to Negative Transfer
Abstract
Policy advice is a transfer learning method in which a student agent learns faster via advice from a teacher. However, this and other reinforcement learning transfer methods have received little theoretical analysis. This paper formally defines a setting where multiple teacher agents can provide advice to a student and introduces an algorithm that leverages both autonomous exploration and the teachers' advice. Our regret bounds justify the intuition that good teachers help while bad teachers hurt. Using our formalization, we are also able to quantify, for the first time, when negative transfer can occur within such a reinforcement learning setting.
Yusen Zhan, Haitham Bou Ammar, and Matthew E. Taylor
Washington State University, Pullman, Washington; Princeton University, Princeton, New Jersey
yusen.zhan@wsu.edu, hammar@princeton.edu, taylorm@eecs.wsu.edu
1 Introduction
Reinforcement Learning (RL) has become a popular framework for autonomous behavior generation from limited feedback [?]. Typical RL methods learn in isolation, increasing their learning times and sample complexities. Transfer learning aims to significantly improve learning by providing informative knowledge from an external source. The source of such knowledge varies from source agents to humans providing advice [?; ?]. In this paper, we focus on a framework referred to as action advice or the advice model [?]. Here, the agent (i.e., the student), learning in a task, has access to a teacher (another agent or a human) that can provide action suggestions to facilitate learning. Given "good-enough" teachers, such advice models have shown multiple benefits over standard RL techniques. For example, others [?; ?] show reduced learning times and sample complexities for successful behavior.
These methods, however, suffer from two main drawbacks. First, validation results are empirical in nature and not formally grounded: we lack a fundamental understanding of these methods, making it difficult to formally comprehend why they work. Second, most of these techniques require the availability of a "good-enough" (optimal) teacher to benefit the student. Unfortunately, access to such teachers is difficult in a variety of complex domains, reducing the applicability of policy advice in real-world settings.
In this paper, we remedy the aforementioned drawbacks by proposing a new framework for policy advice. Our method formally generalizes current single-teacher advice models to the multi-teacher setting. Our algorithm also remedies the need for optimal teachers by exploiting both the student's and the teachers' knowledge. Even if the teacher is not optimal, a student using our algorithm is still capable of acquiring optimal behavior in a task; a property not supported by some state-of-the-art methods, e.g., learning from demonstration. We theoretically and empirically analyze the performance of the proposed method and derive, for the first time, regret bounds quantifying the successfulness of action advice. We also provide theoretical justification for current methods (i.e., single-teacher models) as a special case of our formulation in the appendix. Our contributions can be summarized as:

- defining (formally) multi-teacher advice models,
- introducing novel algorithms leveraging teacher and student knowledge,
- deriving a regret analysis showing reduced sample complexities,
- deriving theoretical guarantees for single-teacher advice models, and
- quantifying negative transfer under such advice models.
Interestingly, these theoretical results justify a well-known intuition inherent to advice models: "good teachers help while bad teachers hurt." The results show that students can still achieve optimal behavior when being advised by bad teachers. They do, however, pay an extra cost in terms of learning time or sample complexity, relative to an optimal teacher. This should inspire researchers to adopt high-quality teacher policies or to avoid "bad teachers" when possible.
Given our formalization, we also derive a relation to negative transfer. We quantify, for the first time, the occurrence of negative transfer in action advice models, shedding light on the failure modes of these methods. Consequently, these results yield two claims about transfer learning. First, high-quality transfer knowledge may still cause negative transfer when the target algorithm is able to outperform the source knowledge. Second, expert knowledge is important for researchers when deciding whether or not to transfer, because evaluating the transfer knowledge is usually expensive (it is equivalent to evaluating the teacher policy in the target MDP).
2 Preliminaries
2.1 Online Reinforcement Learning & Regret Model
In RL, an agent must sequentially select actions to maximize its total expected return. Such problems are formalized as a Markov Decision Process (MDP), defined as the tuple $\langle \mathcal{S}, \mathcal{A}, P, r \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ denote the finite state and action spaces with total sizes $S$ and $A$, respectively, $P(s' \mid s, a)$ represents the probability transition kernel describing the task dynamics, and $r(s, a)$ is the reward function quantifying the performance of the agent. The total expected return of an agent following an algorithm $\mathfrak{A}$, used to compute the action-selection rule, from a starting state $s$ after $T$ time steps is defined as:

$$R(s, \mathfrak{A}, T) = \mathbb{E}\left[\sum_{t=1}^{T} r(s_t, a_t)\right], \qquad (1)$$

with $s_1 = s$ and $a_t = \mathfrak{A}(s_t)$. The goal is to determine an optimal policy $\pi^{\star}$ that maximizes the total expected return.
Regret Model: Similar to standard online learning, we quantify the performance of the algorithm $\mathfrak{A}$ by measuring its regret with respect to optimality. We define the regret of a state $s$ after $T$ time steps in terms of the expected reward as:

$$\Delta(s, T) = T\rho^{\star} - R(s, \mathfrak{A}, T), \qquad (2)$$

where $\rho^{\star}$ is the optimal average reward acquired by following an optimal algorithm at each time step. In the general case, when no reachability assumptions are imposed, it is easy to construct MDPs in which algorithms suffer high regret. Following ? [?], we remedy this problem by considering weakly-communicating MDPs^{1} (^{1}Please note that weakly-communicating MDPs are considered the most general among the subclasses of MDPs; see ? [?].) defined as follows.
Definition 1.
An MDP is called weakly communicating if its state set can be decomposed into two subsets, $\mathcal{S}_1$ and $\mathcal{S}_2$: every state in $\mathcal{S}_1$ is reachable from every other state in $\mathcal{S}_1$ under some deterministic policy, while the states in $\mathcal{S}_2$ are transient under all policies.
The optimal gain $\rho^{\star}$ in Equation 2 is state independent. That is, any $s \in \mathcal{S}$ shares the same optimal expected reward [?], which can be solved for using:

$$\rho^{\star}\mathbf{1} + h = \max_{a \in \mathcal{A}} \left\{ r(\cdot, a) + P(\cdot \mid \cdot, a)^{\top} h \right\},$$

where $h$ is an $S$-dimensional vector typically referred to as the bias vector, $P(\cdot \mid s, a)$ denotes the probability of transitioning from $s$ when applying action $a$, and $\mathbf{1}$ is an $S$-dimensional unit vector. When needed, we explicitly write the dependency of $\rho$ and $h$ on the MDP $M$ as $\rho(M)$ and $h(M)$. We also define the span of $h$ as:

$$sp(h) = \max_{s \in \mathcal{S}} h(s) - \min_{s \in \mathcal{S}} h(s).$$
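To make the gain, bias, and span concrete, both can be approximated numerically by relative value iteration on a small example. The following is a minimal sketch; the two-state MDP, its numbers, and the function names are illustrative assumptions, not taken from the paper:

```python
# Tiny weakly communicating MDP: 2 states, 1 action.
# P[s][t] is the transition kernel, r[s] the expected reward.
P = [[0.5, 0.5],
     [0.5, 0.5]]
r = [1.0, 0.0]

def gain_and_bias(P, r, iters=500):
    """Estimate the optimal gain rho* and the bias vector h via
    relative value iteration (single-action case for simplicity)."""
    S = len(r)
    V = [0.0] * S
    rho = 0.0
    for _ in range(iters):
        V_next = [r[s] + sum(P[s][t] * V[t] for t in range(S)) for s in range(S)]
        rho = V_next[0] - V[0]                # gain estimate
        V = [v - V_next[0] for v in V_next]   # renormalize to keep V bounded
    m = min(V)
    h = [v - m for v in V]                    # bias, shifted so its minimum is 0
    return rho, h

rho, h = gain_and_bias(P, r)
span = max(h) - min(h)                        # sp(h)
```

For this chain the stationary distribution is uniform, so the gain converges to 0.5 and the bias span to 1.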
Finally, we follow ? [?] in defining reachability in weakly communicating MDPs through the one-way diameter $D_{ow}$, with $T(s \to s')$ being the expected number of steps needed for reaching $s'$ from $s$.
2.2 Algorithms for Weakly-Communicating MDPs
REGAL.C is an online algorithm for weakly communicating MDPs developed by ? [?]. The basic idea is that REGAL.C estimates the true MDP with high probability in order to learn an optimal policy. Let $N_t(s, a, s')$ be the number of state–action–state triples that have been visited by time $t$, and let $t_i$ denote the initial time of iteration $i$. For brevity, we use $N_i(s, a, s')$ and $N_i(s, a)$ to denote the corresponding counts at iteration $i$. We also use $v_i(s, a)$ to denote the number of times a state–action pair is visited during iteration $i$. At each iteration $i$, REGAL.C takes a dataset as input and updates the empirical transition probabilities (see Equation 3). It then constructs a set of plausible MDPs to select from, subject to the constraint that the span $sp(h)$ does not exceed an upper bound $H$. Given the selected MDP, REGAL.C uses value iteration to acquire the optimal policy. These steps are summarized in Algorithm 1.
Input: span parameter $H$, dataset, and current time $t_i$
Output: policy $\pi_i$

$$\hat{P}(s' \mid s, a) = \frac{N_i(s, a, s')}{\max\{1, N_i(s, a)\}} \qquad (3)$$
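The empirical transition estimate in Equation 3 is simply a visit-count ratio with a guard against division by zero. A minimal sketch (the dictionary-based class and its names are our own illustration, not the paper's implementation):

```python
from collections import defaultdict

class TransitionEstimator:
    """Maximum-likelihood estimate of P(s'|s,a) from visit counts, in the
    style of the model rebuilt by REGAL.C at each iteration."""
    def __init__(self):
        self.n_sa = defaultdict(int)    # N(s, a)
        self.n_sas = defaultdict(int)   # N(s, a, s')

    def observe(self, s, a, s_next):
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1

    def p_hat(self, s, a, s_next):
        # max(1, N(s,a)) avoids division by zero for unvisited pairs
        return self.n_sas[(s, a, s_next)] / max(1, self.n_sa[(s, a)])

est = TransitionEstimator()
for s_next in ["s1", "s1", "s2"]:
    est.observe("s0", "up", s_next)
```

After three observations from ("s0", "up"), the estimate for "s1" is 2/3, and unvisited pairs report probability 0.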
2.3 Single Teacher Advice Model
The single teacher advice model is a framework in which a student learning in an environment benefits from a teacher's advice to speed up learning. We define such a framework as a tuple consisting of the teacher's policy, the budget constraining the teacher's advice, the student, and a function controlling the advice from the teacher to the student. Apart from considering only single-teacher models, previous work assumed optimal teachers whose recommended actions the student always executes. It is easy to construct complex settings in which access to optimal teachers is difficult. Consequently, we extend these works to the more realistic setting of suboptimal teachers, as detailed later.
3 Multiple Teacher Advice Model
In this section we start by extending the single teacher model of ? [?] to the multiple, non-optimal teacher setting. Our advice model for $N$ teachers is defined as a tuple containing the set of teacher policies $\{\pi_1, \dots, \pi_N\}$ and the corresponding set of budgets $\{B_1, \dots, B_N\}$. It is easy to see that when $N = 1$, we recover the single teacher model as a special case. We also generalize the work of ? [?] by making no restrictive assumptions on the optimality of any of the teachers. We measure the performance of a teacher with respect to a base policy in terms of regret:
Definition 2.
Given a teacher's policy $\pi$ and a base policy $\pi_b$, the regret of following $\pi$ is related to that acquired by following $\pi_b$ via

$$\Delta_{\pi}(s, T) = \alpha \, \Delta_{\pi_b}(s, T),$$

where $\alpha \geq 0$ denotes the regret ratio.
The above definition captures three interesting cases quantifying the performance of an advice-based algorithm. If the teacher is optimal, its regret is zero and $\alpha$ is also $0$. If $\alpha \leq 1$, then $\Delta_{\pi}(s, T) \leq \Delta_{\pi_b}(s, T)$, indicating that the teacher's policy is at least as good as the base policy $\pi_b$. Finally, when $\alpha > 1$, the teacher underperforms the base policy. Consequently, with the correct choice of teacher (i.e., one with small $\alpha$), one can still achieve successful advice even in such a generalized setting.
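The three regimes of the regret ratio can be summarized as a simple classification; a trivial sketch (the function name and return strings are ours, the thresholds follow Definition 2):

```python
def teacher_quality(alpha: float) -> str:
    """Classify a teacher by its regret ratio alpha relative to a base policy."""
    if alpha < 0:
        raise ValueError("regret ratio must be non-negative")
    if alpha == 0:
        return "optimal"                 # teacher incurs no regret
    if alpha <= 1:
        return "at least as good as the base policy"
    return "worse than the base policy"  # advice is expected to hurt
```

For instance, a teacher with `alpha = 0.5` halves the base policy's regret, while `alpha = 2.0` doubles it.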
4 Efficient Multi-teacher Advice
In this section, we propose a new algorithm that combines the advice policy with the MDP information collected so far. This allows for an accurate framework outperforming state-of-the-art techniques for policy advice. At a high level, our algorithm consists of three main steps. First, a combined policy is constructed from multiple teachers. Second, data is collected based on both the teachers' advice and MDP information. Third, a new policy is computed online.

Next, we outline each of the three steps and describe our novel algorithm. Having achieved an accurate advice model, we then rigorously analyze the theoretical aspects of our method and show a decrease in sample complexity compared to current techniques.
4.1 The Grand-Teacher
Our method of policy advice constructs a grand-teacher combining all teacher policies into a meta-policy. To construct the grand-teacher, we use an ensemble method and design two meta-policy variations: online and offline constructions. Next, we detail each of the two variations.
Online Grand-Teacher: In the online construction, whenever the student observes an unvisited state $s$, each teacher $j \in \{1, \dots, N\}$ provides its policy advice $\pi_j(s)$, with $N$ being the total number of teachers. The student then selects and stores the majority action over all teachers for that state. As far as budget is concerned, it is easy to see that we only require advice once per state, so each teacher's budget equals the number of states. Though easy to implement and test, the online construction suffers from the potentially unrealistic need for the continuous availability of online teachers.
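The online construction amounts to a cached majority vote over the teachers' suggestions. A minimal sketch (function names are ours; the paper does not specify tie-breaking, so ties here fall to the first-suggested action):

```python
from collections import Counter

def grand_teacher_action(advice: list) -> str:
    """Majority vote over the actions suggested by the N teachers
    for a newly visited state."""
    return Counter(advice).most_common(1)[0][0]

meta_policy = {}   # state -> stored majority action

def advise(state, teacher_actions):
    """Query the teachers only on the first visit to a state; afterwards,
    return the stored majority action, consuming no further budget."""
    if state not in meta_policy:
        meta_policy[state] = grand_teacher_action(teacher_actions)
    return meta_policy[state]
```

A second visit to the same state reuses the stored action even if the teachers would now answer differently, which is what keeps the per-teacher budget at one query per state.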
Offline Grand-Teacher: To eliminate the need for an online teacher at each visit to a new state, the offline procedure traverses the states of the MDP to construct the meta-advice policy. The main steps of this construction are summarized in Algorithm 2.
Note that Algorithm 2 is capable of constructing an offline meta-teacher but requires extra exploration of the MDP. We next show that $S \log(S/\delta)$ steps are enough to explore each state in the MDP with high probability:
Theorem 1 (Sample Complexity).
If Algorithm 2 independently and uniformly explores each state $s \in \mathcal{S}$, then with probability at least $1 - \delta$, $S \log(S/\delta)$ steps are sufficient to visit each state at least once.
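Theorem 1 is a coupon-collector argument: under independent uniform exploration, the chance that a fixed state is missed in $m$ steps is $(1 - 1/S)^m \leq e^{-m/S}$, so $m = S\log(S/\delta)$ drives the union bound over all $S$ states below $\delta$. A sketch checking this arithmetic ($S$ and $\delta$ are illustrative values, not from the paper):

```python
import math

def steps_to_cover(S: int, delta: float) -> int:
    """Steps sufficient to visit all S states at least once with probability
    >= 1 - delta, under independent uniform exploration (union bound)."""
    return math.ceil(S * math.log(S / delta))

S, delta = 100, 0.05
m = steps_to_cover(S, delta)
# Union bound on the failure probability: S * (1 - 1/S)**m <= S * e^{-m/S}
failure_bound = S * math.exp(-m / S)
```

For `S = 100` and `delta = 0.05` this gives `m = 761` steps, and the union bound on the failure probability indeed falls below `delta`.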
4.2 Multi-Teacher Advice Algorithm
To improve current methods and arrive at a more realistic advice framework, we now introduce our algorithm, which combines the grand-teacher's policy with the information attained by the student from the MDP.
Our algorithm is based on the following intuition. At the beginning of the learning process, a student requires guidance, as it typically has little to no information about the task to be solved. As time progresses and the student explores, the MDP can be effectively exploited for successful learning. Unfortunately, such a process is not well modeled by current methods. Here, we remedy this problem by introducing an algorithm that follows the teacher's advice at the very beginning and then switches to a policy computed by an algorithm operating within the MDP. That is, the teacher guides the student at the beginning of the learning process and, as the student gathers more experience, the teacher's influence diminishes over time through a switch to a policy computed by REGAL.C. The overall procedure is summarized in Algorithm 3. Note that our algorithm is inspired by DAGGER [?] in the sense that policies are updated by collecting data using a mixture of action-selection rules (i.e., student and teacher policies). Contrary to DAGGER, however, our method collects all trajectories, as opposed to only collecting inconsistent actions, allowing for more accurate and efficient updates.
To leverage both the teacher's and the learned policies, we set a mixed policy of the form $\beta_i \pi_{\mathcal{T}} + (1 - \beta_i)\pi_i$, with $\beta_i \in [0, 1]$, to guide the student's dataset collection while allowing the teacher to fractionally control the exploration needed to collect data at the next iteration. $\beta_i$ should typically be set to decay exponentially over time. This decreases the student's reliance on the teacher and allows it to exploit the knowledge gathered from the MDP to learn better-behaving policies than the teacher's. It is for this reason that our algorithm, contrary to other methods, does not impose any optimality restrictions on the teacher. Having collected the dataset, Algorithm 3 uses REGAL.C (Algorithm 1) to update the policy.
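The mixed policy can be sketched as a coin flip per step whose teacher-bias decays across iterations. A hedged sketch (function names, the geometric schedule, and the dictionary policies are illustrative assumptions):

```python
import random

def mixed_action(state, teacher_policy, student_policy, beta, rng):
    """Follow the grand-teacher with probability beta, otherwise the
    student's current REGAL.C policy."""
    if rng.random() < beta:
        return teacher_policy[state]
    return student_policy[state]

def beta_schedule(beta0: float, decay: float, iteration: int) -> float:
    """Exponentially decaying mixing weight, e.g. beta_i = beta0 * decay**i,
    so the teacher's influence vanishes as the student gains experience."""
    return beta0 * decay ** iteration
```

With `beta0 = 1.0` the student starts fully guided by the teacher; after a few iterations of geometric decay, almost all actions come from the student's own policy, which is what allows it to surpass a suboptimal teacher.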
4.3 Theoretical Guarantees
In this section we formally derive the regret exhibited by our algorithm. At a high level, we provide two theoretical results. In the first, we consider the general teacher case, while in the second we derive a corollary of the regret for optimal teachers. We show, for the first time, better-than-constant improvements compared to standard learners.
Theorem 2.
Assume Algorithm 3 is run for a total of $T$ steps in a weakly communicating MDP starting from an initial state $s$. Let $\delta$ be a parameter such that $0 < \delta < 1$. Then, with probability at least $1 - \delta$, the total regret is bounded in terms of $\alpha'$, the ratio between the teacher's regret and the regret exhibited by REGAL.C.
Proof.
Due to space limits, we provide a proof sketch. The proof is based on the regret bound of REGAL.C. We introduce the regret ratio to relate the grand-teacher's regret to REGAL.C's regret, and then apply Hoeffding's inequality to arrive at the statement of the theorem. ∎
Theorem 2 implies that the teacher improves learning as long as it is "good." Namely, if $\alpha' < 1$, the student enjoys only a fraction of REGAL.C's regret. However, if $\alpha' > 1$, the student suffers more regret than the original REGAL.C algorithm. This justifies our intuition that good teachers assist learning while poor ones hamper it. Moreover, if there is prior knowledge that a teacher performs poorly, the student would be better off neglecting its advice, as it would otherwise suffer extra regret.
If the teacher is optimal, we have the following corollary:
Corollary 1.
If the teacher is optimal, then with probability at least $1 - \delta$ the total regret is bounded as follows:
Remark 1.
Please note that the above theoretical results constitute more than a constant improvement to the regret. The improvement depends on the number of iterations, which can be bounded in terms of the sizes $S$ and $A$ of the input MDP [?]. Further, it depends on the teacher's policy, which is also an input to Algorithm 3. Consequently, it can be shown that these regret improvements exceed simple constant factors.
5 Negative Transfer
To formalize the relation to negative transfer, we recognize that the regret ratio can be written as:

$$\alpha = \frac{\Delta_{\pi}(s, T)}{\Delta_{\pi_b}(s, T)}. \qquad (4)$$

This suggests that we can estimate the ratio by calculating $\Delta_{\pi}(s, T)$ and $\Delta_{\pi_b}(s, T)$ for a given policy $\pi$. We therefore use $\hat{\alpha}_T$ to denote the estimated regret ratio between policies $\pi$ and $\pi_b$ up to step $T$. At this stage, we define:

- Negative transfer from policy $\pi$ to $\pi_b$ up to step $T$: $\hat{\alpha}_T > 1$.
- Positive transfer from policy $\pi$ to $\pi_b$ up to step $T$: $\hat{\alpha}_T \leq 1$.
To formalize negative transfer, our goal at this stage is to relate $\hat{\alpha}_T$ to a metric between the source and target tasks. For that sake, we define a distance $d$ between the agent's estimates of the rewards in the source and the target tasks after $T$ steps. An estimate of $\hat{\alpha}_T$ can then be derived in terms of $d$, and each reward estimate can be bounded by the empirical Bernstein bound [?]: with probability at least $1 - \delta$, the estimation error is controlled by the standard deviation of the samples, which yields a bound of the form

(5)

where $c_1$ and $c_2$ are constants. Consequently, assuming enough samples, negative transfer occurs if the condition

(6)

on $d$ holds.
The condition sheds light on negative transfer in terms of a metric between tasks and provides a formal way to determine when it occurs. First, if the condition in Eq. 6 holds after evaluation, researchers should avoid transferring the source policy to the target task, since it may cause negative transfer. Second, if researchers have enough expert knowledge about their working domain and the transferred information, they can usually avoid this evaluation phase in practice. In short, Eq. 6 provides a formal way to understand negative transfer and justifies the intuition, common in transfer practice, to adopt high-quality source knowledge and avoid bad teachers.
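In the large-sample limit, the negative-transfer test reduces to estimating the regret ratio from observed returns and comparing it to 1. A minimal sketch (the function names, the zero-regret guard, and the omission of the empirical-Bernstein correction terms are our simplifying assumptions):

```python
def regret(rewards, rho_star: float) -> float:
    """Empirical regret over T steps: T * rho* minus the collected reward."""
    return len(rewards) * rho_star - sum(rewards)

def is_negative_transfer(teacher_rewards, base_rewards, rho_star: float) -> bool:
    """Large-sample check: transfer is flagged negative when the estimated
    regret ratio alpha_hat = Delta_pi / Delta_pi_b exceeds 1."""
    d_teacher = regret(teacher_rewards, rho_star)
    d_base = regret(base_rewards, rho_star)
    if d_base <= 0:            # base policy already matches optimal gain
        return d_teacher > 0
    return d_teacher / d_base > 1.0
```

For example, with optimal gain 1.0, a teacher averaging reward 0.2 per step against a base policy averaging 0.5 is flagged as negative transfer, while a teacher averaging 0.9 is not.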
6 Experimental Results
Given the above theoretical results, this section provides empirical validation in three domains:
Combination Lock: We use the domain described in Figure 2, which is a variation of the one in [?]. The experimental setting follows the caption description.
Grid World is an RL benchmark in which an agent navigates a grid world to reach a goal state. We employ a grid world with a four-room layout, as introduced in ? [?]. The agent begins in the lower-left corner of the map and navigates to the goal state in the upper-right corner. To navigate, the agent has access (in each cell) to four actions, transitioning it north, south, west, or east. Applying an action, the agent transitions in the intended direction with probability 0.8 and in one of the other three directions with total probability 0.2. If the chosen direction is blocked, the agent stays in the same state. Finally, the agent receives a goal reward upon reaching the goal state and a step reward in all other states.
Block Dude is a game in which an agent again navigates a maze to reach a goal state. Reaching the goal directly is impossible due to blocks restricting movement. The agent, however, can move left, right, and upwards. To reach the goal state, it needs to pick up blocks and relocate them to the correct positions. We use the default level 1 in BURLAP [?], in which there are two blocks in the maze. The agent receives a goal reward in the goal state and a step reward in all other states.
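The grid-world transition model above can be sketched as a distribution over next cells. A hedged sketch (the even 0.2/3 split over the non-intended directions is our reading of the noise model, and all names are illustrative):

```python
def gridworld_next_dist(pos, action, walls, size):
    """Distribution over next cells: the intended direction with probability
    0.8, each other direction with probability 0.2/3 (our assumption for how
    the residual 0.2 is split); blocked or off-grid moves leave the agent
    in place."""
    moves = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
    dist = {}
    for a, (dx, dy) in moves.items():
        p = 0.8 if a == action else 0.2 / 3
        nxt = (pos[0] + dx, pos[1] + dy)
        if nxt in walls or not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
            nxt = pos                      # blocked: stay put
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

# From the corner cell, two directions are off-grid, so their mass
# collapses onto the current cell.
dist = gridworld_next_dist((0, 0), "north", walls=set(), size=5)
```

The probabilities always sum to one, since blocked mass is redirected to the current cell rather than dropped.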
6.1 Experimental Setup & Results
To construct the grand teacher, we fix the total number of teachers; for each teacher, the budget is set to the total number of states. In Algorithm 3, the maximum number of iterations and the size of each dataset were set accordingly, and exponentially decaying values of $\beta_i$ were used for the mixed policy. For Algorithm 1, the confidence $\delta$ and the span bound $H$ were fixed. The optimal gain and the optimal bias vector can be approximated using the value function [?]: letting $V_n$ be the value function at iteration $n$, the optimal gain is $\rho^{\star} \approx V_{n+1}(s) - V_n(s)$ and the optimal bias vector is $h \approx V_n - n\rho^{\star}\mathbf{1}$ when $n$ is large enough. To smooth the natural variance in the student's performance, each learning curve is averaged over independent trials of student learning. To better evaluate our method, we adopt six experimental settings considering different teachers and learning algorithms. For teachers, we consider three forms. The first, referred to as the "optimal teacher," provides optimal actions and is used by the grand teachers. The second, referred to as the "worst teacher," advises the student to take actions with the lowest Q-values, while the last randomly selects action suggestions from the set of allowed moves. We also compare our method to REGAL.C (no advice), the optimal policy (without learning), and Azar's method [?]. Please note that Azar's method cannot converge to the optimal policy and suffers a loss because its performance is restricted by the teacher.
Performance, measured by average reward, is reported in Figure 1. First, it is clear that given optimal teachers, our method exactly traces the optimal policy, achieving a regret of 0. It is also important to note that in all three domains, even when the teacher is not optimal, and contrary to current techniques, our method is capable of acquiring optimal behavior. This is achievable because our method allows for learning within the multiple-teacher framework.
7 Related Work on Transfer Learning
Few theoretical results on transfer and policy advice have been achieved. Closest to this work is that of ? [?], where the authors only provide empirical validation of their approach without any theoretical analysis. Given the theoretical derivations in this paper, we note that their method [?] is a special case of ours, considering only one-teacher advice models.
Another method considering advice under multiple teachers is that of ? [?]. ? propose a method capable of selecting the best policy from a set of teacher policies and derive sublinear regret in the total number of rounds. One drawback of their method, however, is the assumption of a "good-enough" teacher that can guide the student to optimality. Such a method may suffer large regret if the overall quality of the teacher policies is poor. It also cannot obtain better policies than those of the teachers. Our algorithm remedies these problems by allowing agents to improve further, giving them the opportunity to surpass the teacher's performance.
Human advice is another good source of policy advice: such methods adopt the human's advice as the teacher's policy to improve learning performance. However, these works focus on empirical validation [?; ?].
Probabilistic policy reuse is similar to our method in that the algorithm follows its own knowledge with one probability and the teacher's policy with the complementary probability [?]. However, that probability does not decay over time, so the algorithm may fail to converge if the teacher policies are not optimal. ? introduce a policy shaping algorithm using human teachers, but focus on providing rewards rather than action advice [?]. Both of these works rely solely on empirical results.
Work on transfer for RL is also related to this paper, as policy advice can be considered an instance of transferring knowledge from teachers to students [?]. ?, for instance, propose a method to transfer high-quality samples from source to target tasks using bisimulation measures [?]. Their method transfers samples only once, while our approach gradually provides advice to the student. Due to space constraints, we refer the reader to ? [?] for a comprehensive survey.
Lifelong reinforcement learning has recently drawn significant attention in the transfer community. Brunskill and Li studied online discovery problems in a lifelong learning setting [?]. Bou Ammar et al. also studied such a problem and introduced constraints on the policy to compute "safe" policies [?]. Contrary to these works, we focus on a single agent operating within one task.
Finally, Learning from Demonstration (LfD) [?] is also related to our work, but LfD usually assumes that the expert is optimal and that the student only tries to mimic the expert.
8 Conclusion and Future Work
In this paper, we formally defined the multi-teacher advice model and introduced a new algorithm that leverages both the teachers' and the student's own knowledge in weakly communicating MDPs. We theoretically analyzed our algorithm and showed, for the first time, that the agent can achieve optimality even when starting from non-optimal teachers. Our results provide a theoretical justification for the intuition that "bad" teachers can hurt the learning process of the student. We also formally established the condition for negative transfer, shedding light on future transfer learning research; for example, researchers can choose "good teachers" based on Eq. 6 and avoid negative transfer using prior expert knowledge.
In the future, we plan to adopt other online reinforcement learning algorithms (e.g., REGAL.D [?], R-max [?], or [?]) to replace REGAL.C. We will also investigate better methods to construct the grand-teacher without exploring the whole MDP. Extensions to large-scale MDPs are an interesting direction for future research as well.
9 Acknowledgements
This research has taken place in part at the Intelligent Robot Learning (IRL) Lab, Washington State University. IRL research is supported in part by grants AFRL FA8750-14-1-0069, AFRL FA8750-14-1-0070, NSF IIS-1149917, NSF IIS-1319412, USDA 2014-67021-22174, and a Google Research Award.
References
 [Argall et al., 2009] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
 [Audibert et al., 2007] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory, pages 150–165. Springer, 2007.
 [Auer et al., 2009] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, pages 89–96, 2009.
 [Azar et al., 2013] Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill, et al. Regret bounds for reinforcement learning with policy advice. In ECML/PKDD: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2013.
 [Bartlett and Tewari, 2009] Peter L Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
 [Bou-Ammar et al., 2015] Haitham Bou-Ammar, Rasul Tutunov, and Eric Eaton. Safe policy search for lifelong reinforcement learning with sublinear regret. CoRR, abs/1505.05798, 2015.
 [Brafman and Tennenholtz, 2003] Ronen I Brafman and Moshe Tennenholtz. R-max: a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213–231, 2003.
 [Brunskill and Li, 2015] Emma Brunskill and Lihong Li. The online discovery problem and its application to lifelong reinforcement learning. CoRR, abs/1506.03379, 2015.
 [Cakmak and Lopes, 2012] Maya Cakmak and Manuel Lopes. Algorithmic and human teaching of sequential decision tasks. In AAAI Conference on Artificial Intelligence (AAAI-12), 2012.
 [Cederborg et al., 2015] Thomas Cederborg, Ishaan Grover, Charles L. Isbell, and Andrea L Thomaz. Policy Shaping With Human Teachers. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), 2015.
 [Erez and Smart, 2008] Tom Erez and William D Smart. What does shaping mean for computational reinforcement learning? In Development and Learning, 2008. ICDL 2008. 7th IEEE International Conference on, pages 215–219. IEEE, 2008.
 [Fernández and Veloso, 2006] Fernando Fernández and Manuela Veloso. Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems, pages 720–727. ACM, 2006.
 [Ferrante et al., 2008] Eliseo Ferrante, Alessandro Lazaric, and Marcello Restelli. Transfer of task representation in reinforcement learning using policy-based proto-value functions. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 3, pages 1329–1332. International Foundation for Autonomous Agents and Multiagent Systems, 2008.
 [Griffith et al., 2013] Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Isbell, and Andrea L Thomaz. Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems, pages 2625–2633, 2013.
 [Kearns and Singh, 2002] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
 [Lazaric, 2012] Alessandro Lazaric. Transfer in reinforcement learning: a framework and a survey. In Reinforcement Learning, pages 143–173. Springer, 2012.
 [MacGlashan, 2014] James MacGlashan. The Brown-UMBC Reinforcement Learning and Planning (BURLAP) library. http://burlap.cs.brown.edu/index.html, 2014.
 [Mohri et al., 2012] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.
 [Puterman, 2005] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming (wiley series in probability and statistics). 2005.
 [Ross et al., 2010] Stéphane Ross, Geoffrey J Gordon, and J Andrew Bagnell. A reduction of imitation learning and structured prediction to noregret online learning. arXiv preprint arXiv:1011.0686, 2010.
 [Sutton and Barto, 1998] Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning. MIT Press, 1998.
 [Taylor and Stone, 2009] Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10:1633–1685, 2009.
 [Taylor et al., 2014] Matthew E. Taylor, Nicholas Carboni, Anestis Fachantidis, Ioannis Vlahavas, and Lisa Torrey. Reinforcement learning agents providing advice in complex video games. Connection Science, 26(1):45–63, 2014.
 [Whitehead, 1991] Steven D Whitehead. Complexity and cooperation in Q-learning. In Proceedings of the Eighth International Workshop on Machine Learning, pages 363–367, 1991.
 [Zimmer et al., 2014] Matthieu Zimmer, Paolo Viappiani, and Paul Weng. Teacher-Student Framework: a Reinforcement Learning Approach. In AAMAS Workshop Autonomous Robots and Multirobot Systems, 2014.
Appendix A Proof of Theorem 1
Proof of Theorem 1.
Let $E_i$ denote the event that the $i$th state, $s_i$, is not visited in the first $m$ explorations. Since exploration is independent and uniform, $\Pr[E_i] = (1 - 1/S)^m \leq e^{-m/S}$. Let $E$ be the event that at least one state is not visited. Applying the union bound,

$$\Pr[E] \leq \sum_{i=1}^{S} \Pr[E_i] \leq S e^{-m/S}.$$

Setting $m = S\log(S/\delta)$, we have $\Pr[E] \leq \delta$, so every state is visited at least once with probability at least $1 - \delta$. ∎
Appendix B Proofs of Theorem 2 and Corollary 1
Next, we prove Theorem 2. To review notation, $N_i(s, a)$ denotes the number of times a state–action pair has been visited by iteration $i$, and $v_i(s, a)$ denotes the number of times a state–action pair is visited during iteration $i$. Let $\Delta_i$ be the regret incurred in iteration $i$,

(7) 

The total regret equals the sum of the $\Delta_i$ over the $m$ iterations of Algorithm 3, where Auer et al. [?] bound $m$. We also define an indicator random variable for each iteration $i$.
Lemma 1.
Consider an iteration $i$. Then we can decompose the regret into two components,

$$\Delta_i = \Delta_i^{\mathcal{R}} + \Delta_i^{\mathcal{T}},$$

where $\Delta_i^{\mathcal{R}}$ and $\Delta_i^{\mathcal{T}}$ are the regret incurred in iteration $i$ while following the policy $\pi_i$ and the grand-teacher policy, respectively.
Proof.
According to Equation (7), at each decision step the student agent either follows $\pi_i$ or the teacher's advice. Since the optimal gain does not depend on the state, the regret splits across the two sets of steps, yielding the stated decomposition, where $\Delta_i^{\mathcal{R}}$ is the regret incurred on steps where the student follows $\pi_i$ and $\Delta_i^{\mathcal{T}}$ is the regret incurred on steps where the student follows the grand teacher's advice. ∎
Lemma 2.
The total regret has the following upper bound,
(8) 
where $\alpha$ is the regret ratio introduced in Definition 2.
Proof.
Consider an iteration $i$. Lemma 1 implies that the per-iteration regret decomposes into the student's and the teacher's components. Summing over iterations and applying the regret ratio of Definition 2 to the teacher's component yields the stated bound. Hence, it remains to bound the two components separately. ∎
Lemma 3 (Theorem in [?] ).
With probability at least $1 - \delta$, the total regret of the REGAL.C algorithm satisfies the bound stated in [?], in which the decaying variable of Algorithm 3 appears.
The above lemma gives the upper bound on the regret of the REGAL.C component, as established by the theorem in [?]. For the teacher's component, we have the following bound:
Lemma 4.
With probability at least $1 - \delta$,
Proof.
Bartlett and Tewari give the following bound [?], which holds with high probability. With this result,
(9) 
Since the first term dominates the right-hand side, we need to bound it carefully. The relevant summands are independent random variables, so applying Hoeffding's inequality [?], we obtain, with probability at least $1 - \delta$,
(10) 
By the linearity of expectation, combining this with Eq. (10) and plugging the result into Eq. (9), we get
(11) 
The corresponding equation in [?] bounds the remaining term. Substituting this into Eq. (11), together with Eq. (20) in [?], yields the claim with probability at least $1 - \delta$. ∎
Proof of Theorem 2.
Theorem 2 follows by combining the decomposition of Lemma 2 with the bounds on the two regret components given in Lemmas 3 and 4, together with a union bound over the failure probabilities. ∎
Proof of Corollary 1.
If the teacher is optimal, then its regret is zero; that is, $\alpha = 0$. Therefore, the teacher's component of the regret vanishes, and the result follows. ∎
Appendix C Domains GUI Examples
Here we provide GUI examples of the Grid World and Block Dude domains used as experimental domains in the main paper.