# Learning to Switch Between Machines and Humans

###### Abstract

Reinforcement learning algorithms have been mostly developed and evaluated under the assumption that they will operate in a fully autonomous manner—they will take all actions. However, in safety critical applications, full autonomy faces a variety of technical, societal and legal challenges, which have precluded the use of reinforcement learning policies in real-world systems. In this work, our goal is to develop algorithms that, by learning to switch control between machines and humans, allow existing reinforcement learning policies to operate under different automation levels. More specifically, we first formally define the learning to switch problem using finite horizon Markov decision processes. Then, we show that, if the human policy is known, we can find the optimal switching policy directly by solving a set of recursive equations using backwards induction. However, in practice, the human policy is often unknown. To overcome this, we develop an algorithm that uses upper confidence bounds on the human policy to find a sequence of switching policies whose total regret with respect to the optimal switching policy is sublinear. Simulation experiments on two important tasks in autonomous driving—lane keeping and obstacle avoidance—demonstrate the effectiveness of the proposed algorithms and illustrate our theoretical findings.

## 1 Introduction

In recent years, reinforcement learning algorithms have achieved, or even surpassed, human performance in a variety of computer games by taking decisions autonomously, without human intervention (Mnih et al., 2015; Silver et al., 2016, 2017; Vinyals et al., 2019). Motivated by these success stories, there has been tremendous excitement about the possibility of using reinforcement learning algorithms to operate fully autonomous cyberphysical systems, especially in the context of autonomous driving. Unfortunately, a number of technical, societal and legal challenges have so far precluded this possibility from becoming a reality: humans are still more skilled drivers than machines, and the vast majority of work has focused on toy examples in controlled synthetic car simulator environments (Wymann et al., 2000; Dosovitskiy et al., 2017; Talpaert et al., 2019).

In this work, we argue that existing reinforcement learning algorithms may still enhance the operation of cyberphysical systems if deployed under lower automation levels. In other words, if we let algorithms take some of the actions and leave the remaining ones to humans, the resulting performance may be better than the performance algorithms and humans would achieve on their own (Raghu et al., 2019a; De et al., 2020). However, once we depart from full automation, we need to address the following question: when should we switch control between a machine and a human? In this work, our goal is to develop algorithms that learn to optimally switch control automatically. However, to this aim, we need to address several challenges:

Amount of human control. In each application, what is considered an appropriate and tolerable load for humans may differ (European Parliament, 2006). Therefore, we would like our algorithms to provide mechanisms to control the amount of human control (or the level of automation) during a given time period.

Number of switches. Consider two different switching patterns resulting in the same performance and amount of human control (for simplicity, we assume that the human policy does not change due to switching). Then, we would like our algorithms to favor the pattern with the least number of switches since, every time a machine defers (takes) control to (from) a human, there is an additional cognitive load for the human (Brookhuis et al., 2001).

Unknown human policies. The spectrum of human abilities spans a broad range (Macadam, 2003). As a result, there is a wide variety of potential human policies. Here, we would like our algorithms to learn personalized switching policies that, over time, adapt to the particular human they are dealing with.

To tackle these challenges, we first formally define the learning to switch problem using finite horizon Markov decision processes. Under this definition, the problem reduces to finding the switching policy that provides an optimal trade off between the environmental cost, the amount of human control and the number of switches. Then, we make the following contributions. We show that, if the human policy is known, we can find the optimal switching policy directly by solving a set of recursive Bellman equations using backwards induction. However, in practice, the human policy is often unknown, as discussed previously. To overcome this, we develop an algorithm that uses upper confidence bounds on the human policy to find a sequence of switching policies whose total regret with respect to the optimal switching policy is sublinear. Finally, we experiment with simulated data in two important tasks in semi-autonomous driving—lane keeping and obstacle avoidance—and demonstrate the effectiveness of the proposed algorithms as well as illustrate our theoretical findings (to facilitate research in this area, we will release an open-source implementation of our algorithms with the final version of the paper).

Related work. There is a rapidly increasing line of work on learning to defer decisions in the machine learning literature (Bartlett and Wegkamp, 2008; Cortes et al., 2016; Geifman et al., 2018; Geifman and El-Yaniv, 2019; Raghu et al., 2019b; Ramaswamy et al., 2018; Thulasidasan et al., 2019; Raghu et al., 2019a; Liu et al., 2019; De et al., 2020). However, previous work has typically focused on supervised learning. More specifically, it has developed classifiers that learn to defer either by considering the defer action as an additional label value, by training an independent classifier to decide about deferred decisions, or by reducing the problem to a combinatorial optimization problem. Moreover, except for two very recent notable exceptions (Raghu et al., 2019a; De et al., 2020), this work does not consider a human decision maker who takes a decision whenever the classifier defers it. In contrast, we focus on reinforcement learning and develop algorithms that learn to switch control between a human policy and a machine policy.

Our work contributes to an extensive body of work on human-machine collaboration (Macindoe et al., 2012; Nikolaidis et al., 2015; Hadfield-Menell et al., 2016; Nikolaidis et al., 2017; Grover et al., 2018; Wilson and Daugherty, 2018; Haug et al., 2018; Tschiatschek et al., 2019; Kamalaruban et al., 2019; Radanovic et al., 2019; Ghosh et al., 2020). However, rather than developing algorithms that learn to switch control between humans and machines, previous work has predominantly considered settings in which the machine and the human interact with each other. Finally, our work also relates to a recent line of work that combines deep reinforcement learning with opponent modeling to robustly switch between multiple machine policies (Everett and Roberts, 2018; Zheng et al., 2018). However, this line of work neither considers a human policy nor derives theoretical guarantees on the performance of the proposed algorithms.

## 2 Problem Formulation

Our starting point is the following problem setting, which fits a variety of real-world applications. At each time step t\in\{0,\ldots,T-1\}, our (cyberphysical) system is characterized by a state s_{t}\in\Scal, where \Scal is a finite state space, and a control switch d_{t}\in\Dcal, with \Dcal=\{\mathbb{H},\mathbb{M}\}, which determines who takes an action a_{t}\in\Acal, where \Acal is a finite action space. More specifically, the switch value d_{t} is sampled from a (time-varying) switching policy \pi_{t}(d_{t}{\,|\,}s_{t},d_{t-1}). If d_{t}=\mathbb{H}, the action a_{t} is sampled from a human policy p_{\mathbb{H}}(a_{t}{\,|\,}s_{t}) and, if d_{t}=\mathbb{M}, it is sampled from a machine policy p_{\mathbb{M}}(a_{t}{\,|\,}s_{t}). Throughout the paper, we will assume that the machine policy p_{\mathbb{M}} is known. Moreover, given a state s_{t} and an action a_{t}, the state s_{t+1} is sampled from a transition probability p(s_{t+1}{\,|\,}s_{t},a_{t}). Finally, given a trajectory of switching patterns and states \tau=\{(s_{t},d_{t})\}_{t=0}^{T-1} and an initial state (s_{0},d_{-1}), we define the total cost c(\tau{\,|\,}s_{0},d_{-1}) as:

 c(\tau{\,|\,}s_{0},d_{-1})=\sum_{t=0}^{T-1}\bar{c}(s_{t},d_{t})+\lambda_{1}\II\left[d_{t}=\mathbb{H}\right]+\lambda_{2}\II\left[d_{t}\neq d_{t-1}\right], (1)

where the first term \bar{c}(s_{t},d_{t})=\EE_{a_{t}\,\sim\,p_{d_{t}}(\cdot{\,|\,}s_{t})}\left[c^{\prime}(s_{t},a_{t})\right] is the expected environment cost of switch value d_{t} at state s_{t}, c^{\prime}(s_{t},a_{t}) is the environment cost of action a_{t} at state s_{t}, the second and third terms penalize the amount of human control and number of switches, respectively, and the parameters \lambda_{1} and \lambda_{2} control the trade off between the expected environmental cost, the amount of human control and the number of switches.
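To make Eq. 1 concrete, the following minimal sketch computes the total cost of a switching trajectory; names such as `trajectory_cost` and the dictionary `c_bar` (holding the expected environment costs \bar{c}(s,d)) are our own illustrative choices, not part of the paper.

```python
def trajectory_cost(states, switches, d_init, c_bar, lam1, lam2):
    """Total cost of a trajectory (Eq. 1): expected environment cost plus a
    penalty lam1 per human-controlled step and lam2 per control switch."""
    total, d_prev = 0.0, d_init
    for s, d in zip(states, switches):
        total += c_bar[(s, d)]            # expected environment cost c_bar(s_t, d_t)
        total += lam1 * (d == "H")        # amount-of-human-control penalty
        total += lam2 * (d != d_prev)     # switching penalty
        d_prev = d
    return total
```

Note that the switching penalty also applies at t=0 whenever d_0 differs from the initial switch value d_{-1}.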

Next, we characterize the above problem setting using a finite horizon Markov decision process (MDP) \Mcal=(\Scal\times\Dcal,\Dcal,P_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}},C_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}},T), where \Scal\times\Dcal is an augmented state space, the set of actions \Dcal is just the set of switch values, the transition dynamics P_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}} at time t are given by

 \displaystyle p_{\pi_{t}|p_{\mathbb{H}},p_{\mathbb{M}}}(s_{t+1},d_{t}{\,|\,}s_{t},d_{t-1})=\pi_{t}(d_{t}{\,|\,}s_{t},d_{t-1})p(s_{t+1}{\,|\,}s_{t},d_{t})=\pi_{t}(d_{t}{\,|\,}s_{t},d_{t-1})\sum_{a\in\Acal}p_{d_{t}}(a{\,|\,}s_{t})p(s_{t+1}{\,|\,}s_{t},a),

the immediate cost C_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}} at time t is given by

 c_{\pi_{t}}(s_{t},d_{t-1})=\EE_{d^{\prime}\sim\pi_{t}(\cdot{\,|\,}s_{t},d_{t-1})}[\bar{c}(s_{t},d^{\prime})+\lambda_{1}\II(d^{\prime}=\mathbb{H})+\lambda_{2}\II(d^{\prime}\neq d_{t-1})], (2)

and T is the time horizon. Note that, by using conditional expectations, we can compute the expected cost of a trajectory, given by Eq. 1, from the above immediate costs and, while the set of actions \Acal of the machine and the human policy can be large, the action space of this Markov decision process is binary (\ie, \mathbb{H} and \mathbb{M}).
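The marginalization over actions in the transition dynamics above, p(s_{t+1}{\,|\,}s_{t},d_{t})=\sum_{a}p_{d_{t}}(a{\,|\,}s_{t})p(s_{t+1}{\,|\,}s_{t},a), can be sketched as follows; `p_act` and `p_env` are hypothetical dictionary representations of the controller policies p_{d}(a{\,|\,}s) and the environment dynamics p(s^{\prime}{\,|\,}s,a).

```python
def switch_transition(p_act, p_env, s, d):
    """Marginal next-state distribution p(s' | s, d) = sum_a p_d(a|s) p(s'|s,a).

    p_act[d][s] : dict mapping action -> probability under controller d
    p_env[(s, a)] : dict mapping next state -> probability
    """
    out = {}
    for a, pa in p_act[d][s].items():
        for s_next, ps in p_env[(s, a)].items():
            out[s_next] = out.get(s_next, 0.0) + pa * ps
    return out
```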

Then, our goal is to find the optimal switching policy \pi^{*}=(\pi^{*}_{0},\ldots,\pi^{*}_{T-1}) that minimizes the expected cost, as defined by Eq. 1, \ie,

 \pi^{*}=\argmin_{\pi}\EE_{\tau\sim P_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}}}\left[c(\tau{\,|\,}s_{0},d_{-1})\right], (3)

where the expectation is taken over all the trajectories induced by the switching policy given the human and machine policies p_{\mathbb{H}} and p_{\mathbb{M}}.

## 3 Known Human Policy

In this section, we address the problem of learning to switch, as defined in Eq. 3, under the assumption that the human policy p_{\mathbb{H}} is known.

Given a finite horizon Markov decision process \Mcal=(\Scal\times\Dcal,\Dcal,P_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}},C_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}},T), as defined in Section 2, and a switching policy \pi=(\pi_{0},\ldots,\pi_{T-1}), we define the optimal value function v_{t}(s,d) for each t\in\{0,\ldots,T-1\} as

 v_{t}(s,d)=\min_{\pi_{t},\ldots,\pi_{T-1}}\EE\left[\sum_{t^{\prime}=t}^{T-1}c_{\pi_{t^{\prime}}}(s_{t^{\prime}},d_{t^{\prime}-1}){\,|\,}s_{t}=s,d_{t-1}=d\right]. (4)

In the above, note that we can directly recover the objective function in Eq. 3 from v_{0}(s_{0},d_{-1}). Moreover, using Bellman’s principle of optimality, it is easy to show that v_{t}(s,d) satisfies the following recursive equation (refer to Appendix A.1):

 \displaystyle v_{t}(s,d)=\min_{\pi_{t}(\cdot{\,|\,}s,d)}\left\{c_{\pi_{t}}(s,d)+\EE_{d^{\prime}\sim\pi_{t}(\cdot{\,|\,}s,d),\,s^{\prime}\sim p(\cdot{\,|\,}s,d^{\prime})}\left[v_{t+1}(s^{\prime},d^{\prime})\right]\right\}, (5)

with v_{T}(s,d)=0 for all s\in\Scal and d\in\Dcal.

Then, we can directly solve the minimization problem in Eq. 5 and express the optimal switching policy \pi^{*}_{t} in terms of the optimal value function v_{t+1}. More specifically, for any s\in\Scal and d\in\Dcal, define

 \displaystyle q_{t{\,|\,}\mathbb{M}}(s,d)=\bar{c}(s,\mathbb{M})+\lambda_{2}\II(\mathbb{M}\neq d)+\EE_{a\sim p_{\mathbb{M}}(\cdot{\,|\,}s),\,s^{\prime}\sim p(\cdot{\,|\,}s,a)}\left[v_{t+1}(s^{\prime},\mathbb{M})\right] (6) \displaystyle q_{t{\,|\,}\mathbb{H}}(s,d)=\bar{c}(s,\mathbb{H})+\lambda_{1}+\lambda_{2}\II(\mathbb{H}\neq d)+\EE_{a\sim p_{\mathbb{H}}(\cdot{\,|\,}s),\,s^{\prime}\sim p(\cdot{\,|\,}s,a)}\left[v_{t+1}(s^{\prime},\mathbb{H})\right] (7)

Then, we have the following proposition (proven in Appendix A.2): {proposition} For any s\in\Scal and d\in\Dcal, the optimal switching policy \pi^{*}_{t}(d^{\prime}=\mathbb{M}{\,|\,}s,d)=1 if

 q_{t{\,|\,}\mathbb{M}}(s,d)<q_{t{\,|\,}\mathbb{H}}(s,d)

and \pi^{*}_{t}(d^{\prime}=\mathbb{M}{\,|\,}s,d)=0 otherwise. Using the above result, we can find the optimal switching policy \pi^{*}=(\pi^{*}_{0},\ldots,\pi^{*}_{T-1}) using backwards induction, starting with v_{T}(s,d)=0 for all s\in\Scal and d\in\Dcal. Here, note that the expectations in Eqs. 6 and 7 do not depend on the switching policy but only on the machine and human policies, which are known. Algorithm 1 summarizes the whole procedure.
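The backwards induction procedure underlying Algorithm 1 can be sketched as follows. This is an illustrative implementation under our own naming conventions (e.g., `step(s, d)` returning the marginal next-state distribution under controller d, and `c_bar` holding the expected environment costs), not the authors' exact code.

```python
def optimal_switching_policy(S, T, c_bar, step, lam1, lam2):
    """Backwards induction over the augmented state space S x {H, M}.

    c_bar[(s, d)] : expected environment cost of controller d in state s
    step(s, d)    : dict mapping next state -> probability when d acts in s
    Returns the value function v[t][(s, d)] and greedy policy pi[t][(s, d)].
    """
    v = [dict() for _ in range(T + 1)]
    pi = [dict() for _ in range(T)]
    for s in S:
        for d in ("H", "M"):
            v[T][(s, d)] = 0.0                      # terminal condition v_T = 0
    for t in range(T - 1, -1, -1):
        for s in S:
            for d in ("H", "M"):
                q = {}
                # q_{t|M}(s, d) and q_{t|H}(s, d), as in Eqs. (6)-(7)
                for d2 in ("M", "H"):
                    cont = sum(p * v[t + 1][(s2, d2)]
                               for s2, p in step(s, d2).items())
                    q[d2] = (c_bar[(s, d2)]
                             + lam1 * (d2 == "H")   # human-control penalty
                             + lam2 * (d2 != d)     # switching penalty
                             + cont)
                pi[t][(s, d)] = "M" if q["M"] < q["H"] else "H"
                v[t][(s, d)] = min(q["M"], q["H"])
    return v, pi
```

As in the proposition, the machine is chosen exactly when q_{t|M}(s,d) is strictly smaller than q_{t|H}(s,d).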

Remarks. In practice, to implement Algorithm 1, we only need to have access to (off-policy) historical driving data about the human driver, rather than explicitly fitting a (parameterized) model for the human policy. This is because Algorithm 1 only depends on the human policy through two expectations (lines 7-8). Therefore, we can use the historical data to compute a finite sample Monte-Carlo estimator of these expectations.
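As a minimal illustration of such a finite sample Monte-Carlo estimator, the expectation \bar{c}(s,\mathbb{H})=\EE_{a\sim p_{\mathbb{H}}(\cdot{\,|\,}s)}[c^{\prime}(s,a)] can be estimated from off-policy logs as follows; the function and argument names are hypothetical.

```python
def estimate_cbar_human(history, cost, s):
    """Monte Carlo estimate of c_bar(s, H) = E_{a ~ p_H(.|s)}[c'(s, a)] from
    off-policy logs, i.e., a list of (state, action) pairs of the human driver."""
    samples = [cost(s, a) for s_obs, a in history if s_obs == s]
    if not samples:
        return None          # state never visited by the human in the logs
    return sum(samples) / len(samples)
```

The expectation over next-state values in line 8 of the algorithm can be estimated analogously from logged (state, action, next state) triplets.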

## 4 Unknown Human Policy

In this section, we address the problem of learning to switch, as defined in Eq. 3, in a more realistic setting in which the human policy is unknown—we do not know the particular human driver we are dealing with.

When we do not know the human policy our switching policy is dealing with, we need to trade off exploitation, \ie, minimizing the expected cost, and exploration, \ie, learning about the human policy. To this end, we look at the problem from the perspective of episodic learning and proceed as follows.

We consider K independent subsequent episodes of length L and denote the aggregate length of all episodes as T=KL. Each of these episodes corresponds to a realization of the same finite horizon Markov decision process \Mcal=(\Scal\times\Dcal,\Dcal,P_{\pi|p^{*}_{\mathbb{H}},p_{\mathbb{M}}},C_{\pi|p^{*}_{\mathbb{H}},p_{\mathbb{M}}},L), where p^{*}_{\mathbb{H}} denotes the true human policy. However, since the true human policy p^{*}_{\mathbb{H}} is unknown to us, just before each episode k starts, our goal is to find a switching policy \pi^{k} with desirable properties in terms of total regret R(T), which is given by:

 R(T)=\sum_{k=1}^{K}\left[\EE_{\tau\sim P_{\pi^{k}|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}}\left[c(\tau{\,|\,}s_{0},d_{-1})\right]-\EE_{\tau\sim P_{\pi^{*}|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}}\left[c(\tau{\,|\,}s_{0},d_{-1})\right]\right], (8)

where \pi^{*} is the optimal switching policy under the true human policy p^{*}_{\mathbb{H}}. To achieve our goal, we apply the principle of optimism in the face of uncertainty, \ie,

 \pi^{k}=\argmin_{\pi}\min_{p_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}}}\EE_{\tau\sim P_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}}}\left[c(\tau{\,|\,}s_{0},d_{-1})\right], (9)

where \Pcal^{k}_{\mathbb{H}} is a (|\Scal|\times L)-rectangular confidence set, \ie, \Pcal^{k}_{\mathbb{H}}=\bigtimes_{s,t}\Pcal^{k}_{\mathbb{H}{\,|\,}s,t}. Here, note that the confidence set is constructed using data gathered during the first k-1 episodes and allows for time-varying human policies p_{\mathbb{H}}(\cdot{\,|\,}s,t). However, to solve Eq. 9, we first need to explicitly define the confidence set. To this end, we first define the empirical distribution \hat{p}^{k}_{\mathbb{H}}(\cdot{\,|\,}s) of the true human policy p^{*}_{\mathbb{H}}(\cdot{\,|\,}s) just before episode k starts as:

 \displaystyle\hat{p}^{k}_{\mathbb{H}}(a{\,|\,}s)=\begin{cases}\frac{N_{k}(s,a)}{N_{k}(s)}&\text{if }N_{k}(s)\neq 0\\ \frac{1}{|\Acal|}&\text{otherwise},\end{cases} (10)

where

 \displaystyle N_{k}(s)=\sum_{l=1}^{k-1}\sum_{t\in[L]}\II(s_{t}=s,d_{t}=\mathbb{H}\textnormal{ in episode }l),\quad\text{and}\quad N_{k}(s,a)=\sum_{l=1}^{k-1}\sum_{t\in[L]}\II(s_{t}=s,a_{t}=a,d_{t}=\mathbb{H}\textnormal{ in episode }l).

Then, similarly as in Jaksch et al. (2010), we opt for an L^{1} confidence set \Pcal^{k}_{\mathbb{H}}(\delta)=\bigtimes_{s,t}\Pcal^{k}_{\mathbb{H}{\,|\,}s,t}(\delta) (this choice will result in a sequence of switching policies with desirable properties in terms of total regret) with

 \Pcal^{k}_{\mathbb{H}{\,|\,}s,t}(\delta)=\left\{\,p_{\mathbb{H}}\,:\,||p_{\mathbb{H}}(\cdot{\,|\,}s,t)-\hat{p}^{k}_{\mathbb{H}}(\cdot{\,|\,}s)||_{1}\leq\beta_{k}(s,\delta)\right\}, (11)

for all s\in\Scal and t\in[L], where \delta is a given parameter and

 \beta_{k}(s,\delta)=\sqrt{\frac{14|\Acal|\log\left(\frac{(k-1)L|\Scal|}{\delta}\right)}{\max\{1,N_{k}(s)\}}}. (12)
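A sketch of the empirical estimate in Eq. 10 and of the confidence radius in Eq. 12 follows; names are hypothetical, and note that we guard the k=1 case (where the logarithm in Eq. 12 is undefined) by clamping k-1 to 1, which is our own choice.

```python
import math

def empirical_human_policy(counts_sa, counts_s, actions, s):
    """Empirical estimate hat p^k_H(.|s) of Eq. 10: relative action frequencies
    under human control, or the uniform distribution if s was never visited."""
    n = counts_s.get(s, 0)
    if n == 0:
        return {a: 1.0 / len(actions) for a in actions}
    return {a: counts_sa.get((s, a), 0) / n for a in actions}

def confidence_radius(k, L, n_states, n_actions, n_s, delta):
    """L1 confidence radius beta_k(s, delta) of Eq. 12; the max(1, k-1) clamp
    for the first episode is an assumption on our part."""
    return math.sqrt(
        14 * n_actions * math.log(max(1, k - 1) * L * n_states / delta)
        / max(1, n_s))
```

The radius shrinks as the state is visited more often under human control, so the confidence set in Eq. 11 tightens around the empirical policy.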

Moreover, for each episode k, we define the optimal value function v^{k}_{t}(s,d) as

 v^{k}_{t}(s,d)=\min_{\pi_{t},\ldots,\pi_{L-1}}\min_{p_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}}(\delta)}\EE\left[\sum_{t^{\prime}=t}^{L-1}c_{\pi_{t^{\prime}}}(s_{t^{\prime}},d_{t^{\prime}-1}){\,|\,}s_{t}=s,d_{t-1}=d\right] (13)

Then, we are ready to use the following key theorem (proven in Appendix A.3), which gives a solution to Eq. 9: {theorem} For any episode k, the optimal value function v^{k}_{t}(s,d) satisfies the following recursive equation:

 \displaystyle v^{k}_{t}(s,d)=\min\left(q^{k}_{t{\,|\,}\mathbb{M}}(s,d),\min_{p_{\mathbb{H}}(\cdot{\,|\,}s,t)\in\Pcal^{k}_{\mathbb{H}{\,|\,}s,t}}q^{k}_{t{\,|\,}\mathbb{H}}(s,d)\right)

where

 \displaystyle q^{k}_{t{\,|\,}\mathbb{M}}(s,d)=\bar{c}(s,\mathbb{M})+\lambda_{2}\II(\mathbb{M}\neq d)+\EE_{a\sim p_{\mathbb{M}}(\cdot{\,|\,}s),\,s^{\prime}\sim p(\cdot{\,|\,}s,a)}\left[v^{k}_{t+1}(s^{\prime},\mathbb{M})\right] \displaystyle q^{k}_{t{\,|\,}\mathbb{H}}(s,d)=\bar{c}(s,\mathbb{H})+\lambda_{1}+\lambda_{2}\II(\mathbb{H}\neq d)+\EE_{a\sim p_{\mathbb{H}}(\cdot{\,|\,}s,t),\,s^{\prime}\sim p(\cdot{\,|\,}s,a)}\left[v^{k}_{t+1}(s^{\prime},\mathbb{H})\right]

with v^{k}_{L}(s,d)=0 for all s\in\Scal and d\in\Dcal. Moreover, for any s\in\Scal and d\in\Dcal, the optimal switching policy \pi^{k}_{t}(d^{\prime}=\mathbb{M}{\,|\,}s,d)=1 if

 q^{k}_{t{\,|\,}\mathbb{M}}(s,d)<\min_{p_{\mathbb{H}}(\cdot{\,|\,}s,t)\in\Pcal^{k}_{\mathbb{H}{\,|\,}s,t}}q^{k}_{t{\,|\,}\mathbb{H}}(s,d)

and \pi^{k}_{t}(d^{\prime}=\mathbb{M}{\,|\,}s,d)=0 otherwise. The above result readily implies that, just before each episode k starts, we can find the optimal switching policy \pi^{k}=(\pi^{k}_{0},\ldots,\pi^{k}_{L-1}) using backwards induction, starting with v^{k}_{L}(s,d)=0 for all s\in\Scal and d\in\Dcal. Moreover, similarly as in Strehl and Littman (2008), we can solve the minimization problem \min_{p_{\mathbb{H}}(\cdot{\,|\,}s,t)\in\Pcal^{k}_{\mathbb{H}{\,|\,}s,t}}q^{k}_{t{\,|\,}\mathbb{H}}(s,d) analytically using the following Lemma (proven in Appendix A.4): {lemma} Consider the following minimization problem:

 \min_{x_{1},\ldots,x_{m}}\,\sum_{i=1}^{m}w_{i}x_{i}\quad\text{subject to}\quad\sum_{i=1}^{m}x_{i}=1,\quad x_{i}\geq 0\,\,\forall i,\quad\sum_{i=1}^{m}|x_{i}-b_{i}|\leq d, (14)

where d\geq 0, b_{i}\geq 0\,\,\forall i\in\{1,\ldots,m\}, \sum_{i}b_{i}=1 and 0\leq w_{1}\leq w_{2}\leq\dots\leq w_{m}. Then, the solution to the above minimization problem is given by:

 \displaystyle x^{*}_{i}=\begin{cases}\,\min\{1,b_{1}+\frac{d}{2}\}&\mbox{if}\,\,i=1\\ \,b_{i}&\mbox{if}\,\,i>1\,\,\text{and}\,\,\sum_{l=1}^{i}x_{l}\leq 1\\ \,0&\mbox{otherwise.}\end{cases} (15)

More specifically, to apply the lemma, we just need to consider m=|\Acal| and, for all i\in\{1,\ldots,m\}, x_{i}=p_{\mathbb{H}}(a_{i}{\,|\,}s,t), w_{i}=\EE_{s^{\prime}\sim p(\cdot|s,a_{i})}[v^{k}_{t+1}(s^{\prime},\mathbb{H})], b_{i}=\hat{p}^{k}_{\mathbb{H}}(a_{i}{\,|\,}s) and d=\beta_{k}(s,\delta). Algorithm 2 summarizes the whole procedure.
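The analytical solution of the Lemma is the classic routine from Strehl and Littman (2008) and Jaksch et al. (2010): shift up to d/2 extra probability mass onto the action with the smallest continuation value and remove the surplus from the most expensive actions. A sketch, with hypothetical names:

```python
def optimistic_distribution(b, w, d):
    """Minimize sum_i w_i * x_i over probability vectors x with ||x - b||_1 <= d.

    b: empirical probabilities (hat p^k_H), w: continuation values,
    d: L1 radius (beta_k).  Mass is added to the cheapest index and removed
    from the most expensive ones until x sums to 1 again.
    """
    order = sorted(range(len(w)), key=lambda i: w[i])   # indices by ascending w
    x = list(b)
    cheapest = order[0]
    x[cheapest] = min(1.0, b[cheapest] + d / 2)         # add mass to cheapest action
    j = len(order) - 1
    while sum(x) > 1.0:                                  # remove surplus, dearest first
        i = order[j]
        x[i] = max(0.0, x[i] - (sum(x) - 1.0))
        j -= 1
    return x
```

In the notation of the Lemma, the returned vector matches Eq. 15 after sorting the actions by their weights w_{i}.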

Within the algorithm, the function \textsc{GetOptimal}(\cdot) finds the optimal policy \pi^{k} using backwards induction, similarly as in Algorithm 1; however, in contrast with Algorithm 1, it computes q_{\mathbb{H}} by solving a minimization problem using Lemma 4. Moreover, it is important to notice that, in lines 7–18, the switching policy \pi^{k} is actually deployed, the machine and the true human take actions and, as a result, action data from the true human is gathered.

Finally, the following theorem shows that the sequence of policies \{\pi^{k}\}_{k=1}^{K} found by Algorithm 2 achieves sublinear total regret, as defined in Eq. 8 (proven in Appendix A.5): {theorem} Assume we use Algorithm 2 to find the switching policies \pi^{k}. Then, with probability at least 1-\delta, it holds that

 R(T)\leq\rho L\sqrt{|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)}, (16)

where \rho>0 is a constant.

## 5 Experiments

In this section, we perform a variety of simulation experiments in autonomous driving. Our goal here is to demonstrate that the switching policies found by Algorithms 1 and 2 enable the resulting cyberphysical system to successfully perform lane keeping and obstacle avoidance (we ran all experiments on a machine equipped with an Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz and 12 GB of memory).

Environment setup. We consider three types of lane driving environments, as illustrated in Figure 4. Each type of environment requires different driving skills. For example, in the environment (a), the cyberphysical system only needs to perform lane keeping to drive through the traffic-free road. In contrast, in the environments (b-c), it needs to perform both lane keeping and obstacle avoidance to drive through heavy traffic and avoid complex obstacles such as stones. In each of these three lane environments, there are three lanes, 3\times 10 cells and the type of each individual cell (\ie, road, car, stone or grass) is sampled independently at random with a probability that depends on the type of environment.

The goal of the cyberphysical system is to drive the car from an initial state at the bottom of the lane to the top of the lane. At any given time t, we assume that whoever is in control—be it the machine or the human—can take three different actions \Acal=\{\texttt{left, straight, right}\}. Action left steers the car to the left of the current lane, action right steers it to the right and action straight leaves the car in the current lane. If the car is already on the leftmost (rightmost) lane when taking action left (right), then the lane is randomly chosen with probability 0.5. Irrespective of the action taken, the car always moves forward. Therefore, whenever the human policy is known, we have T=9, and, whenever it is unknown, we have L=9.

State space. To evaluate the switching policies found by Algorithm 1, we experiment both with a cell-based and a sensor-based state space and, to evaluate the switching policies found by Algorithm 2, we experiment only with a sensor-based state space.

Cell-based state space: We characterize each individual lane driving environment using a different cell-based state space, where each cell within the environment represents a state. Therefore, the resulting MDP has 3\times 10 states. This choice of state space is transductive since it can only be used in a single environment and the resulting switching policy cannot be applied in a different environment (one could think of defining a cell-based state space for a set of multiple lane driving environments; however, the resulting state space could only be used in the environments in that set).

Sensor-based state space: We characterize all lane driving environments using the same sensor-based state space representation. More specifically, the state values are just the type of the current cell and the three cells the car can move into in the next time step, \eg, assume that at time t the car is on a road cell and, if it moves forward left, it hits a stone, if it moves forward straight, it hits a car, and, if it moves forward right, it drives over grass; then its state value is s_{t}=(\texttt{road},\texttt{stone},\texttt{car},\texttt{grass}). Moreover, if the car is on the leftmost (rightmost) lane, then we set the value of the second (fourth) dimension in s_{t} to \emptyset. Therefore, the resulting MDP has \sim4^{4} states and, given a type of driving environment, we can compute the transition probabilities p(s_{t+1}{\,|\,}s_{t},a_{t}) analytically. This choice of state space representation is inductive since it can be used across different environments and the same switching policy can be applied across multiple environments (note that a switching policy that is optimal for one type of lane driving environment may be suboptimal in other types of environments).

Cost and human/machine policies. Under both state space representations, we consider a state-dependent environment cost \bar{c}(s_{t},d_{t})=\bar{c}(s_{t}) that depends on the type of the cell the car is on at state s_{t}, \ie, \bar{c}(s_{t})=0 if the type of the current cell is road, \bar{c}(s_{t})=2 if it is grass, \bar{c}(s_{t})=4 if it is stone and \bar{c}(s_{t})=5 if it is car. Moreover, we consider that whoever is in control—be it the machine or the human—picks which action to take (left, straight or right) according to a noisy estimate of the environment cost of the three cells that the car can move into in the next time step. More specifically, the machine computes a noisy estimate of the cost \hat{c}(s)=\bar{c}(s)+\epsilon_{s} of each of the three cells the car can move into, where \epsilon_{s}\sim N(0,\sigma_{\mathbb{M}}), and picks the action that moves the car to the cell with the lowest noisy estimate. The human also computes a noisy estimate of the costs \hat{c}(s)=\bar{c}(s)+\epsilon_{s}, where \epsilon_{s}\sim N(0,\sigma_{\mathbb{H}}) with \sigma_{\mathbb{H}}<\sigma_{\mathbb{M}}; however, she does not always pick the action that moves the car to the cell with the lowest noisy estimate. In particular, if there is a car in either of the three cells that she can move into, she panics and moves to the cell where the car is with probability p_{\text{panic}}. In other words, we assume the human driver is generally more reliable than the machine; however, when there is an imminent danger (\ie, a potential crash against another car), the machine is more reliable than the human. Throughout our experiments, if not said otherwise, we set \sigma_{\mathbb{H}}=1, \sigma_{\mathbb{M}}=4 and p_{\text{panic}}=0.4. Finally, without loss of generality, we consider that only the car driven by our cyberphysical system moves in the environment.
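The noisy-cost action selection described above can be sketched as follows; names such as `pick_action` and `CELL_COST` are our own, and the panic behaviour follows the description in the text (a human with p_panic > 0 steers toward a visible car cell with that probability).

```python
import random

# State-dependent environment costs from the experimental setup.
CELL_COST = {"road": 0, "grass": 2, "stone": 4, "car": 5}

def pick_action(front_cells, sigma, p_panic=0.0):
    """Choose among {left, straight, right} given the types of the three
    reachable cells, using noisy cost estimates c_hat(s) = c_bar(s) + eps,
    eps ~ N(0, sigma).  With probability p_panic, a human who sees a 'car'
    cell panics and steers into it."""
    actions = ["left", "straight", "right"]
    if p_panic > 0 and "car" in front_cells and random.random() < p_panic:
        return actions[front_cells.index("car")]    # panic: move toward the car
    noisy = [CELL_COST[c] + random.gauss(0, sigma) for c in front_cells]
    return actions[noisy.index(min(noisy))]         # lowest noisy cost wins
```

The machine corresponds to `p_panic=0` with sigma_M, and the human to `p_panic=0.4` with the smaller sigma_H.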

Insights into the optimal switching policies. First, we assume the human policy is known and use Algorithm 1 to find the optimal switching policies in a variety of lane driving environments using both a cell-based and a sensor-based state space. Figure 7 shows the trajectories induced by the optimal switching policies for several types of lane driving environments and different values of the parameters \lambda_{1} and \lambda_{2}, which control the trade off between the expected environment cost, the amount of human control and the number of switches. Here, whenever we use the sensor-based state space, note that we obtain an optimal switching policy for each type of environment, rather than a single environment. However, for ease of visualization, we show the trajectories induced by the policy in a single individual environment per environment type, picked at random. The results reveal several interesting insights. In the absence of traffic (Environment 1), the optimal switching policy gives control to the human most of the time as long as the cost of human control, set by \lambda_{1}, is not too large (\lambda_{1}=0.6 vs \lambda_{1}=0.2). However, this is not surprising since, in our experimental setup, the human policy is always better than the machine policy in the absence of traffic. Whenever there is traffic (Environments 2 and 3), the optimal switching policy gives control to the machine whenever there is an imminent danger (\ie, a potential crash against another car). This happens because, in such a situation, the human policy is worse than the machine policy in our experimental setup. In our experiments, we did not find noticeable differences among the trajectories based on the cell-based and sensor-based state spaces. However, to implement the former, we need to solve one set of equations for each individual environment while, to implement the latter, we just need to solve one set of equations per type of environment.

Next, we assume the human policy is unknown and use Algorithm 2 to find a sequence of switching policies with sublinear regret in a variety of lane driving environments using a sensor-based state space. Figure 10 shows the trajectories induced by the switching policies found by our algorithm across different episodes within a sequence for different values of the parameters \lambda_{1} and \lambda_{2}. The results show that, in the latter episodes, the algorithm has learned to rely on the machine (blue segments) to drive whenever there is an imminent danger (\ie, a potential crash against another car). Moreover, whenever the amount of human control and number of switches is not penalized (\ie, \lambda_{1}=\lambda_{2}=0), the algorithm switches to the human more frequently in order to reduce the environment cost.

Quantitative performance. We first consider that the human policy is known and evaluate the performance achieved by the (optimal) switching policies found using Algorithm 1. More specifically, we investigate the influence that the quality of the human driver, as tuned by the noise variance \sigma_{\mathbb{H}}, has on the number of switches and the amount of human control. Figure 13 summarizes the results for several types of environments and values of the parameters \lambda_{1} and \lambda_{2} using a sensor-based state space. As one might hope, we find that, if the human driver is less (more) skilled, the optimal switching policy decides to reduce (increase) the amount of human control and number of switches. Moreover, whenever the amount of human control and number of switches is penalized (\ie, \lambda_{1}>0, \lambda_{2}>0), the algorithm is stricter with the human driver and relies entirely on the machine for \sigma_{\mathbb{H}}\geq 3.

Next, we assume we do not have any prior knowledge on the human policy and evaluate the performance achieved by the sequence of policies found using Algorithm 2 under a sensor-based state space. To this aim, we compare the total regret achieved by the sequence of policies, as defined in Eq. 8, and that achieved by a greedy baseline, which just finds the optimal policy at each episode k using Algorithm 1 with \hat{p}^{k}_{\mathbb{H}}, as defined in Eq. 10, as human policy. Figure 16 summarizes the results for two types of environments and values of the parameters \lambda_{1} and \lambda_{2}. As expected, the sequence of policies found by Algorithm 2 achieves sublinear regret while those found by the greedy baseline, due to a lack of exploration, achieve linear regret. However, whenever the number of switches and the amount of human control is penalized (\ie, \lambda_{1}>0, \lambda_{2}>0), the human is in control for less time and Algorithm 2 takes longer to accurately estimate the skill of the human driver it is dealing with. As a result, its competitive advantage with respect to the greedy algorithm only becomes apparent after 2{,}000 episodes.

## 6 Conclusions

In this work, we have tackled the problem of learning to switch control between machines and humans in sequential decision making. After formally defining the learning to switch problem using finite horizon MDPs, we have first shown that, if the human policy is known, the optimal switching policy can be found simply by solving a set of recursive equations using backwards induction. Then, we have developed an algorithm that, without prior knowledge of the human policy, is able to find a sequence of switching policies whose total regret is sublinear. Finally, we have performed a variety of simulation experiments on autonomous driving to demonstrate the effectiveness of our algorithms and illustrate our theoretical results.

Our work opens up many interesting avenues for future work. For example, we have assumed that the machine policy is fixed; however, there are reasons to believe that simultaneously optimizing the machine policy and the switching policy may lead to superior performance (De et al., 2020). Throughout the paper, we have also assumed that the state space is discrete. It would be very interesting to lift this assumption and develop approximate value iteration methods to solve the learning to switch problem. Moreover, we have assumed that the human policy does not change as a result of switching control; however, this assumption is often violated in practice (Wolfe et al., 2019). Finally, it would be interesting to assess the performance of our algorithms using interventional experiments on a real-world (semi-)autonomous driving system.

## References

• Bartlett and Wegkamp [2008] P. Bartlett and M. Wegkamp. Classification with a reject option using a hinge loss. JMLR, 2008.
• Brookhuis et al. [2001] K. Brookhuis, D. De Waard, and W. Janssen. Behavioural impacts of advanced driver assistance systems–an overview. European Journal of Transport and Infrastructure Research, 1(3), 2001.
• Cortes et al. [2016] C. Cortes, G. DeSalvo, and M. Mohri. Learning with rejection. In ALT, 2016.
• De et al. [2020] A. De, P. Koley, N. Ganguly, and M. Gomez-Rodriguez. Regression under human assistance. In AAAI, 2020.
• Dosovitskiy et al. [2017] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. CARLA: An open urban driving simulator. arXiv preprint arXiv:1711.03938, 2017.
• European Parliament [2006] European Parliament. Regulation (EC) No 561/2006. http://data.europa.eu/eli/reg/2006/561/2015-03-02, 2006.
• Everett and Roberts [2018] R. Everett and S. Roberts. Learning against non-stationary agents with opponent modelling and deep reinforcement learning. In 2018 AAAI Spring Symposium Series, 2018.
• Geifman and El-Yaniv [2019] Y. Geifman and R. El-Yaniv. Selectivenet: A deep neural network with an integrated reject option. arXiv preprint arXiv:1901.09192, 2019.
• Geifman et al. [2018] Y. Geifman, G. Uziel, and R. El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. In ICLR, 2018.
• Ghosh et al. [2020] A. Ghosh, S. Tschiatschek, H. Mahdavi, and A. Singla. Towards deployment of robust cooperative ai agents: An algorithmic framework for learning adaptive policies. In AAMAS, 2020.
• Grover et al. [2018] A. Grover, M. Al-Shedivat, J. Gupta, Y. Burda, and H. Edwards. Learning policy representations in multiagent systems. In ICML, 2018.
• Hadfield-Menell et al. [2016] D. Hadfield-Menell, S. Russell, P. Abbeel, and A. Dragan. Cooperative inverse reinforcement learning. In NIPS, 2016.
• Haug et al. [2018] L. Haug, S. Tschiatschek, and A. Singla. Teaching inverse reinforcement learners via features and demonstrations. In NeurIPS, 2018.
• Jaksch et al. [2010] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 2010.
• Kamalaruban et al. [2019] P. Kamalaruban, R. Devidze, V. Cevher, and A. Singla. Interactive teaching algorithms for inverse reinforcement learning. In IJCAI, 2019.
• Liu et al. [2019] Z. Liu, Z. Wang, P. Liang, R. Salakhutdinov, L. Morency, and M. Ueda. Deep gamblers: Learning to abstain with portfolio theory. In NeurIPS, 2019.
• Macadam [2003] C. Macadam. Understanding and modeling the human driver. Vehicle system dynamics, 40(1-3):101–134, 2003.
• Macindoe et al. [2012] O. Macindoe, L. Kaelbling, and T. Lozano-Pérez. POMCoP: Belief space planning for sidekicks in cooperative games. In AIIDE, 2012.
• Mnih et al. [2015] V. Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
• Nikolaidis et al. [2015] S. Nikolaidis, R. Ramakrishnan, K. Gu, and J. Shah. Efficient model learning from joint-action demonstrations for human-robot collaborative tasks. In HRI, 2015.
• Nikolaidis et al. [2017] S. Nikolaidis, J. Forlizzi, D. Hsu, J. Shah, and S. Srinivasa. Mathematical models of adaptation in human-robot collaboration. arXiv preprint arXiv:1707.02586, 2017.
• Osband et al. [2013] I. Osband, D. Russo, and B. Van Roy. (More) efficient reinforcement learning via posterior sampling. In NIPS, 2013.
• Radanovic et al. [2019] G. Radanovic, R. Devidze, D. C. Parkes, and A. Singla. Learning to collaborate in Markov decision processes. In ICML, 2019.
• Raghu et al. [2019a] M. Raghu, K. Blumer, G. Corrado, J. Kleinberg, Z. Obermeyer, and S. Mullainathan. The algorithmic automation problem: Prediction, triage, and human effort. arXiv preprint arXiv:1903.12220, 2019a.
• Raghu et al. [2019b] M. Raghu, K. Blumer, R. Sayres, Z. Obermeyer, B. Kleinberg, S. Mullainathan, and J. Kleinberg. Direct uncertainty prediction for medical second opinions. In ICML, 2019b.
• Ramaswamy et al. [2018] H. Ramaswamy, A. Tewari, and S. Agarwal. Consistent algorithms for multiclass classification with an abstain option. Electronic J. of Statistics, 2018.
• Silver et al. [2016] D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
• Silver et al. [2017] D. Silver et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
• Strehl and Littman [2008] A. Strehl and M. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
• Talpaert et al. [2019] V. Talpaert et al. Exploring applications of deep reinforcement learning for real-world autonomous driving systems. arXiv preprint arXiv:1901.01536, 2019.
• Thulasidasan et al. [2019] S. Thulasidasan, T. Bhattacharya, J. Bilmes, G. Chennupati, and J. Mohd-Yusof. Combating label noise in deep learning using abstention. arXiv preprint arXiv:1905.10964, 2019.
• Tschiatschek et al. [2019] S. Tschiatschek, A. Ghosh, L. Haug, R. Devidze, and A. Singla. Learner-aware teaching: Inverse reinforcement learning with preferences and constraints. In NeurIPS, 2019.
• Vinyals et al. [2019] O. Vinyals et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, pages 1–5, 2019.
• Wilson and Daugherty [2018] H. Wilson and P. Daugherty. Collaborative intelligence: humans and ai are joining forces. Harvard Business Review, 2018.
• Wolfe et al. [2019] B. Wolfe, B. Seppelt, B. Mehler, B. Reimer, and R. Rosenholtz. Rapid holistic perception and evasion of road hazards. Journal of experimental psychology: general, 2019.
• Wymann et al. [2000] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner. TORCS, the open racing car simulator. Software available at http://torcs.sourceforge.net, 4(6), 2000.
• Zheng et al. [2018] Y. Zheng, Z. Meng, J. Hao, Z. Zhang, T. Yang, and C. Fan. A deep bayesian policy reuse approach against non-stationary agents. In NeurIPS, 2018.

## Appendix A Proofs

### A.1 Bellman’s principle of optimality

First, we bound the optimal value function v_{t}(s,d) from below as follows:

\begin{aligned}
v_{t}(s,d) &= \min_{\pi_{t},\ldots,\pi_{T-1}}\EE\left[\sum_{t'=t}^{T-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t}=s,\,d_{t-1}=d\right]\\
&= \min_{\pi_{t}(\cdot\,|\,s,d)}\,c_{\pi_{t}}(s,d)+\min_{\pi_{t+1},\ldots,\pi_{T-1}}\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d')}\left[\EE\left[\sum_{t'=t+1}^{T-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t+1}=s',\,d_{t}=d'\right]\right]\\
&\overset{(i)}{\geq} \min_{\pi_{t}(\cdot\,|\,s,d)}\,c_{\pi_{t}}(s,d)+\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d')}\left[\min_{\pi_{t+1},\ldots,\pi_{T-1}}\EE\left[\sum_{t'=t+1}^{T-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t+1}=s',\,d_{t}=d'\right]\right]\\
&= \min_{\pi_{t}(\cdot\,|\,s,d)}\,c_{\pi_{t}}(s,d)+\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d')}\left[v_{t+1}(s',d')\right],
\end{aligned}

where (i) readily follows from the fact that \min_{a}\EE[X(a)]\geq\EE[\min_{a}X(a)].

Then, we bound the optimal value function v_{t}(s,d) from above as follows:

\begin{aligned}
v_{t}(s,d) &= \min_{\pi_{t},\ldots,\pi_{T-1}}\EE\left[\sum_{t'=t}^{T-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t}=s,\,d_{t-1}=d\right]\\
&= \min_{\pi_{t}(\cdot\,|\,s,d)}\,c_{\pi_{t}}(s,d)+\min_{\pi_{t+1},\ldots,\pi_{T-1}}\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d')}\left[\EE_{\{(s_{t'},d_{t'-1})\}_{t'=t+1}^{T-1}\sim p_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}}}\left[\sum_{t'=t+1}^{T-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t+1}=s',\,d_{t}=d'\right]\right]\\
&\overset{(i)}{\leq} \min_{\pi_{t}(\cdot\,|\,s,d)}\,c_{\pi_{t}}(s,d)+\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d')}\left[\EE_{\{(s_{t'},d_{t'-1})\}_{t'=t+1}^{T-1}\sim p_{\pi^{*}|p_{\mathbb{H}},p_{\mathbb{M}}}}\left[\sum_{t'=t+1}^{T-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t+1}=s',\,d_{t}=d'\right]\right]\\
&\overset{(ii)}{=} \min_{\pi_{t}(\cdot\,|\,s,d)}\,c_{\pi_{t}}(s,d)+\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d')}\left[v_{t+1}(s',d')\right],
\end{aligned}

where (i) follows from the fact that

\begin{aligned}
&\min_{\pi_{t+1},\ldots,\pi_{T-1}}\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d')}\left[\EE_{\{(s_{t'},d_{t'-1})\}_{t'=t+1}^{T-1}\sim p_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}}}\left[\sum_{t'=t+1}^{T-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t+1}=s',\,d_{t}=d'\right]\right]\\
&\quad\leq \EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d')}\left[\EE_{\{(s_{t'},d_{t'-1})\}_{t'=t+1}^{T-1}\sim p_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}}}\left[\sum_{t'=t+1}^{T-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t+1}=s',\,d_{t}=d'\right]\right]\quad\forall\,\pi_{t+1},\ldots,\pi_{T-1},
\end{aligned}

and, if we set \pi_{t+1}=\pi^{*}_{t+1},\ldots,\pi_{T-1}=\pi^{*}_{T-1}, where

\pi^{*}_{t+1},\ldots,\pi^{*}_{T-1}=\argmin_{\pi_{t+1},\ldots,\pi_{T-1}}\EE_{\{(s_{t'},d_{t'-1})\}_{t'=t+1}^{T-1}\sim p_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}}}\left[\sum_{t'=t+1}^{T-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t+1}=s',\,d_{t}=d'\right],

then equality (ii) also holds. Since the upper and lower bound are the same, we can conclude that the optimal value function satisfies Eq. 5.
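The exchange step (i) above, \min_{a}\EE[X(a)]\geq\EE[\min_{a}X(a)], can be made concrete with a minimal example of our own (not the paper's): two actions and a fair coin \omega\in\{\omega_{1},\omega_{2}\}, with X(a_{1})=(0,1) and X(a_{2})=(1,0) across the two outcomes. Then

\min_{a}\EE[X(a)]=\min\left\{\tfrac{1}{2},\tfrac{1}{2}\right\}=\tfrac{1}{2}\quad\text{while}\quad\EE\left[\min_{a}X(a)\right]=\tfrac{1}{2}\cdot 0+\tfrac{1}{2}\cdot 0=0,

so the inequality holds strictly here: committing to a single action before the randomness is revealed can only do worse than choosing after the fact.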

### A.2 Proof of Proposition 3

By definition, we have that:

\begin{aligned}
c_{\pi_{t}}(s,d) &= \EE_{d'\sim\pi_{t}(\cdot\,|\,s,d)}\left[\bar{c}(s,d')+\lambda_{1}\II(d'=\mathbb{H})+\lambda_{2}\II(d\neq d')\right]\\
&= \pi_{t}(d'=\mathbb{M}\,|\,s,d)\cdot\left[\bar{c}(s,\mathbb{M})+\lambda_{1}\cdot 0+\lambda_{2}\II(d\neq\mathbb{M})\right]+\left(1-\pi_{t}(d'=\mathbb{M}\,|\,s,d)\right)\cdot\left[\bar{c}(s,\mathbb{H})+\lambda_{1}+\lambda_{2}\II(d\neq\mathbb{H})\right]. \qquad (17)
\end{aligned}

Moreover,

\begin{aligned}
\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d')}[v_{t+1}(s',d')] &= \pi_{t}(d'=\mathbb{M}\,|\,s,d)\cdot\EE_{s'\sim p(\cdot\,|\,s,\mathbb{M})}[v_{t+1}(s',\mathbb{M})]\\
&\quad+(1-\pi_{t}(d'=\mathbb{M}\,|\,s,d))\cdot\EE_{s'\sim p(\cdot\,|\,s,\mathbb{H})}[v_{t+1}(s',\mathbb{H})]\\
&\overset{(i)}{=} \pi_{t}(d'=\mathbb{M}\,|\,s,d)\cdot\EE_{a\sim p_{\mathbb{M}}(\cdot\,|\,s),\,s'\sim p(\cdot\,|\,s,a)}[v_{t+1}(s',\mathbb{M})]\\
&\quad+(1-\pi_{t}(d'=\mathbb{M}\,|\,s,d))\cdot\EE_{a\sim p_{\mathbb{H}}(\cdot\,|\,s),\,s'\sim p(\cdot\,|\,s,a)}[v_{t+1}(s',\mathbb{H})], \qquad (18)
\end{aligned}

where (i) follows from the fact that

\EE_{s'\sim p(\cdot\,|\,s,\mathbb{M})}[\bullet]=\EE_{a\sim p_{\mathbb{M}}(\cdot\,|\,s)}\left[\EE_{s'\sim p(\cdot\,|\,s,a)}[\bullet]\right]\quad\text{and}\quad\EE_{s'\sim p(\cdot\,|\,s,\mathbb{H})}[\bullet]=\EE_{a\sim p_{\mathbb{H}}(\cdot\,|\,s)}\left[\EE_{s'\sim p(\cdot\,|\,s,a)}[\bullet]\right].

Then, if we sum up Eq. 17 and Eq. 18, we have that

\begin{aligned}
c_{\pi_{t}}(s,d)+\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d')}[v_{t+1}(s',d')] &= \pi_{t}(d'=\mathbb{M}\,|\,s,d)\cdot\left[\bar{c}(s,\mathbb{M})+\lambda_{2}\II(d\neq\mathbb{M})+\EE_{a\sim p_{\mathbb{M}}(\cdot\,|\,s),\,s'\sim p(\cdot\,|\,s,a)}[v_{t+1}(s',\mathbb{M})]\right]\\
&\quad+\left(1-\pi_{t}(d'=\mathbb{M}\,|\,s,d)\right)\cdot\left[\bar{c}(s,\mathbb{H})+\lambda_{1}+\lambda_{2}\II(d\neq\mathbb{H})+\EE_{a\sim p_{\mathbb{H}}(\cdot\,|\,s),\,s'\sim p(\cdot\,|\,s,a)}[v_{t+1}(s',\mathbb{H})]\right] \qquad (19)\\
&= \pi_{t}(d'=\mathbb{M}\,|\,s,d)\cdot q_{t\,|\,\mathbb{M}}(s,d)+\left(1-\pi_{t}(d'=\mathbb{M}\,|\,s,d)\right)\cdot q_{t\,|\,\mathbb{H}}(s,d). \qquad (20)
\end{aligned}

Finally, it is clear that the above quantity is minimized when

\pi_{t}(d'=\mathbb{M}\,|\,s,d)=\begin{cases}1 & \text{if }q_{t\,|\,\mathbb{M}}(s,d)<q_{t\,|\,\mathbb{H}}(s,d)\\ 0 & \text{otherwise.}\end{cases} \qquad (21)

This concludes the proof.
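The threshold rule above makes Algorithm 1 concrete: with the human policy known, the optimal switching policy follows from a single backward pass that compares q_{t|\mathbb{M}} and q_{t|\mathbb{H}} at every state-controller pair. The sketch below is our own minimal tabular illustration; the array names and layout are assumptions, not the paper's code.

```python
import numpy as np

def optimal_switching_policy(cost, p_M, p_H, p_trans, T, lam1, lam2):
    """Backward induction for the switching MDP when p_H is known (a sketch).

    cost[s, a]           : environment cost c'(s, a)
    p_M[s, a], p_H[s, a] : machine and human action distributions at s
    p_trans[s, a, s']    : transition probabilities p(s' | s, a)
    lam1, lam2           : penalties on human control and on switching
    Returns pi[t, s, d] (1 = give control to the machine) and v[t, s, d].
    """
    S = cost.shape[0]
    M, H = 0, 1                        # controller indices d
    v = np.zeros((T + 1, S, 2))        # terminal value v[T] = 0
    pi = np.zeros((T, S, 2), dtype=int)
    for t in range(T - 1, -1, -1):
        for s in range(S):
            # expected cost-to-go of handing the next step to each controller
            core_M = p_M[s] @ (cost[s] + p_trans[s] @ v[t + 1, :, M])
            core_H = p_H[s] @ (cost[s] + p_trans[s] @ v[t + 1, :, H])
            for d in (M, H):
                q_M = core_M + lam2 * (d != M)
                q_H = core_H + lam1 + lam2 * (d != H)
                pi[t, s, d] = int(q_M < q_H)   # threshold rule of Prop. 3
                v[t, s, d] = min(q_M, q_H)
    return pi, v
```

Setting lam1 large forces machine control everywhere, while lam2 > 0 makes the returned policy depend on the current controller d, penalizing back-and-forth switching.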

### A.3 Proof of Theorem 4

For any episode k, we can show that the optimal value function v_{t}^{k}(s,d) satisfies Bellman's principle of optimality (see the Lemma at the end of this section), i.e.,

v^{k}_{t}(s,d)=\min_{\pi_{t}(\cdot\,|\,s,d)}\;\min_{p_{\mathbb{H}}(\cdot\,|\,s,t)\in\Pcal^{k}_{\mathbb{H}\,|\,s,t}}\;c_{\pi_{t}}(s,d)+\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d',t)}\left[v^{k}_{t+1}(s',d')\right].

Moreover, proceeding similarly to Eqs. 17 and 18 in the proof of Proposition 3 (Appendix A.2), we have that:

c_{\pi_{t}}(s,d)=\pi_{t}(d'=\mathbb{M}\,|\,s,d)\cdot\left[\bar{c}(s,\mathbb{M})+\lambda_{1}\cdot 0+\lambda_{2}\II(d\neq\mathbb{M})\right]+\left(1-\pi_{t}(d'=\mathbb{M}\,|\,s,d)\right)\cdot\left[\bar{c}(s,\mathbb{H})+\lambda_{1}+\lambda_{2}\II(d\neq\mathbb{H})\right] \qquad (22)

and

\begin{aligned}
\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d',t)}[v^{k}_{t+1}(s',d')] &= \pi_{t}(d'=\mathbb{M}\,|\,s,d)\cdot\EE_{a\sim p_{\mathbb{M}}(\cdot\,|\,s),\,s'\sim p(\cdot\,|\,s,a)}[v^{k}_{t+1}(s',\mathbb{M})]\\
&\quad+(1-\pi_{t}(d'=\mathbb{M}\,|\,s,d))\cdot\EE_{a\sim p_{\mathbb{H}}(\cdot\,|\,s,t),\,s'\sim p(\cdot\,|\,s,a)}[v^{k}_{t+1}(s',\mathbb{H})], \qquad (23)
\end{aligned}

where we note that the human policy p_{\mathbb{H}} may depend on the time t. Then, summing Eqs. 22 and 23, we have that

c_{\pi_{t}}(s,d)+\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d',t)}[v^{k}_{t+1}(s',d')]=\pi_{t}(d'=\mathbb{M}\,|\,s,d)\cdot q^{k}_{t\,|\,\mathbb{M}}(s,d)+\left(1-\pi_{t}(d'=\mathbb{M}\,|\,s,d)\right)\cdot q^{k}_{t\,|\,\mathbb{H}}(s,d).

Hence, it follows that

\begin{aligned}
&\min_{p_{\mathbb{H}}(\cdot\,|\,s,t)\in\Pcal^{k}_{\mathbb{H}\,|\,s,t}}c_{\pi_{t}}(s,d)+\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d',t)}\left[v^{k}_{t+1}(s',d')\right]\\
&\quad=\pi_{t}(d'=\mathbb{M}\,|\,s,d)\cdot q^{k}_{t\,|\,\mathbb{M}}(s,d)+\left(1-\pi_{t}(d'=\mathbb{M}\,|\,s,d)\right)\cdot\min_{p_{\mathbb{H}}(\cdot\,|\,s,t)\in\Pcal^{k}_{\mathbb{H}\,|\,s,t}}q^{k}_{t\,|\,\mathbb{H}}(s,d).
\end{aligned}

Finally, it is clear that the above quantity is minimized when

\pi_{t}(d'=\mathbb{M}\,|\,s,d)=\begin{cases}1 & \text{if }q^{k}_{t\,|\,\mathbb{M}}(s,d)<\min_{p_{\mathbb{H}}(\cdot\,|\,s,t)\in\Pcal^{k}_{\mathbb{H}\,|\,s,t}}q^{k}_{t\,|\,\mathbb{H}}(s,d)\\ 0 & \text{otherwise,}\end{cases} \qquad (24)

and thus

v^{k}_{t}(s,d)=\min\left(q^{k}_{t\,|\,\mathbb{M}}(s,d),\;\min_{p_{\mathbb{H}}(\cdot\,|\,s,t)\in\Pcal^{k}_{\mathbb{H}\,|\,s,t}}q^{k}_{t\,|\,\mathbb{H}}(s,d)\right).

This concludes the proof.

Lemma (Bellman optimality principle for unknown human policy). For any episode k, the optimal value function v_{t}^{k}(s,d), as defined in Eq. 13, satisfies the following recursive equation:

v^{k}_{t}(s,d)=\min_{\pi_{t}(\cdot\,|\,s,d)}\;\min_{p_{\mathbb{H}}(\cdot\,|\,s,t)\in\Pcal^{k}_{\mathbb{H}\,|\,s,t}}\;c_{\pi_{t}}(s,d)+\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d',t)}\left[v^{k}_{t+1}(s',d')\right]. \qquad (25)
###### Proof.

Define \Pcal_{\mathbb{H}|\cdot,t^{+}}^{k}:=\times_{s\in\Scal,\,t'\in\{t,\ldots,L-1\}}\Pcal_{\mathbb{H}|s,t'}^{k}. Then, we proceed similarly as in Appendix A.1. First, we bound the optimal value function v^{k}_{t}(s,d) from below as follows:

\begin{aligned}
v^{k}_{t}(s,d) &= \min_{\pi_{t},\ldots,\pi_{L-1}}\min_{p_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}}}\EE\left[\sum_{t'=t}^{L-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t}=s,\,d_{t-1}=d\right]\\
&= \min_{\pi_{t}(\cdot\,|\,s,d)}\,\min_{p_{\mathbb{H}}(\cdot\,|\,s,t)\in\Pcal^{k}_{\mathbb{H}\,|\,s,t}}\,c_{\pi_{t}}(s,d)+\min_{\pi_{t+1},\ldots,\pi_{L-1}}\min_{p_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}|\cdot,(t+1)^{+}}}\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d',t)}\left[\EE\left[\sum_{t'=t+1}^{L-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t+1}=s',\,d_{t}=d'\right]\right]\\
&\overset{(i)}{\geq} \min_{\pi_{t}(\cdot\,|\,s,d)}\,\min_{p_{\mathbb{H}}(\cdot\,|\,s,t)\in\Pcal^{k}_{\mathbb{H}\,|\,s,t}}\,c_{\pi_{t}}(s,d)+\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d',t)}\left[\min_{\pi_{t+1},\ldots,\pi_{L-1}}\min_{p_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}|\cdot,(t+1)^{+}}}\EE\left[\sum_{t'=t+1}^{L-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t+1}=s',\,d_{t}=d'\right]\right]\\
&= \min_{\pi_{t}(\cdot\,|\,s,d)}\,\min_{p_{\mathbb{H}}(\cdot\,|\,s,t)\in\Pcal^{k}_{\mathbb{H}\,|\,s,t}}c_{\pi_{t}}(s,d)+\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d',t)}\left[v^{k}_{t+1}(s',d')\right],
\end{aligned}

where (i) readily follows from the fact that \min_{a}\EE[X(a)]\geq\EE[\min_{a}X(a)].

Next, we bound the optimal value function v^{k}_{t}(s,d) from above as follows:

\begin{aligned}
v^{k}_{t}(s,d) &= \min_{\pi_{t},\ldots,\pi_{L-1}}\min_{p_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}}}\EE\left[\sum_{t'=t}^{L-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t}=s,\,d_{t-1}=d\right]\\
&= \min_{\pi_{t}(\cdot\,|\,s,d)}\,\min_{p_{\mathbb{H}}(\cdot\,|\,s,t)\in\Pcal^{k}_{\mathbb{H}\,|\,s,t}}c_{\pi_{t}}(s,d)+\min_{\pi_{t+1},\ldots,\pi_{L-1}}\min_{p_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}|\cdot,(t+1)^{+}}}\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d',t)}\left[\EE_{\{(s_{t'},d_{t'-1})\}_{t'=t+1}^{L-1}\sim p_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}}}\left[\sum_{t'=t+1}^{L-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t+1}=s',\,d_{t}=d'\right]\right]\\
&\overset{(i)}{\leq} \min_{\pi_{t}(\cdot\,|\,s,d)}\,\min_{p_{\mathbb{H}}(\cdot\,|\,s,t)\in\Pcal^{k}_{\mathbb{H}\,|\,s,t}}c_{\pi_{t}}(s,d)+\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d',t)}\left[\EE_{\{(s_{t'},d_{t'-1})\}_{t'=t+1}^{L-1}\sim p_{\pi^{*}|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}}\left[\sum_{t'=t+1}^{L-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t+1}=s',\,d_{t}=d'\right]\right]\\
&\overset{(ii)}{=} \min_{\pi_{t}(\cdot\,|\,s,d)}\,\min_{p_{\mathbb{H}}(\cdot\,|\,s,t)\in\Pcal^{k}_{\mathbb{H}\,|\,s,t}}c_{\pi_{t}}(s,d)+\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d',t)}\left[v^{k}_{t+1}(s',d')\right],
\end{aligned}

where (i) follows from the fact that

\begin{aligned}
&\min_{\pi_{t+1},\ldots,\pi_{L-1}}\min_{p_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}|\cdot,(t+1)^{+}}}\EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d',t)}\left[\EE_{\{(s_{t'},d_{t'-1})\}_{t'=t+1}^{L-1}\sim p_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}}}\left[\sum_{t'=t+1}^{L-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t+1}=s',\,d_{t}=d'\right]\right]\\
&\quad\leq \EE_{d'\sim\pi_{t}(\cdot\,|\,s,d),\,s'\sim p(\cdot\,|\,s,d',t)}\left[\EE_{\{(s_{t'},d_{t'-1})\}_{t'=t+1}^{L-1}\sim p_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}}}\left[\sum_{t'=t+1}^{L-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t+1}=s',\,d_{t}=d'\right]\right]\quad\forall\,\pi_{t+1},\ldots,\pi_{L-1},\;p_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}|\cdot,(t+1)^{+}},
\end{aligned}

and, if we set \pi_{t+1}=\pi^{*}_{t+1},\ldots,\pi_{L-1}=\pi^{*}_{L-1} and p_{\mathbb{H}}=p^{*}_{\mathbb{H}}, where

\pi^{*},p^{*}_{\mathbb{H}}=\argmin_{\pi_{t+1},\ldots,\pi_{L-1},\;p_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}|\cdot,(t+1)^{+}}}\EE_{\{(s_{t'},d_{t'-1})\}_{t'=t+1}^{L-1}\sim p_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}}}\left[\sum_{t'=t+1}^{L-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t+1}=s',\,d_{t}=d'\right], \qquad (26)

then equality (ii) holds. Since the upper and lower bounds are the same, we can conclude that the optimal value function satisfies Eq. 25.

### A.4 Proof of Lemma 4

Suppose there exists \{x'_{i}\,;\;\sum_{i}x'_{i}=1,\,x'_{i}\geq 0\} such that \sum_{i}x'_{i}w_{i}<\sum_{i}x^{*}_{i}w_{i}. Let j\in\{1,\dots,m\} be the first index where x'_{j}\neq x^{*}_{j}; then it is clear that x'_{j}>x^{*}_{j}.
If j=1:

\sum_{i=1}^{m}|x'_{i}-b_{i}|=|x'_{1}-b_{1}|+\sum_{i=2}^{m}|x'_{i}-b_{i}|>\frac{d}{2}+\sum_{i=2}^{m}(b_{i}-x'_{i})=\frac{d}{2}+x'_{1}-b_{1}>d \qquad (27)

If j>1:

\sum_{i=1}^{m}|x'_{i}-b_{i}|\geq|x'_{1}-b_{1}|+\sum_{i=j}^{m}|x'_{i}-b_{i}|>\frac{d}{2}+\sum_{i=j+1}^{m}(b_{i}-x'_{i})>\frac{d}{2}+x'_{1}-b_{1}=d \qquad (28)

Both cases contradict the condition \sum_{i=1}^{m}|x^{\prime}_{i}-b_{i}|\leq d.
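The minimizer x^{*} that the lemma characterizes, as we read it, shifts d/2 extra mass onto the smallest-weight coordinate of b and removes it from the largest-weight coordinates. The sketch below constructs that candidate and numerically checks its optimality against random feasible distributions; all names and the concrete numbers are ours, for illustration only.

```python
import numpy as np

def l1_ball_minimizer(b, w, d):
    """Candidate minimizer of sum_i x_i w_i over distributions x with
    ||x - b||_1 <= d: add d/2 mass at the cheapest coordinate, remove it
    from the most expensive ones."""
    x = b.copy()
    order = np.argsort(w)
    j = order[0]                          # smallest-weight coordinate
    x[j] = min(1.0, b[j] + d / 2.0)       # add up to d/2 mass there
    excess = x.sum() - 1.0
    for i in order[::-1]:                 # remove mass from largest weights
        if excess <= 0:
            break
        if i != j:
            take = min(x[i], excess)
            x[i] -= take
            excess -= take
    return x

# numerical sanity check of the lemma: no feasible x beats x*
rng = np.random.default_rng(0)
b = np.array([0.5, 0.3, 0.2])
w = np.array([1.0, 0.5, 0.1])
d = 0.2
x_star = l1_ball_minimizer(b, w, d)
best = x_star @ w
for _ in range(2000):
    p = rng.dirichlet(np.ones(3))         # random point on the simplex
    if np.abs(p - b).sum() <= d:          # feasible: inside the L1 ball
        assert p @ w >= best - 1e-9
```

This is the same inner optimization used when evaluating \min_{p_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}|s,t}}q^{k}_{t|\mathbb{H}}(s,d) in Theorem 4, with w playing the role of the per-action costs-to-go.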

### A.5 Proof of Theorem 4

###### Proof.

Throughout the proof, we will assume that c^{\prime}(s,a)+\lambda_{1}+\lambda_{2}\leq 1 for all s\in\Scal and a\in\Acal, and we will use the following notation:

• (i) c^{\prime}(s,a) denotes the environment cost of action a at state s;

• (ii) c_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}} denotes the immediate cost under switching policy \pi and human policy p_{\mathbb{H}}, as defined in Eq. 2;

• (iii) p_{|p_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,d,t)=\sum_{a\in\Acal}p_{d}(a\,|\,s,t)\,p(\cdot\,|\,s,a) denotes the transition probability under the human policy p_{\mathbb{H}}(a\,|\,s,t) and the machine policy p_{\mathbb{M}}(a\,|\,s,t)=p_{\mathbb{M}}(a\,|\,s);

• (iv) p^{*}_{\mathbb{H}} denotes the true human policy;

• (v) \pi^{*} denotes the optimal switching policy, as defined in Eq. 3;

• (vi) v^{k}_{t}(s,d) denotes the optimal value function, as defined in Eq. 13;

• (vii) \pi^{k} and p^{k}_{\mathbb{H}} denote the switching policy and human policy that minimize the optimal value function v^{k}_{t}(s,d);

• (viii) \vbar_{t}^{k}(s,d) denotes the value function under the policy \pi^{k} and the true human policy p^{*}_{\mathbb{H}}, i.e.,

\begin{aligned}
\vbar_{t}^{k}(s,d) &= \EE_{\tau\sim P_{\pi^{k}|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}}\left[\sum_{t'=t}^{L-1}c_{\pi^{k}_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{t}=s,\,d_{t-1}=d\right]\\
&= c_{\pi^{k}|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(s,d)+\EE_{d'\sim\pi^{k}_{t}(\cdot\,|\,s,d),\,s'\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,d')}\left[\vbar_{t+1}^{k}(s',d')\right]; \qquad (29)
\end{aligned}

• (ix) \Delta_{k} denotes the regret in episode k, i.e.,

\Delta_{k}=\EE_{\tau\sim P_{\pi^{k}|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}}\left[c(\tau\,|\,s_{0},d_{-1})\right]-\EE_{\tau\sim P_{\pi^{*}|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}}\left[c(\tau\,|\,s_{0},d_{-1})\right]=\vbar_{0}^{k}(s_{0},d_{-1})-v_{0}(s_{0},d_{-1}), \qquad (30)

where v_{0}(s,d) is defined in Eq. 4.

First, we note that

R(T)=\sum_{k=1}^{K}\Delta_{k}=\sum_{k=1}^{K}\Delta_{k}\,\II(p^{*}_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}})+\sum_{k=1}^{K}\Delta_{k}\,\II(p^{*}_{\mathbb{H}}\not\in\Pcal^{k}_{\mathbb{H}}). \qquad (31)

Next, we split our analysis into two parts. We first bound the first term \sum_{k=1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}}) and then bound the second term \sum_{k=1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\not\in\Pcal^{k}_{\mathbb{H}}).

Computing the bound on \sum_{k=1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}})

First, we note that

\Delta_{k}=\vbar_{0}^{k}(s_{0},d_{-1})-v_{0}(s_{0},d_{-1})\leq\vbar_{0}^{k}(s_{0},d_{-1})-v_{0}^{k}(s_{0},d_{-1}). \qquad (32)

This is because

\begin{aligned}
v_{0}^{k}(s_{0},d_{-1}) &= \min_{\pi}\min_{p_{\mathbb{H}}\in\Pcal_{\mathbb{H}}^{k}}\EE_{\{(s_{t'},d_{t'-1})\}_{t'=0}^{L-1}\sim p_{\pi|p_{\mathbb{H}},p_{\mathbb{M}}}}\left[\sum_{t'=0}^{L-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{0},d_{-1}\right]\\
&\overset{(i)}{\leq} \min_{\pi}\EE_{\{(s_{t'},d_{t'-1})\}_{t'=0}^{L-1}\sim p_{\pi|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}}\left[\sum_{t'=0}^{L-1}c_{\pi_{t'}}(s_{t'},d_{t'-1})\,\Big|\,s_{0},d_{-1}\right]=v_{0}(s_{0},d_{-1}), \qquad (33)
\end{aligned}

where (i) holds because the true human policy p^{*}_{\mathbb{H}}\in\Pcal_{\mathbb{H}}^{k}. Now, we aim to bound \vbar_{0}^{k}(s_{0},d_{-1})-v_{0}^{k}(s_{0},d_{-1}). To this end, we first note that

\begin{aligned}
\vbar_{0}^{k}(s,d)-v_{0}^{k}(s,d) &= c_{\pi^{k}|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(s,d)+\EE_{d'\sim\pi_{0}^{k}(\cdot\,|\,s,d),\,s'\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,d')}\left[\vbar_{1}^{k}(s',d')\right]\\
&\quad-c_{\pi^{k}|p^{k}_{\mathbb{H}},p_{\mathbb{M}}}(s,d)-\EE_{d'\sim\pi^{k}_{0}(\cdot\,|\,s,d),\,s'\sim p_{|p^{k}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,d',t=0)}\left[v^{k}_{1}(s',d')\right]\\
&\overset{(i)}{=}\pi^{k}_{0}(d'=\mathbb{M}\,|\,s,d)\left[\EE_{a\sim p_{\mathbb{M}}(\cdot\,|\,s)}\left[c^{\prime}(s,a)\right]+\lambda_{2}\II(d\neq\mathbb{M})\right]\\
&\quad+(1-\pi^{k}_{0}(d'=\mathbb{M}\,|\,s,d))\left[\EE_{a\sim p^{*}_{\mathbb{H}}(\cdot\,|\,s)}\left[c^{\prime}(s,a)\right]+\lambda_{1}+\lambda_{2}\II(d\neq\mathbb{H})\right]\\
&\quad+\EE_{d'\sim\pi_{0}^{k}(\cdot\,|\,s,d),\,s'\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,d')}\left[\vbar_{1}^{k}(s',d')\right]\\
&\quad-\pi^{k}_{0}(d'=\mathbb{M}\,|\,s,d)\left[\EE_{a\sim p_{\mathbb{M}}(\cdot\,|\,s)}\left[c^{\prime}(s,a)\right]+\lambda_{2}\II(d\neq\mathbb{M})\right]\\
&\quad-(1-\pi^{k}_{0}(d'=\mathbb{M}\,|\,s,d))\left[\EE_{a\sim p^{k}_{\mathbb{H}}(\cdot\,|\,s,t=0)}\left[c^{\prime}(s,a)\right]+\lambda_{1}+\lambda_{2}\II(d\neq\mathbb{H})\right]\\
&\quad-\EE_{d'\sim\pi^{k}_{0}(\cdot\,|\,s,d),\,s'\sim p_{|p^{k}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,d',t=0)}\left[v^{k}_{1}(s',d')\right]\\
&=(1-\pi^{k}_{0}(d'=\mathbb{M}\,|\,s,d))\left[\EE_{a\sim p^{*}_{\mathbb{H}}(\cdot\,|\,s)}\left[c^{\prime}(s,a)\right]-\EE_{a\sim p^{k}_{\mathbb{H}}(\cdot\,|\,s,t=0)}\left[c^{\prime}(s,a)\right]\right]\\
&\quad+\EE_{d'\sim\pi_{0}^{k}(\cdot\,|\,s,d),\,s'\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,d')}\left[\vbar_{1}^{k}(s',d')\right]-\EE_{d'\sim\pi^{k}_{0}(\cdot\,|\,s,d),\,s'\sim p_{|p^{k}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,d',t=0)}\left[v^{k}_{1}(s',d')\right]\\
&\overset{(ii)}{=}(1-\pi^{k}_{0}(d'=\mathbb{M}\,|\,s,d))\left[\EE_{a\sim p^{*}_{\mathbb{H}}(\cdot\,|\,s)}\left[c^{\prime}(s,a)\right]-\EE_{a\sim p^{k}_{\mathbb{H}}(\cdot\,|\,s,t=0)}\left[c^{\prime}(s,a)\right]\right]\\
&\quad+\pi^{k}_{0}(d'=\mathbb{M}\,|\,s,d)\left[\EE_{s'\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,\mathbb{M})}\left[\vbar_{1}^{k}(s',\mathbb{M})\right]-\EE_{s'\sim p_{|p^{k}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,\mathbb{M},t=0)}\left[v^{k}_{1}(s',\mathbb{M})\right]\right]\\
&\quad+(1-\pi^{k}_{0}(d'=\mathbb{M}\,|\,s,d))\left[\EE_{s'\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,\mathbb{H})}\left[\vbar_{1}^{k}(s',\mathbb{H})\right]-\EE_{s'\sim p_{|p^{k}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,\mathbb{H},t=0)}\left[v^{k}_{1}(s',\mathbb{H})\right]\right], \qquad (34)
\end{aligned}

where (i) follows from the definition of c_{\pi^{k}|p^{*}_{\mathbb{H}},p_{\mathbb{M}}} and c_{\pi^{k}|p^{k}_{\mathbb{H}},p_{\mathbb{M}}}, and (ii) follows from applying conditional expectation. Next, we note that p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,\mathbb{M})=p_{|p^{k}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,\mathbb{M},t=0), because the machine policy does not depend on the human policy. Hence, \EE_{s'\sim p_{|p^{k}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,\mathbb{M},t=0)}\left[v^{k}_{1}(s',\mathbb{M})\right]=\EE_{s'\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,\mathbb{M})}\left[v^{k}_{1}(s',\mathbb{M})\right] and, by adding and subtracting (1-\pi^{k}_{0}(d'=\mathbb{M}\,|\,s,d))\,\EE_{s'\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,\mathbb{H})}\left[v_{1}^{k}(s',\mathbb{H})\right] in Eq. 34, it follows that:

\begin{aligned}
\vbar_{0}^{k}(s,d)-v_{0}^{k}(s,d) &\overset{(i)}{=}(1-\pi^{k}_{0}(d'=\mathbb{M}\,|\,s,d))\Big[\EE_{a\sim p^{*}_{\mathbb{H}}(\cdot\,|\,s)}\left[c^{\prime}(s,a)\right]-\EE_{a\sim p^{k}_{\mathbb{H}}(\cdot\,|\,s,t=0)}\left[c^{\prime}(s,a)\right]\\
&\qquad+\EE_{s'\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,\mathbb{H})}\left[v_{1}^{k}(s',\mathbb{H})\right]-\EE_{s'\sim p_{|p^{k}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,\mathbb{H},t=0)}\left[v^{k}_{1}(s',\mathbb{H})\right]\Big]\\
&\quad+\EE_{d'\sim\pi^{k}_{0}(\cdot\,|\,s,d),\,s'\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,d')}\left[\vbar_{1}^{k}(s',d')-v^{k}_{1}(s',d')\right]\\
&\overset{(ii)}{=}\EE_{d'\sim\pi^{k}_{0}(\cdot\,|\,s,d),\,s'\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,d')}\Big[\vbar_{1}^{k}(s',d')-v^{k}_{1}(s',d')\\
&\qquad+\II(d'=\mathbb{H})\Big(\EE_{a\sim p^{*}_{\mathbb{H}}(\cdot\,|\,s)}\left[c^{\prime}(s,a)\right]-\EE_{a\sim p^{k}_{\mathbb{H}}(\cdot\,|\,s,t=0)}\left[c^{\prime}(s,a)\right]+\EE_{s'\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,\mathbb{H})}\left[v_{1}^{k}(s',\mathbb{H})\right]-\EE_{s'\sim p_{|p^{k}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot\,|\,s,\mathbb{H},t=0)}\left[v^{k}_{1}(s',\mathbb{H})\right]\Big)\Big] \qquad (35)
\end{aligned}

where (i) follows from the fact that

 \displaystyle\EE_{d^{\prime}\sim\pi^{k}_{0}(\cdot{\,|\,}s,d),\,s^{\prime}\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot{\,|\,}s,d^{\prime})}\left[\vbar_{1}^{k}(s^{\prime},d^{\prime})-v^{k}_{1}(s^{\prime},d^{\prime})\right]
 \displaystyle\quad=\pi^{k}_{0}(d^{\prime}=\mathbb{M}{\,|\,}s,d)\left[\EE_{s^{\prime}\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot{\,|\,}s,\mathbb{M})}\left[\vbar_{1}^{k}(s^{\prime},\mathbb{M})\right]-\EE_{s^{\prime}\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot{\,|\,}s,\mathbb{M})}\left[v^{k}_{1}(s^{\prime},\mathbb{M})\right]\right]
 \displaystyle\quad\quad+(1-\pi^{k}_{0}(d^{\prime}=\mathbb{M}{\,|\,}s,d))\left[\EE_{s^{\prime}\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot{\,|\,}s,\mathbb{H})}\left[\vbar_{1}^{k}(s^{\prime},\mathbb{H})\right]-\EE_{s^{\prime}\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot{\,|\,}s,\mathbb{H})}\left[v_{1}^{k}(s^{\prime},\mathbb{H})\right]\right]

and (ii) follows from the fact that

 \displaystyle\EE_{d^{\prime}\sim\pi^{k}_{0}(\cdot{\,|\,}s,d)}\left[\II(d^{\prime}=\mathbb{H})\right]=\text{Pr}(d^{\prime}=\mathbb{H}{\,|\,}s,d)=1-\pi^{k}_{0}(d^{\prime}=\mathbb{M}{\,|\,}s,d). (36)

Now, we can bound the term

 \displaystyle\EE_{a\sim p^{*}_{\mathbb{H}}(\cdot{\,|\,}s)}\left[c^{\prime}(s,a)\right]-\EE_{a\sim p^{k}_{\mathbb{H}}(\cdot{\,|\,}s,t=0)}\left[c^{\prime}(s,a)\right]+\EE_{s^{\prime}\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot{\,|\,}s,\mathbb{H})}\left[v_{1}^{k}(s^{\prime},\mathbb{H})\right]-\EE_{s^{\prime}\sim p_{|p^{k}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot{\,|\,}s,\mathbb{H},t=0)}\left[v^{k}_{1}(s^{\prime},\mathbb{H})\right]

as follows:

 \displaystyle\EE_{a\sim p^{*}_{\mathbb{H}}(\cdot{\,|\,}s)}\left[c^{\prime}(s,a)\right]-\EE_{a\sim p^{k}_{\mathbb{H}}(\cdot{\,|\,}s,t=0)}\left[c^{\prime}(s,a)\right]+\EE_{s^{\prime}\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot{\,|\,}s,\mathbb{H})}\left[v_{1}^{k}(s^{\prime},\mathbb{H})\right]-\EE_{s^{\prime}\sim p_{|p^{k}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot{\,|\,}s,\mathbb{H},t=0)}\left[v^{k}_{1}(s^{\prime},\mathbb{H})\right]
 \displaystyle=\EE_{a\sim p^{*}_{\mathbb{H}}(\cdot{\,|\,}s)}\left[c^{\prime}(s,a)+\EE_{s^{\prime}\sim p(\cdot{\,|\,}s,a)}\left[v_{1}^{k}(s^{\prime},\mathbb{H})\right]\right]-\EE_{a\sim p^{k}_{\mathbb{H}}(\cdot{\,|\,}s,t=0)}\left[c^{\prime}(s,a)+\EE_{s^{\prime}\sim p(\cdot{\,|\,}s,a)}\left[v^{k}_{1}(s^{\prime},\mathbb{H})\right]\right]
 \displaystyle\overset{(i)}{=}\sum_{a\in\Acal}\left(p^{*}_{\mathbb{H}}(a{\,|\,}s)-p^{k}_{\mathbb{H}}(a{\,|\,}s,t=0)\right)\left[\underbrace{c^{\prime}(s,a)+\EE_{s^{\prime}\sim p(\cdot{\,|\,}s,a)}\left[v_{1}^{k}(s^{\prime},\mathbb{H})\right]}_{\leq L}\right] (37)
 \displaystyle\overset{(ii)}{\leq}\min\Big\{L,\sum_{a\in\Acal}L\,|p^{*}_{\mathbb{H}}(a{\,|\,}s)-p^{k}_{\mathbb{H}}(a{\,|\,}s,t=0)|\Big\}=L\min\{1,\beta_{k}(s,\delta)\}, (38)

where (i) follows from rewriting both expectations as a single sum over actions, using that c^{\prime}(s,a)+\EE_{s^{\prime}\sim p(\cdot{\,|\,}s,a)}\left[v_{1}^{k}(s^{\prime},\mathbb{H})\right]\leq L since, by assumption, c^{\prime}(s,a)+\lambda_{1}+\lambda_{2}\leq 1 for all s\in\Scal and a\in\Acal, and (ii) follows from the fact that, by assumption, both p^{*}_{\mathbb{H}} and p^{k}_{\mathbb{H}} lie in the confidence set \Pcal^{k}_{\mathbb{H}}. Then, plugging Eq. 38 into Eq. 35, we have that, for all s\in\Scal and d\in\{\mathbb{H},\mathbb{M}\}, it holds that

 \displaystyle\vbar_{0}^{k}(s,d)-v_{0}^{k}(s,d) \displaystyle\leq\EE_{d^{\prime}\sim\pi^{k}_{0}(\cdot{\,|\,}s,d)}\left[\II(d^{\prime}=\mathbb{H})\right](L\min\{1,\beta_{k}(s,\delta)\}) \displaystyle\quad+\EE_{d^{\prime}\sim\pi^{k}_{0}(\cdot{\,|\,}s,d),\,s^{\prime}\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot{\,|\,}s,d^{\prime})}\left[\vbar_{1}^{k}(s^{\prime},d^{\prime})-v^{k}_{1}(s^{\prime},d^{\prime})\right]. (39)
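As a quick numerical illustration of the two steps in Eqs. 37 and 38, the following sketch checks the inequality on random toy quantities. All names here (`p_star`, `p_k`, `q`, the horizon `L`) are arbitrary illustrative stand-ins, not objects from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
L, n_actions = 5, 4

p_star = rng.dirichlet(np.ones(n_actions))  # stand-in for p*_H(.|s)
p_k = rng.dirichlet(np.ones(n_actions))     # stand-in for p^k_H(.|s, t=0)
q = rng.uniform(0, L, size=n_actions)       # c'(s,a) + E[v_1^k(s', H)], each <= L

lhs = float(np.dot(p_star - p_k, q))        # step (i): a single sum over actions
l1 = float(np.abs(p_star - p_k).sum())      # plays the role of beta_k(s, delta)
rhs = L * min(1.0, l1)                      # step (ii): Holder plus the trivial bound
assert lhs <= rhs + 1e-12
```

The inequality holds for any pair of distributions, since the bracketed term is bounded by L and the difference of expectations is bounded by the l1 distance times that same constant.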

Similarly, we can show that, for all s\in\Scal and d\in\{\mathbb{H},\mathbb{M}\}, it holds that

 \displaystyle\vbar_{1}^{k}(s,d)-v_{1}^{k}(s,d) \displaystyle\leq\EE_{d^{\prime}\sim\pi^{k}_{1}(\cdot{\,|\,}s,d)}\left[\II(d^{\prime}=\mathbb{H})\right](L\min\{1,\beta_{k}(s,\delta)\}) \displaystyle\quad+\EE_{d^{\prime}\sim\pi^{k}_{1}(\cdot{\,|\,}s,d),\,s^{\prime}\sim p_{|p^{*}_{\mathbb{H}},p_{\mathbb{M}}}(\cdot{\,|\,}s,d^{\prime})}\left[\vbar_{2}^{k}(s^{\prime},d^{\prime})-v^{k}_{2}(s^{\prime},d^{\prime})\right]. (40)

Hence, we can show by induction that:

 \displaystyle\vbar_{0}^{k}(s_{0},d_{-1})-v_{0}^{k}(s_{0},d_{-1}) \displaystyle\leq L\EE\left[\sum_{t=0}^{L-1}\II(d_{t}=\mathbb{H})\min\{1,\beta_{k}(s_{t},\delta)\}{\,|\,}s_{0},d_{-1}\right], (41)

where the expectation is taken over the MDP with switching policy \pi^{k} under true human policy p^{*}_{\mathbb{H}}.

As one may expect, we only incur regret when the optimistic switching policy \pi^{k} chooses the human (\ie, d_{t}=\mathbb{H}), and observing more human actions makes \beta_{k}(s,\delta) smaller. Hence, when p^{*}_{\mathbb{H}}\in\Pcal^{k}_{\mathbb{H}}, we can bound the total regret as follows:

 \displaystyle\sum_{k=1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\in\Pcal_{\mathbb{H}}^{k})\leq\sum_{k=1}^{K}L\EE\left[\sum_{t=0}^{L-1}\II(d_{t}=\mathbb{H})\min\{1,\beta_{k}(s_{t},\delta)\}{\,|\,}d_{-1},\,s_{0}\right]. (42)
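The confidence width driving this bound can be sketched directly; a minimal illustration where the state/action counts, t_{k} and \delta are arbitrary choices:

```python
import math

def beta_k(n_visits, n_states, n_actions, t_k, delta):
    # beta_k(s, delta) = sqrt(14 |A| log(|S| t_k / delta) / max{1, N_k(s)})
    return math.sqrt(14 * n_actions * math.log(n_states * t_k / delta)
                     / max(1, n_visits))

# Observing more human actions at a state shrinks the width, and with it
# the per-step regret term L * min{1, beta_k(s, delta)} in Eq. 42.
widths = [beta_k(n, 10, 4, 1000, 0.1) for n in (0, 1, 10, 100, 1000)]
assert widths == sorted(widths, reverse=True)
```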

Finally, since c^{\prime}(\cdot,\cdot)+\lambda_{1}+\lambda_{2}<1, each per-episode regret satisfies \Delta_{k}\leq L and thus the worst-case total regret is bounded by KL=T. Therefore, we have that:

 \displaystyle\sum_{k=1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\in\Pcal_{\mathbb{H}}^{k})\leq\min\left\{T,\sum_{k=1}^{K}L\EE\left[\sum_{t=0}^{L-1}\II(d_{t}=\mathbb{H})\min\{1,\beta_{k}(s_{t},\delta)\}{\,|\,}s_{0},d_{-1}\right]\right\}\overset{(i)}{\leq}12L\sqrt{|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)}, (43)

where (i) follows from Lemma A.5, which is given at the end of this proof.

Computing the bound on \sum_{k=1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\not\in\Pcal^{k}_{\mathbb{H}})

Here, we use a similar approach to Jaksch et al. (2010). First, we note that

 \displaystyle\sum_{k=1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\not\in\Pcal_{\mathbb{H}}^{k})=\sum_{k=1}^{\lfloor\frac{\sqrt{T}}{L}\rfloor}\Delta_{k}\II(p^{*}_{\mathbb{H}}\not\in\Pcal_{\mathbb{H}}^{k})+\sum_{k=\lfloor\frac{\sqrt{T}}{L}\rfloor+1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\not\in\Pcal_{\mathbb{H}}^{k}). (44)

Next, our goal is to show that the second term on the RHS of the above equation vanishes with high probability. If we succeed, then, with high probability, \sum_{k=1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\not\in\Pcal_{\mathbb{H}}^{k}) equals the first term on the RHS, and we will be done because

 \displaystyle\sum_{k=1}^{\lfloor\frac{\sqrt{T}}{L}\rfloor}\Delta_{k}\II(p^{*}_{\mathbb{H}}\not\in\Pcal_{\mathbb{H}}^{k})\leq\sum_{k=1}^{\lfloor\frac{\sqrt{T}}{L}\rfloor}\Delta_{k}\overset{(i)}{\leq}\left\lfloor\frac{\sqrt{T}}{L}\right\rfloor L\leq\sqrt{T}, (45)

where (i) follows from the fact that \Delta_{k}\leq L since c^{\prime}(s,a)+\lambda_{1}+\lambda_{2}\leq 1 for all s\in\Scal and a\in\Acal.

To prove that \sum_{k=\lfloor\frac{\sqrt{T}}{L}\rfloor+1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\not\in\Pcal_{\mathbb{H}}^{k})=0 with high probability, we proceed as follows. From Lemma A.6, which is given at the end of this proof, we have

 \displaystyle\text{Pr}(p^{*}_{\mathbb{H}}\not\in\Pcal_{\mathbb{H}}^{k})\leq\frac{\delta}{t_{k}^{6}}, (46)

where t_{k}=(k-1)L is the start time of episode k. Therefore, it follows that

 \displaystyle\text{Pr}\left(\sum_{k=\lfloor\frac{\sqrt{T}}{L}\rfloor+1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\not\in\Pcal_{\mathbb{H}}^{k})=0\right) \displaystyle=\text{Pr}\left(\forall k:\left\lfloor\frac{\sqrt{T}}{L}\right\rfloor+1\leq k\leq K=\frac{T}{L};\;p^{*}_{\mathbb{H}}\in\Pcal_{\mathbb{H}}^{k}\right) (47)
 \displaystyle=1-\text{Pr}\left(\exists k:\left\lfloor\frac{\sqrt{T}}{L}\right\rfloor+1\leq k\leq\frac{T}{L};\;p^{*}_{\mathbb{H}}\not\in\Pcal_{\mathbb{H}}^{k}\right) (48)
 \displaystyle\overset{(i)}{\geq}1-\sum_{k=\lfloor\frac{\sqrt{T}}{L}\rfloor+1}^{\frac{T}{L}}\text{Pr}(p^{*}_{\mathbb{H}}\not\in\Pcal_{\mathbb{H}}^{k}) (49)
 \displaystyle\overset{(ii)}{\geq}1-\sum_{k=\lfloor\frac{\sqrt{T}}{L}\rfloor+1}^{\frac{T}{L}}\frac{\delta}{t_{k}^{6}} (50)
 \displaystyle\overset{(iii)}{=}1-\sum_{t_{k}=\sqrt{T}}^{T}\frac{\delta}{t_{k}^{6}}\geq 1-\int_{\sqrt{T}}^{T}\frac{\delta}{t^{6}}\,dt\geq 1-\frac{\delta}{5T^{5/4}}, (51)

where (i) follows from a union bound, (ii) follows from Eq. 46 and (iii) holds using that t_{k}=(k-1)L. Hence, with probability at least 1-\frac{\delta}{5T^{5/4}} we have that

 \displaystyle\sum_{k=\lfloor\frac{\sqrt{T}}{L}\rfloor+1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\not\in\Pcal_{\mathbb{H}}^{k})=0. (52)

If we combine the above equation and Eq. 45, we can conclude that, with probability at least 1-\frac{\delta}{5T^{5/4}}, we have that

 \displaystyle\sum_{k=1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\not\in\Pcal^{k}_{\mathbb{H}})\leq\sqrt{T}. (53)
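Stepping back, the series-to-integral comparison that yields the 1-\frac{\delta}{5T^{5/4}} guarantee in Eq. 51 admits a quick numerical check; a sketch where the values of T and \delta are arbitrary:

```python
import math

def tail_mass(T, delta):
    # The sum in step (iii) of Eq. 51, starting at t = ceil(sqrt(T)).
    start = math.ceil(math.sqrt(T))
    return sum(delta / t**6 for t in range(start, T + 1))

# The exact sum never exceeds the closed-form bound delta / (5 T^{5/4}).
ok = all(tail_mass(T, 0.1) <= 0.1 / (5 * T**1.25) for T in (4, 16, 100, 1000))
assert ok
```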

Next, if we combine Eqs. 43 and 53, we have that

 \displaystyle R(T)=\sum_{k=1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\in\Pcal_{\mathbb{H}}^{k})+\sum_{k=1}^{K}\Delta_{k}\II(p^{*}_{\mathbb{H}}\not\in\Pcal_{\mathbb{H}}^{k}) \displaystyle<12L\sqrt{|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)}+\sqrt{T} (54) \displaystyle<13L\sqrt{|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)}. (55)

Finally, since \sum_{T=1}^{\infty}\frac{\delta}{5T^{5/4}}\leq\delta, a union bound over all values of T gives that, with probability at least 1-\delta, we have R(T)<13L\sqrt{|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)}. This concludes the proof. ∎
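The last step relies on \sum_{T\geq 1}\frac{\delta}{5T^{5/4}}\leq\delta; a quick numerical check using a partial sum plus an integral tail bound, with \delta normalized to 1 (the cutoff N is an arbitrary choice):

```python
# Partial sum of 1/(5 T^(5/4)) plus the integral tail bound
# int_N^inf t^(-5/4) dt / 5 = 4 / (5 N^(1/4)); the total stays below 1,
# so multiplying by delta keeps the whole series below delta.
N = 100_000
partial = sum(1.0 / (5 * T**1.25) for T in range(1, N + 1))
tail = 4.0 / (5 * N**0.25)
assert partial + tail < 1.0
```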

###### Lemma A.5.

It holds that

 \min\left\{T,\sum_{k=1}^{K}L\cdot\EE\left[\sum_{t=0}^{L-1}\II(d_{t}=\mathbb{H})\min\{1,\beta_{k}(s_{t},\delta)\}{\,|\,}d_{-1},\,s_{0}\right]\right\}\leq 12L\sqrt{|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)}.
###### Proof.

The proof is adapted from Osband et al. (2013). We first note that, since \min\{1,\beta_{k}(s_{t},\delta)\}\leq\II(N_{k}(s_{t})\leq L)+\II(N_{k}(s_{t})>L)\beta_{k}(s_{t},\delta), it holds that

 \displaystyle\sum_{k=1}^{K}L\cdot\EE\left[\sum_{t=0}^{L-1}\II(d_{t}=\mathbb{H})\min\{1,\beta_{k}(s_{t},\delta)\}{\,|\,}d_{-1},\,s_{0}\right] \displaystyle\leq L\cdot\EE\left[\sum_{k=1}^{K}\sum_{t=0}^{L-1}\II(d_{t}=\mathbb{H})\II(N_{k}(s_{t})\leq L){\,|\,}d_{-1},\,s_{0}\right] \displaystyle\quad+L\cdot\EE\left[\sum_{k=1}^{K}\sum_{t=0}^{L-1}\II(d_{t}=\mathbb{H})\II(N_{k}(s_{t})>L)\beta_{k}(s_{t},\delta){\,|\,}d_{-1},\,s_{0}\right]. (56)

Then, we bound the first term of the above equation:

 \displaystyle L\cdot\EE\left[\sum_{k=1}^{K}\sum_{t=0}^{L-1}\II(d_{t}=\mathbb{H})\II(N_{k}(s_{t})\leq L){\,|\,}d_{-1},\,s_{0}\right] \displaystyle=L\cdot\EE\left[\sum_{s\in\Scal}\{\text{\# of times $s$ is visited while $d=\mathbb{H}$ and $N_{k}(s)\leq L$}\}{\,|\,}d_{-1},\,s_{0}\right] \displaystyle\leq L\cdot\EE\left[|\Scal|\cdot 2L\right]=2L^{2}|\Scal|. (57)

To bound the second term, we first define n_{\tau}(s) as the number of times s has been visited under human control in the first \tau steps across episodes, \ie, if we are at the t^{\text{th}} time step of episode k, then \tau=t_{k}+t, where t_{k}=(k-1)L, and note that

 \displaystyle n_{t_{k}+t}(s)\leq N_{k}(s)+t (58)

because we will visit state s with human control at most t\in\{0,\ldots,L-1\} times within episode k. Now, if N_{k}(s)>L, we have that

 \displaystyle n_{t_{k}+t}(s)+1\leq N_{k}(s)+t+1\leq N_{k}(s)+L\leq 2N_{k}(s). (59)

Hence we have,

 \displaystyle\II(d_{t}=\mathbb{H})\II(N_{k}(s_{t})>L)(n_{t_{k}+t}(s_{t})+1)\leq 2N_{k}(s_{t})\implies\frac{\II(d_{t}=\mathbb{H})\II(N_{k}(s_{t})>L)}{N_{k}(s_{t})}\leq\frac{2}{n_{t_{k}+t}(s_{t})+1} (60)

Then, using the above equation, we can bound the second term of the decomposition above:

 \displaystyle L\cdot\EE\left[\sum_{k=1}^{K}\sum_{t=0}^{L-1}\II(d_{t}=\mathbb{H})\II(N_{k}(s_{t})>L)\beta_{k}(s_{t},\delta){\,|\,}d_{-1},\,s_{0}\right] \displaystyle\overset{(i)}{=}L\cdot\EE\left[\sum_{k=1}^{K}\sum_{t=0}^{L-1}\sqrt{\frac{\II(d_{t}=\mathbb{H})\II(N_{k}(s_{t})>L)14|\Acal|\log\left(\frac{|\Scal|t_{k}}{\delta}\right)}{\max\{1,N_{k}(s_{t})\}}}\right]
 \displaystyle\overset{(ii)}{\leq}L\cdot\EE\left[\sum_{k=1}^{K}\sum_{t=0}^{L-1}\sqrt{\frac{28|\Acal|\log\left(\frac{|\Scal|t_{k}}{\delta}\right)}{n_{t_{k}+t}(s_{t})+1}}\right] \displaystyle\overset{(iii)}{\leq}L\cdot\sqrt{28|\Acal|\log\left(\frac{|\Scal|T}{\delta}\right)}\,\EE\left[\sum_{k=1}^{K}\sum_{t=0}^{L-1}\sqrt{\frac{1}{n_{t_{k}+t}(s_{t})+1}}\right], (61)

where (i) follows from the definition of \beta_{k}(s_{t},\delta), (ii) follows from Eq. 60, and (iii) follows from the fact that

 \sqrt{28|\Acal|\log\left(\frac{|\Scal|t_{k}}{\delta}\right)}\leq\sqrt{28|\Acal% |\log\left(\frac{|\Scal|T}{\delta}\right)},

using that t_{k}\leq T. Next, we can further bound \EE\left[\sum_{k=1}^{K}\sum_{t=0}^{L-1}\sqrt{\frac{1}{n_{t_{k}+t}(s_{t})+1}}\right] as follows:

 \displaystyle\EE\left[\sum_{k=1}^{K}\sum_{t=0}^{L-1}\sqrt{\frac{1}{n_{t_{k}+t}(s_{t})+1}}\right]=\EE\left[\sum_{\tau=0}^{T}\sqrt{\frac{1}{n_{\tau}(s_{\tau})+1}}\right] \displaystyle\overset{(i)}{=}\EE\left[\sum_{s\in\Scal}\sum_{\nu=0}^{N_{T+1}(s)}\sqrt{\frac{1}{\nu+1}}\right]=\sum_{s\in\Scal}\EE\left[\sum_{\nu=0}^{N_{T+1}(s)}\sqrt{\frac{1}{\nu+1}}\right] \displaystyle\overset{(ii)}{\leq}\sum_{s\in\Scal}\EE\left[\int_{0}^{N_{T+1}(s)}\sqrt{\frac{1}{x}}\,dx\right]\leq\sum_{s\in\Scal}\EE\left[2\sqrt{N_{T+1}(s)}\right] \displaystyle\leq\EE\left[2\sqrt{|\Scal|\sum_{s\in\Scal}N_{T+1}(s)}\right] \displaystyle\overset{(iii)}{=}\EE\left[2\sqrt{|\Scal|T}\right]=2\sqrt{|\Scal|T}, (62)

where (i) follows from summing over states instead of time and from the fact that we visit each state s exactly N_{T+1}(s) times after K episodes, (ii) follows from bounding the sum by an integral, the penultimate inequality follows from the Cauchy-Schwarz inequality, and (iii) follows from the fact that \sum_{s\in\Scal}N_{T+1}(s)=T. Next, we combine Eqs. 61 and 62 to obtain

 \displaystyle L\cdot\EE\left[\sum_{k=1}^{K}\sum_{t=0}^{L-1}\II(d_{t}=\mathbb{H})\II(N_{k}(s_{t})>L)\beta_{k}(s_{t},\delta){\,|\,}d_{-1},\,s_{0}\right] \displaystyle\leq L\sqrt{28|\Acal|\log\left(\frac{|\Scal|T}{\delta}\right)}\times 2\sqrt{|\Scal|T}=\sqrt{112}L\sqrt{|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)}. (63)
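The sum-to-integral comparison in step (ii) of Eq. 62 can also be checked numerically; a short sketch over arbitrary values of N_{T+1}(s):

```python
import math

# sum_{v=0}^{N} 1/sqrt(v+1) <= int_0^N x^(-1/2) dx = 2 sqrt(N), for N >= 1.
ok = all(
    sum(1.0 / math.sqrt(v + 1) for v in range(N + 1)) <= 2 * math.sqrt(N)
    for N in (1, 10, 100, 10_000)
)
assert ok
```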

Further, plugging Eqs. 57 and 63 into the decomposition above, we obtain

 \displaystyle L\cdot\EE\left[\sum_{k=1}^{K}\sum_{t=0}^{L-1}\II(d_{t}=\mathbb{H})\min\{1,\beta_{k}(s_{t},\delta)\}{\,|\,}d_{-1},\,s_{0}\right]\leq 2L^{2}|\Scal|+\sqrt{112}L\sqrt{|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)}. (64)

Thus,

 \displaystyle\min\left\{T,L\cdot\EE\left[\sum_{k=1}^{K}\sum_{t=0}^{L-1}\II(d_{t}=\mathbb{H})\min\{1,\beta_{k}(s_{t},\delta)\}{\,|\,}d_{-1},\,s_{0}\right]\right\} \displaystyle\leq\min\left\{T,2L^{2}|\Scal|+\sqrt{112}L\sqrt{|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)}\right\}. (65)

Moreover, if T\leq 2L^{2}|\Scal||\Acal|\log\left(\frac{|\Scal|T}{\delta}\right),

 \displaystyle T^{2}\leq 2L^{2}|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)\implies T\leq\sqrt{2}L\sqrt{|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)},

and if T>2L^{2}|\Scal||\Acal|\log\left(\frac{|\Scal|T}{\delta}\right),

 \displaystyle 2L^{2}|\Scal|<\frac{\sqrt{2L^{2}|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)}}{|\Acal|\log\left(\frac{|\Scal|T}{\delta}\right)}\leq\sqrt{2}L\sqrt{|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)}. (66)

Thus, the minimum in Eq. 65 is less than

 \displaystyle(\sqrt{2}+\sqrt{112})L\sqrt{|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)}<12L\sqrt{|\Scal||\Acal|T\log\left(\frac{|\Scal|T}{\delta}\right)}. (67)

This concludes the proof. ∎
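The numerical constants and the case analysis in Eqs. 65-67 can be sanity-checked in a few lines; the parameter tuples below are arbitrary choices:

```python
import math

# sqrt(2) + sqrt(112) is approximately 11.997, hence strictly below 12.
const = math.sqrt(2) + math.sqrt(112)
assert const < 12

def rhs(L, S, A, T, delta=0.1):
    # L * sqrt(|S| |A| T log(|S| T / delta)), the common factor in Eqs. 65-67.
    return L * math.sqrt(S * A * T * math.log(S * T / delta))

# The case analysis: min{T, 2 L^2 |S| + sqrt(112) * rhs} <= 12 * rhs.
ok = all(
    min(T, 2 * L * L * S + math.sqrt(112) * rhs(L, S, A, T)) <= 12 * rhs(L, S, A, T)
    for (L, S, A, T) in [(2, 3, 2, 50), (5, 10, 4, 100), (5, 10, 4, 10_000)]
)
assert ok
```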

###### Lemma A.6.

For each episode k>1, the true human policy p^{*}_{\mathbb{H}} lies in the confidence set \Pcal_{\mathbb{H}}^{k} with probability at least 1-\frac{\delta}{t_{k}^{6}}, where t_{k}=(k-1)L is the start time of episode k.

###### Proof.

We adapt the proof of Lemma 17 in Jaksch et al. (2010). We note that

 \displaystyle\text{Pr}(p^{*}_{\mathbb{H}}\not\in\Pcal^{k}_{\mathbb{H}}) \displaystyle\overset{(i)}{=}\text{Pr}\left(\bigcup_{s\in\Scal}\left\lVert p^{*}_{\mathbb{H}}(\cdot{\,|\,}s)-\hat{p}^{k}_{\mathbb{H}}(\cdot{\,|\,}s)\right\rVert_{1}\geq\beta_{k}(s,\delta)\right) (68)
 \displaystyle\overset{(ii)}{\leq}\sum_{s\in\Scal}\text{Pr}\left(\left\lVert p^{*}_{\mathbb{H}}(\cdot{\,|\,}s)-\hat{p}^{k}_{\mathbb{H}}(\cdot{\,|\,}s)\right\rVert_{1}\geq\sqrt{\frac{14|\Acal|\log\left(\frac{|\Scal|t_{k}}{\delta}\right)}{\max\{1,N_{k}(s)\}}}\right) (69)
 \displaystyle\overset{(iii)}{\leq}\sum_{s\in\Scal}\sum_{n=0}^{t_{k}}\text{Pr}\left(\left\lVert p^{*}_{\mathbb{H}}(\cdot{\,|\,}s)-\hat{p}^{k}_{\mathbb{H}}(\cdot{\,|\,}s)\right\rVert_{1}\geq\sqrt{\frac{14|\Acal|\log\left(\frac{|\Scal|t_{k}}{\delta}\right)}{\max\{1,n\}}}\right), (70)

where (i) follows from the definition of the confidence set, \ie, the true human policy does not lie in the confidence set if there is at least one state s for which \left\lVert p^{*}_{\mathbb{H}}(\cdot{\,|\,}s)-\hat{p}^{k}_{\mathbb{H}}(\cdot{\,|\,}s)\right\rVert_{1}\geq\beta_{k}(s,\delta), (ii) follows from the definition of \beta_{k}(s,\delta) and a union bound over all s\in\Scal and (iii) follows from a union bound over all possible values of N_{k}(s). Now, recall that, for N_{k}(s)=0, we defined the empirical distribution as uniform, \ie, \hat{p}_{\mathbb{H}}^{k}(a{\,|\,}s)=\frac{1}{|\Acal|} for all a\in\Acal. So, we split the sum into the cases n=0 and n>0:

 \displaystyle\sum_{s\in\Scal}\sum_{n=0}^{t_{k}}\text{Pr}\left(\left\lVert p^{*}_{\mathbb{H}}(\cdot{\,|\,}s)-\hat{p}^{k}_{\mathbb{H}}(\cdot{\,|\,}s)\right\rVert_{1}\geq\sqrt{\frac{14|\Acal|\log\left(\frac{|\Scal|t_{k}}{\delta}\right)}{\max\{1,n\}}}\right) \displaystyle=\sum_{s\in\Scal}\sum_{n=1}^{t_{k}}\text{Pr}\left(\left\lVert p^{*}_{\mathbb{H}}(\cdot{\,|\,}s)-\hat{p}^{k}_{\mathbb{H}}(\cdot{\,|\,}s)\right\rVert_{1}\geq\sqrt{\frac{14|\Acal|\log\left(\frac{|\Scal|t_{k}}{\delta}\right)}{n}}\right)+\overbrace{\sum_{s\in\Scal}\text{Pr}\left(\left\lVert p^{*}_{\mathbb{H}}(\cdot{\,|\,}s)-\frac{1}{|\Acal|}\right\rVert_{1}\geq\sqrt{14|\Acal|\log\left(\frac{|\Scal|t_{k}}{\delta}\right)}\right)}^{\text{if }n=0}
 \displaystyle\overset{(i)}{=}\sum_{s\in\Scal}\sum_{n=1}^{t_{k}}\text{Pr}\left(\left\lVert p^{*}_{\mathbb{H}}(\cdot{\,|\,}s)-\hat{p}^{k}_{\mathbb{H}}(\cdot{\,|\,}s)\right\rVert_{1}\geq\sqrt{\frac{14|\Acal|\log\left(\frac{|\Scal|t_{k}}{\delta}\right)}{n}}\right)+0 (71)
 \displaystyle\overset{(ii)}{\leq}t_{k}|\Scal|2^{|\Acal|}\exp\left(-7|\Acal|\log\left(\frac{|\Scal|t_{k}}{\delta}\right)\right)\leq\frac{\delta}{t_{k}^{6}}, (72)

where (i) follows from the fact that \left\lVert p^{*}_{\mathbb{H}}(\cdot{\,|\,}s)-\frac{1}{|\Acal|}\right\rVert_{1}<\sqrt{14|\Acal|\log\left(\frac{|\Scal|t_{k}}{\delta}\right)} for any non-trivial MDP. More specifically,

 \displaystyle|\Acal|\geq 1,\,|\Scal|\geq 2,\,\delta<1,\,t_{k}>1\implies\sqrt{14|\Acal|\log\left(\frac{|\Scal|t_{k}}{\delta}\right)}>\sqrt{14\log(2)}>2, \displaystyle\quad\left\lVert p^{*}_{\mathbb{H}}(\cdot{\,|\,}s)-\frac{1}{|\Acal|}\right\rVert_{1}\leq\sum_{a\in\Acal}\left(p^{*}_{\mathbb{H}}(a{\,|\,}s)+\frac{1}{|\Acal|}\right)\leq 2, (73)

and (ii) follows from the fact that, after observing n samples, the L^{1}-deviation of the true distribution p^{*} from the empirical one \hat{p} over m events is bounded by

 \displaystyle\text{Pr}\left(\left\lVert p^{*}(\cdot)-\hat{p}(\cdot)\right\rVert_{1}\geq\epsilon\right)\leq 2^{m}\exp\left(-n\frac{\epsilon^{2}}{2}\right). (74)

This concludes the proof. ∎
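Both facts above lend themselves to a quick numerical check; a sketch in which |\Scal|, |\Acal|, \delta, t_{k} and the example distribution are arbitrary choices:

```python
import math

# Fact (i), cf. Eq. 73: even with the log argument as small as 2, the radius
# sqrt(14 log 2) is about 3.11 > 2, while the l1 distance of any distribution
# to the uniform one is at most 2 (example distribution over 5 actions).
radius = math.sqrt(14 * math.log(2))
p = [0.5, 0.2, 0.15, 0.1, 0.05]
u = 1.0 / len(p)
dist = sum(abs(x - u) for x in p)
assert radius > 2 and dist <= 2

# Fact (ii), cf. Eq. 72: summing the bound of Eq. 74 with
# epsilon = sqrt(14 |A| log(|S| t_k / delta) / n) over all states and all
# n <= t_k stays below delta / t_k^6.
S, A, delta, t_k = 4, 3, 0.1, 100
total = 0.0
for n in range(1, t_k + 1):
    eps_sq = 14 * A * math.log(S * t_k / delta) / n
    total += S * (2 ** A) * math.exp(-n * eps_sq / 2)
assert total <= delta / t_k**6
```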
