###### Abstract

Task offloading is an emerging technology in fog-enabled networks. It allows users to transmit tasks to neighboring fog nodes so as to utilize the computing resources of the network. In this paper, we investigate a stochastic task offloading model and propose a multi-armed bandit framework to formulate this model. We consider that different helper nodes prefer different kinds of tasks and feed back one-bit information, named the happiness of nodes, to the task node. The key challenge of this problem is an exploration-exploitation tradeoff. Thus we implement a UCB-type algorithm to maximize the long-term happiness metric. Furthermore, we prove that this UCB-type algorithm is asymptotically optimal. Numerical simulations are given at the end of the paper to corroborate our strategy.

Shangshu Zhao, Zhaowei Zhu, Fuqian Yang, and Xiliang Luo
ShanghaiTech University, Shanghai, China
Email: luoxl@shanghaitech.edu.cn

Index Terms—  Online learning.

## 1 Introduction

With the rapid evolution of the Internet of Things (IoT), 5G wireless systems, and embedded artificial intelligence in recent years, ever more data processing capability is required of mobile devices [1]. To exploit all the available computational resources, fog computing (or mobile edge computing) has been considered a potential solution to enable computation-intensive and latency-critical applications on battery-powered mobile devices [2].

The above works all assumed perfect knowledge of the system parameters, e.g., dedicated models relating latency, energy consumption, and computational resources. However, the models and system parameters in practice may be too complicated for an individual user to characterize. For example, the communication delay and the computation delay were modeled as bandit feedback in [10], which was only revealed for the nodes that were queried. Without assuming dedicated system models or perfect knowledge of parameters, the tradeoff between learning the system and pursuing the empirically best offloading strategy was investigated under the bandit model in [11]. The exploration versus exploitation tradeoff in [11] was addressed within the framework of the multi-armed bandit (MAB), which has been extensively studied in statistics [12, 13, 14].

The rest of this paper is organized as follows. Section 2 introduces the system model and assumptions. Section 3 formulates the problem, and our algorithm with the related guarantees is introduced in Section 4. We then present and analyze numerical results in Section 5 and conclude in Section 6.

Notations: $\mathbf{A}^\top$, $|\mathcal{A}|$, $\|\mathbf{x}\|_{\mathbf{A}}$, $\Pr[\cdot]$, and $\mathbb{E}[\cdot]$ stand for the transpose of matrix $\mathbf{A}$, the cardinality of the set $\mathcal{A}$, the norm of vector $\mathbf{x}$ defined as $\|\mathbf{x}\|_{\mathbf{A}} = \sqrt{\mathbf{x}^\top \mathbf{A} \mathbf{x}}$ for a positive definite matrix $\mathbf{A}$, the probability of an event, and the expectation of a random variable, respectively. The indicator function $\mathds{1}\{\cdot\}$ takes the value of $1$ ($0$) when the specified condition is met (otherwise).

## 2 System Model

We consider the task offloading problem in a network including $K+1$ fog nodes, i.e. one task node and $K$ helper nodes. See Fig. 1 for an example. Define the set of fog nodes as

$$\mathcal{I} := \{1, 2, \dots, K\}. \tag{1}$$

In each time slot, the task node generates one task and intelligently chooses one fog node to execute this task. The helper nodes can also generate tasks occasionally. In this paper, we focus on the offloading decisions of the task node. Thus the tasks generated by helper nodes are assumed to be processed locally. Besides, all the tasks are cached and executed in a first-input first-output (FIFO) queue.

We assume the evaluation scheme of each node follows the logit model, which is a commonly used binary classifier [15]. Relying on this model, the probability of node-$i$ feeding back $y_i^{(t)} = +1$ or $-1$ is given as follows.

$$\Pr\left[y_i^{(t)} = \pm 1 \,\middle|\, \mathbf{x}_i^{(t)}\right] = \frac{1}{1+\exp\left(-y_i^{(t)}\, \mathbf{w}_i^\top \mathbf{x}_i^{(t)}\right)}, \tag{2}$$

where the pair $(\mathbf{x}_i^{(t)}, \mathbf{w}_i)$ is chosen from the set

$$\mathcal{D}_t := \left\{\left(\mathbf{x}_i^{(t)}, \mathbf{w}_i\right), \forall i \in \mathcal{I}\right\}. \tag{3}$$
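To make the feedback model concrete, the sketch below simulates one-bit happiness feedback drawn from the logit model in (2). The weight and feature vectors here are hypothetical values chosen for illustration, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_bit_feedback(w_i, x_i):
    """Draw the one-bit happiness y in {+1, -1} from the logit model (2):
    Pr[y = +1 | x] = 1 / (1 + exp(-w^T x))."""
    p_plus = 1.0 / (1.0 + np.exp(-(w_i @ x_i)))
    return 1 if rng.random() < p_plus else -1

# Hypothetical helper-node weights and task features (inside the unit ball).
w = np.array([0.5, -0.2, 0.3, 0.1, -0.4])  # unknown node preferences
x = np.array([0.2, 0.4, 0.1, 0.6, 0.3])    # task feature vector
y = one_bit_feedback(w, x)                 # one-bit happiness feedback
```

Repeated draws concentrate around $\Pr[y = +1 \mid \mathbf{x}]$, which is exactly the statistic the task node must estimate from bandit feedback.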

## 3 Problem Formulation

Our goal is to maximize the long-term happiness metric. Consider a time range $\mathcal{T} := \{1, 2, \dots, T\}$. The maximization of the long-term happiness metric is formulated as follows.

$$\underset{\{I_t\}}{\text{maximize}} \;\; \lim_{T\to\infty}\frac{1}{T}\sum_{t\in\mathcal{T}} y_{I_t}^{(t)} \qquad \text{subject to} \;\; (2),\; I_t \in \mathcal{I},\; \forall t \in \mathcal{T}. \tag{4}$$

There are two difficulties arising from (4). To begin with, the weight vector $\mathbf{w}_i$ is unknown to the task node. Furthermore, the offloading decision is required at the beginning of each time slot and cannot be altered afterwards. Thus it is necessary to learn the weight vectors along with task offloading. To deal with the latter difficulty, we turn to solving the following problem as an alternative.

$$\underset{I_t}{\text{maximize}} \;\; \mathbb{E}\left[y_{I_t}^{(t)}\right] \qquad \text{subject to} \;\; (2),\; I_t \in \mathcal{I}. \tag{5}$$

Although this problem is not exactly the same as the original one in (4), such a relaxation is common practice, as indicated in [7, 8, 9, 11]. Meanwhile, under the stochastic framework [14], it is more natural to focus on the expectation, i.e. $\mathbb{E}[y_{I_t}^{(t)}]$. Note the expected happiness metric of each arm has to be estimated based on the historical feedback. Thus there is an exploration-exploitation tradeoff in (5). On the one hand, the task node tends to choose the best node according to the historical information. On the other hand, more tests of unfamiliar nodes may bring the task node extra rewards. Plenty of works have dealt with this kind of exploration-exploitation tradeoff under the MAB framework [11, 12, 13, 14, 15]. In the rest of the paper, we endeavor to address this tradeoff through bandit methods.

This exploration-exploitation tradeoff can be handled by a stationary multi-armed bandit (MAB) model, where each node is seen as one arm. Delivering one task is analogous to pulling one arm, and the task node makes decisions based on all the feedback it has received.

Recall that the feedback of each node follows the logit model in (2). Thus, the loss function in time slot-$t$ is defined as

$$f_t(\mathbf{w}_i) = \log\left(1+\exp\left(-y_i^{(t)}\, \mathbf{w}_i^\top \mathbf{x}_i^{(t)}\right)\right)\mathds{1}\{I_t = i\}. \tag{6}$$

Given the first $T$ observations of feedback, the weight vector of each fog node can be approximated by its maximum likelihood estimate (MLE) as follows.

$$\bar{\mathbf{w}}_i^{(T)} = \underset{\|\mathbf{w}_i\|_2 \le 1}{\arg\min}\; \frac{1}{T}\sum_{t=1}^{T} f_t(\mathbf{w}_i), \quad \forall i \in \mathcal{I}. \tag{7}$$

Clearly, this approach requires optimizing over all the historical feedback, which is not scalable. To make it amenable to online updating, we follow [15] and propose an approximate sequential MLE solution as

$$\bar{\mathbf{w}}_i^{(t+1)} = \underset{\|\mathbf{w}_i\|_2 \le 1}{\arg\min}\; \left\|\mathbf{w}_i - \bar{\mathbf{w}}_i^{(t)}\right\|_{\mathbf{Z}_i^{(t)}}^2 + \left(\mathbf{w}_i - \bar{\mathbf{w}}_i^{(t)}\right)^\top \nabla f_t\left(\bar{\mathbf{w}}_i^{(t)}\right)\mathds{1}\{I_t = i\}, \tag{8}$$

where

$$\mathbf{Z}_i^{(t+1)} = \mathbf{Z}_i^{(t)} + \frac{\beta}{2}\, \mathbf{x}_i^{(t)}\left(\mathbf{x}_i^{(t)}\right)^\top \mathds{1}\{I_t = i\}. \tag{9}$$
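One sequential update of (8)-(9) can be sketched as follows. The unconstrained minimizer of (8) is $\bar{\mathbf{w}}_i^{(t)} - \frac{1}{2}(\mathbf{Z}_i^{(t)})^{-1}\nabla f_t(\bar{\mathbf{w}}_i^{(t)})$; the Euclidean rescaling onto the unit ball below is a heuristic stand-in for the exact constrained projection, and all variable names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sequential_mle_step(w_bar, Z, x, y, beta):
    """One online update of (8)-(9) for the node that was queried."""
    # Gradient of the logistic loss (6) at the current estimate w_bar.
    grad = -y * sigmoid(-y * (w_bar @ x)) * x
    # Rank-one update of the design matrix, cf. (9).
    Z_new = Z + (beta / 2.0) * np.outer(x, x)
    # Unconstrained minimizer of (8): w_bar - (1/2) Z^{-1} grad.
    w_new = w_bar - 0.5 * np.linalg.solve(Z, grad)
    nrm = np.linalg.norm(w_new)
    if nrm > 1.0:  # heuristic rescaling back onto the unit ball
        w_new = w_new / nrm
    return w_new, Z_new
```

Only the queried node's statistics change in a slot, so the per-slot cost is one rank-one update and one linear solve, which is what makes the scheme suitable for online use.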

As indicated in (5), our goal is to maximize the expectation of the instantaneous happiness metric, which is positively correlated with the probability of $y_{I_t}^{(t)} = +1$. Additionally, by (2), this probability is positively correlated with $\mathbf{w}_{I_t}^\top \mathbf{x}_{I_t}^{(t)}$ as well. Then in time slot-$t$, the task node chooses one fog node based on the features by solving the following optimization problem:

$$\left(\mathbf{x}_{I_t}^{(t)}, \hat{\mathbf{w}}_{I_t}\right) = \underset{(\mathbf{x}, \mathbf{w}) \in \bar{\mathcal{D}}_t}{\arg\max}\; \mathbf{w}^\top \mathbf{x}. \tag{10}$$

Note $\hat{\mathbf{w}}_{I_t}$ is just a temporary variable that does not engage in the updates of any other variables. Essentially, we are only interested in the index of the node, i.e. $I_t$. The domain $\bar{\mathcal{D}}_t$ is

$$\bar{\mathcal{D}}_t = \bigcup_{i\in\mathcal{I}}\left\{(\mathbf{x}, \mathbf{w}) \,\middle|\, \mathbf{x} = \mathbf{x}_i^{(t)},\; \mathbf{w} \in \mathcal{W}_i^{(t)}\right\}, \tag{11}$$

where $\mathcal{W}_i^{(t)}$ is the feasible region for the weight estimates. To highlight the benefit of exploration, the domain $\mathcal{W}_i^{(t)}$ should be a ball centered at $\bar{\mathbf{w}}_i^{(t)}$. Specifically, the ball is characterized as

$$\mathcal{W}_i^{(t)} := \left\{\mathbf{w} \,\middle|\, \left\|\bar{\mathbf{w}}_i^{(t)} - \mathbf{w}\right\|_{\mathbf{Z}_i^{(t)}}^2 \le \gamma_i^{(t)}\right\}. \tag{12}$$

Note $\gamma_i^{(t)}$ is an important parameter, the value of which determines the performance of our proposed algorithm. Details about $\gamma_i^{(t)}$ and the corresponding theoretical guarantees will be introduced later in Section 4.2. Based on the feasible region of $\mathbf{w}$ defined in (12), we can find the node index revealed in (10) as follows.

$$\begin{aligned} I_t &= \underset{i\in\mathcal{I}}{\arg\max}\; \left(\max_{\|\bar{\mathbf{w}}_i^{(t)} - \mathbf{w}\|_{\mathbf{Z}_i^{(t)}}^2 \le \gamma_i^{(t)}} \mathbf{w}^\top \mathbf{x}_i^{(t)}\right) \\ &= \underset{i\in\mathcal{I}}{\arg\max}\; \left(\left(\bar{\mathbf{w}}_i^{(t)}\right)^\top \mathbf{x}_i^{(t)} - \min_{\|\mathbf{z}\|_2^2 \le \gamma_i^{(t)}}\left[\left(\sqrt{\mathbf{Z}_i^{(t)}}\right)^{-1}\mathbf{z}\right]^\top \mathbf{x}_i^{(t)}\right) \\ &= \underset{i\in\mathcal{I}}{\arg\max}\; \left(\left(\bar{\mathbf{w}}_i^{(t)}\right)^\top \mathbf{x}_i^{(t)} + \sqrt{\gamma_i^{(t)}}\left\|\mathbf{x}_i^{(t)}\right\|_{\left(\mathbf{Z}_i^{(t)}\right)^{-1}}\right). \tag{13} \end{aligned}$$
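The closed-form index rule in (13) can be sketched directly (names are ours; each node's statistics $\bar{\mathbf{w}}_i^{(t)}$, $\mathbf{Z}_i^{(t)}$, and $\gamma_i^{(t)}$ are assumed to be maintained elsewhere):

```python
import numpy as np

def choose_node(w_bars, Zs, xs, gammas):
    """Return the node index maximizing the UCB score of (13):
    exploitation term w_bar^T x plus bonus sqrt(gamma) * ||x||_{Z^{-1}}."""
    scores = []
    for w_bar, Z, x, gamma in zip(w_bars, Zs, xs, gammas):
        exploit = w_bar @ x
        explore = np.sqrt(gamma) * np.sqrt(x @ np.linalg.solve(Z, x))
        scores.append(exploit + explore)
    return int(np.argmax(scores))
```

With equal weight estimates, the bonus $\sqrt{\gamma_i^{(t)}}\|\mathbf{x}_i^{(t)}\|_{(\mathbf{Z}_i^{(t)})^{-1}}$ is larger for a less-explored node (smaller $\mathbf{Z}_i^{(t)}$), which is precisely how the rule encourages exploration.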

### 4.2 Theoretical Guarantees

We first analyze the convergence of $\bar{\mathbf{w}}_i^{(t)}$ in Proposition 1. The proof can be completed by following the proof of Theorem 1 in [15].

###### Proposition 1.

With a probability at least $1-\delta$, we have

$$\left\|\bar{\mathbf{w}}_i^{(t+1)} - \mathbf{w}_i\right\|_{\mathbf{Z}_i^{(t+1)}}^2 \le \gamma_i^{(t+1)}, \quad \forall t > 0, \tag{14}$$

where

$$\gamma_i^{(t+1)} = \left[8 + \left(8\beta + \frac{16}{3}\right)\tau_t + 2\beta\log\frac{\det\left(\mathbf{Z}_i^{(t+1)}\right)}{\det\left(\mathbf{Z}_i^{(1)}\right)}\right] + \lambda, \tag{15}$$
$$\tau_t = \log\left(\frac{2\lceil 2\log_2 t\rceil t^2}{\delta}\right), \qquad \beta = \frac{1}{2(1+\exp(1))}. \tag{16}$$
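For concreteness, (15)-(16) can be evaluated numerically as below. The values of $\delta$ and $\lambda$ here are hypothetical choices for illustration, not the paper's tuned parameters.

```python
import math
import numpy as np

def gamma_next(Z_next, Z_init, t, delta=0.05, lam=1.0):
    """Confidence width gamma_i^{(t+1)} per (15)-(16) (valid for t >= 2)."""
    beta = 1.0 / (2.0 * (1.0 + math.exp(1)))              # beta in (16)
    tau_t = math.log(2 * math.ceil(2 * math.log2(t)) * t ** 2 / delta)
    # log det ratio via slogdet for numerical stability.
    log_det_ratio = (np.linalg.slogdet(Z_next)[1]
                     - np.linalg.slogdet(Z_init)[1])
    return 8 + (8 * beta + 16 / 3) * tau_t + 2 * beta * log_det_ratio + lam
```

Since $\tau_t$ grows logarithmically in $t$ and the log-determinant ratio grows at most logarithmically as well, the width widens only slowly over time.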

Proposition 1 indicates that the width of the confidence region, i.e. $\gamma_i^{(t)}$, is in the order of $c\log t$, where $c$ is a particular constant. If the weight vector of each node is perfectly observed, the task node can pick a node with the maximal probability of positive feedback. Thus, we define the optimal node in time slot-$t$ as node-$I_t^*$ such that

$$\left(\mathbf{x}_{I_t^*}^{(t)}, \mathbf{w}_{I_t^*}\right) = \underset{(\mathbf{x}, \mathbf{w}) \in \mathcal{D}_t}{\arg\max}\; \mathbf{w}^\top \mathbf{x}, \tag{17}$$

where the domain $\mathcal{D}_t$ is defined in (3). Accordingly, the instantaneous regret function can be written as follows.

$$r_t = \mathbf{w}_{I_t^*}^\top \mathbf{x}_{I_t^*}^{(t)} - \mathbf{w}_{I_t}^\top \mathbf{x}_{I_t}^{(t)}. \tag{18}$$

The upper bound on the regret is given in Proposition 2.

###### Proposition 2.

With a probability at least $1-\delta$, the average regret is upper-bounded as

$$R(T) = \frac{1}{T}\sum_{t=1}^{T} r_t \le 4\sqrt{\frac{\gamma^{(T)}}{\beta T}\sum_{i=1}^{K}\log\frac{\det\left(\mathbf{Z}_i^{(T)}\right)}{\det\left(\mathbf{Z}_i^{(1)}\right)}}, \tag{19}$$

where $\gamma^{(T)} := \max_{i\in\mathcal{I}} \gamma_i^{(T)}$.

This proposition implies that the average regret approaches zero as time goes to infinity with overwhelming probability. The proof is given in the Appendix.

## 5 Numerical Results

In this section, we justify the performance of our algorithm by testing it over a sequence of tasks and comparing it with other algorithms. The tasks are allocated to fog nodes on demand. Besides, we assume that the data length of each task (in KB) obeys a given distribution. Each task feature vector $\mathbf{x}_i^{(t)}$ is set to consist of five features. The specific features and the corresponding correlations to happiness are shown in Table 1. The parameter $\lambda$ is introduced to make sure that $\mathbf{Z}_i^{(t)}$ is invertible and barely affects the performance of our algorithm. Hence, we simply choose $\lambda$ according to [15]. The parameter $\gamma_i^{(t)}$ is tuned according to (15). It is worth noting that the tuned value of $\gamma_i^{(t)}$ has the same order as that in (15) instead of the exact value. This is due to the fact that the $\gamma_i^{(t)}$ in (15) only provides an upper bound on the estimation error of $\bar{\mathbf{w}}_i^{(t)}$, which may not be tight in terms of the constant mentioned in Proposition 1.

In Fig. 2, we compare the performance of the TOOF algorithm with Round-Robin and Greedy. In the round-robin algorithm, nodes are chosen in a cyclic sequence regardless of their current states. In the greedy algorithm, the task node chooses a helper node in each time slot under the same rule as TOOF, but the confidence parameter $\gamma_i^{(t)}$ stays the same over time. It means that every element of the estimated weight vector is updated at the same pace. Clearly, Fig. 2 indicates that the regret of our proposed TOOF algorithm tends to converge to zero. Besides, the TOOF algorithm achieves much lower regret than the other two algorithms.
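For reference, the two baselines compared above could be implemented as follows. This is a sketch under our reading of the setup: we interpret Greedy as the score in (13) with the exploration bonus removed, and all names are ours.

```python
import numpy as np

def round_robin(t, num_nodes):
    """Round-Robin baseline: cycle through nodes regardless of feedback."""
    return t % num_nodes

def greedy(w_bars, xs):
    """Greedy baseline: exploit the current weight estimates only,
    i.e. (13) without the exploration bonus."""
    return int(np.argmax([w_bar @ x for w_bar, x in zip(w_bars, xs)]))
```

The greedy rule can lock onto a node whose estimate happens to look good early on, which is the behavior that the exploration bonus in TOOF is designed to avoid.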

The performance of our proposed scheme is also justified by the comparison among rewards in Fig. 3. Comparing with (4), we can see that the reward defined in Fig. 3 is also a happiness metric, denoting happy (unhappy) by $+1$ ($-1$). As a reference, Optimal is employed to show the performance in the case of perfect knowledge, where the node is chosen as in (17). Our algorithm begins to show its superiority over the greedy algorithm at an early time slot and keeps widening the gap. Fig. 3 also illustrates that the reward obtained via the TOOF algorithm approaches the optimal one. This phenomenon demonstrates that, as the number of tasks increases, our TOOF algorithm is capable of handling the tradeoff between learning the system parameters and obtaining a high immediate reward.

## 6 Conclusions

In this paper, an efficient task offloading strategy with one-bit feedback and the corresponding performance guarantee have been investigated. With unknown weight vectors of the helper nodes and probabilistic feedback, a multi-armed bandit framework has been proposed. We implemented an efficient TOOF algorithm arising from a UCB-type algorithm. We have also proven an upper bound on the average regret that vanishes as the time horizon grows. What's more, we carried out numerical simulations and showed the superiority of our TOOF algorithm.

## References

• [1] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet Things J., vol. 3, no. 6, pp. 854–864, Dec. 2016.
• [2] T. Q. Dinh, J. Tang, Q. D. La, and T. Q. S. Quek, “Offloading in mobile edge computing: Task allocation and computational frequency scaling,” IEEE Trans. Commun., vol. 65, no. 8, pp. 3571–3584, Aug. 2017.
• [3] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, Aug. 2017.
• [4] Y. Yang, K. Wang, G. Zhang, X. Chen, X. Luo, and M. Zhou, “MEETS: Maximal energy efficient task scheduling in homogeneous fog networks,” submitted to IEEE Internet Things J., 2017.
• [5] C. You, K. Huang, H. Chae, and B.-H. Kim, “Energy-efficient resource allocation for mobile-edge computation offloading,” IEEE Trans. Wireless Commun., vol. 16, no. 3, pp. 1397–1411, Mar. 2017.
• [6] J. Kwak, Y. Kim, J. Lee, and S. Chong, “DREAM: Dynamic resource and task allocation for energy minimization in mobile cloud systems,” IEEE J. Sel. Areas Commun, vol. 33, no. 12, pp. 2510–2523, Dec. 2015.
• [7] Y. Mao, J. Zhang, S. H. Song, and K. B. Letaief, “Stochastic joint radio and computational resource management for multi-user mobile-edge computing systems,” IEEE Trans. Wireless Commun., vol. 16, no. 9, pp. 5994–6009, Sept. 2017.
• [8] Y. Yang, S. Zhao, W. Zhang, Y. Chen, X. Luo, and J. Wang, “DEBTS: Delay energy balanced task scheduling in homogeneous fog networks,” IEEE Internet Things J., in press.
• [9] L. Pu, X. Chen, J. Xu, and X. Fu, “D2D fogging: An energy-efficient and incentive-aware task offloading framework via network-assisted D2D collaboration,” IEEE J. Sel. Areas Commun., vol. 34, no.12, pp. 3887–3901, Dec. 2016.
• [10] T. Chen and G. B. Giannakis, “Bandit convex optimization for scalable and dynamic IoT management”, IEEE Internet Things J., in press.
• [11] Z. Zhu, T. Liu, S. Jin, and X. Luo, “Learn and pick right nodes to offload”, arXiv preprint arXiv:1804.08416, 2018.
• [12] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Mach. Learn., vol. 47, no. 2, pp. 235–256, May 2002.
• [13] D. A. Berry and B. Fristedt, Bandit Problems: Sequential Allocation of Experiments. London, U.K.: Chapman & Hall, 1985.
• [14] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Found. Trends Mach. Learn., vol. 5, no. 1, pp. 1–122, 2012.
• [15] L. Zhang, T. Yang, R. Jin, Y. Xiao, and Z. Zhou, “Online stochastic linear optimization under one-bit feedback,” Proc. ICML, New York, NY, USA, Jun. 2016, pp. 392–401.
• [16] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for linear stochastic bandits,” Advances in Neural Information Processing Systems, Granada, Spain, Dec. 2011.

## Appendix

With a probability at least $1-\delta$, the instantaneous regret can be upper-bounded as follows.

$$\begin{aligned} r_t &= \mathbf{w}_{I_t^*}^\top \mathbf{x}_{I_t^*}^{(t)} - \mathbf{w}_{I_t}^\top \mathbf{x}_{I_t}^{(t)} \\ &\le \left(\hat{\mathbf{w}}_{I_t}^{(t)}\right)^\top \mathbf{x}_{I_t}^{(t)} - \mathbf{w}_{I_t}^\top \mathbf{x}_{I_t}^{(t)} \\ &= \left(\hat{\mathbf{w}}_{I_t}^{(t)} - \bar{\mathbf{w}}_{I_t}^{(t)}\right)^\top \mathbf{x}_{I_t}^{(t)} + \left(\bar{\mathbf{w}}_{I_t}^{(t)} - \mathbf{w}_{I_t}\right)^\top \mathbf{x}_{I_t}^{(t)} \\ &\overset{(a)}{\le} \left(\left\|\hat{\mathbf{w}}_{I_t}^{(t)} - \bar{\mathbf{w}}_{I_t}^{(t)}\right\|_{\mathbf{Z}_{I_t}^{(t)}} + \left\|\bar{\mathbf{w}}_{I_t}^{(t)} - \mathbf{w}_{I_t}\right\|_{\mathbf{Z}_{I_t}^{(t)}}\right)\left\|\mathbf{x}_{I_t}^{(t)}\right\|_{\left(\mathbf{Z}_{I_t}^{(t)}\right)^{-1}} \\ &\overset{(b)}{\le} 2\sqrt{\gamma_{I_t}^{(t)}}\left\|\mathbf{x}_{I_t}^{(t)}\right\|_{\left(\mathbf{Z}_{I_t}^{(t)}\right)^{-1}}, \end{aligned} \tag{20}$$

where $(a)$ holds due to the Cauchy-Schwarz inequality, and $(b)$ holds with a probability at least $1-\delta$ according to Proposition 1 and the definition of $\mathcal{W}_{I_t}^{(t)}$ in (12), i.e. $\|\hat{\mathbf{w}}_{I_t}^{(t)} - \bar{\mathbf{w}}_{I_t}^{(t)}\|_{\mathbf{Z}_{I_t}^{(t)}}^2 \le \gamma_{I_t}^{(t)}$ and $\|\bar{\mathbf{w}}_{I_t}^{(t)} - \mathbf{w}_{I_t}\|_{\mathbf{Z}_{I_t}^{(t)}}^2 \le \gamma_{I_t}^{(t)}$.

On the other hand, the following inequality always holds:

$$r_t = \mathbf{w}_{I_t^*}^\top \mathbf{x}_{I_t^*}^{(t)} - \mathbf{w}_{I_t}^\top \mathbf{x}_{I_t}^{(t)} = \mathbf{w}_{I_t^*}^\top\left(\mathbf{x}_{I_t^*}^{(t)} - \mathbf{x}_{I_t}^{(t)}\right) + \left(\mathbf{w}_{I_t^*} - \mathbf{w}_{I_t}\right)^\top \mathbf{x}_{I_t}^{(t)} \overset{(c)}{\le} 4, \tag{21}$$

where $(c)$ holds due to the fact that $\mathbf{w}_i$ and $\mathbf{x}_i^{(t)}$ have been normalized, i.e. $\|\mathbf{w}_i\|_2 \le 1$ and $\|\mathbf{x}_i^{(t)}\|_2 \le 1$, $\forall i \in \mathcal{I}$.

Thus the total regret can be upper-bounded by

$$\begin{aligned} \sum_{t=1}^{T} r_t &= \sum_{t=1}^{T}\left(\mathbf{w}_{I_t^*}^\top \mathbf{x}_{I_t^*}^{(t)} - \mathbf{w}_{I_t}^\top \mathbf{x}_{I_t}^{(t)}\right) \le \sum_{t=1}^{T}\min\left(2\sqrt{\gamma^{(t)}}\left\|\mathbf{x}_{I_t}^{(t)}\right\|_{\left(\mathbf{Z}_{I_t}^{(t)}\right)^{-1}},\, 4\right) \\ &\le \sqrt{\frac{8\gamma^{(T)}}{\beta}}\sum_{t=1}^{T}\min\left(\sqrt{\frac{\beta}{2}}\left\|\mathbf{x}_{I_t}^{(t)}\right\|_{\left(\mathbf{Z}_{I_t}^{(t)}\right)^{-1}},\, 1\right) \\ &\le \sqrt{\frac{8\gamma^{(T)}T}{\beta}}\sqrt{\sum_{t=1}^{T}\min\left(\frac{\beta}{2}\left\|\mathbf{x}_{I_t}^{(t)}\right\|_{\left(\mathbf{Z}_{I_t}^{(t)}\right)^{-1}}^2,\, 1\right)}. \end{aligned} \tag{22}$$

Similar to the result from Lemma 11 in [16], we have

$$\begin{aligned} \det\left(\mathbf{Z}_i^{(T+1)}\right) &= \det\left(\mathbf{Z}_i^{(T)} + \frac{\beta}{2}\mathbf{x}_i^{(T)}\left(\mathbf{x}_i^{(T)}\right)^\top \mathds{1}\{I_T = i\}\right) \\ &= \det\left(\mathbf{Z}_i^{(T)}\right)\left(1 + \frac{\beta}{2}\left\|\mathbf{x}_i^{(T)}\right\|_{\left(\mathbf{Z}_i^{(T)}\right)^{-1}}^2\mathds{1}\{I_T = i\}\right) \\ &= \det\left(\mathbf{Z}_i^{(1)}\right)\prod_{t=1}^{T}\left(1 + \frac{\beta}{2}\left\|\mathbf{x}_i^{(t)}\right\|_{\left(\mathbf{Z}_i^{(t)}\right)^{-1}}^2\mathds{1}\{I_t = i\}\right). \end{aligned} \tag{23}$$

Thus

$$\begin{aligned} \sum_{t=1}^{T}\min\left(\frac{\beta}{2}\left\|\mathbf{x}_{I_t}^{(t)}\right\|_{\left(\mathbf{Z}_{I_t}^{(t)}\right)^{-1}}^2,\, 1\right) &\le 2\sum_{t=1}^{T}\log\left(1 + \frac{\beta}{2}\left\|\mathbf{x}_{I_t}^{(t)}\right\|_{\left(\mathbf{Z}_{I_t}^{(t)}\right)^{-1}}^2\right) \\ &= 2\sum_{t=1}^{T}\sum_{i=1}^{K}\log\left(1 + \frac{\beta}{2}\left\|\mathbf{x}_i^{(t)}\right\|_{\left(\mathbf{Z}_i^{(t)}\right)^{-1}}^2\mathds{1}\{I_t = i\}\right) \\ &= 2\sum_{i=1}^{K}\log\frac{\det\left(\mathbf{Z}_i^{(T)}\right)}{\det\left(\mathbf{Z}_i^{(1)}\right)}. \end{aligned} \tag{24}$$

Substituting this result into (22) yields

$$\sum_{t=1}^{T} r_t \le 4\sqrt{\frac{\gamma^{(T)} T}{\beta}\sum_{i=1}^{K}\log\frac{\det\left(\mathbf{Z}_i^{(T)}\right)}{\det\left(\mathbf{Z}_i^{(1)}\right)}}. \tag{25}$$
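For completeness, the first inequality in (24) relies on the following elementary bound (a short justification we add here, with $a = \frac{\beta}{2}\|\mathbf{x}_{I_t}^{(t)}\|_{(\mathbf{Z}_{I_t}^{(t)})^{-1}}^2 \ge 0$):

```latex
\min(a, 1) \le 2\log(1+a), \qquad \forall a \ge 0,
```

which holds since $\log(1+a) \ge a - a^2/2 \ge a/2$ for $0 \le a \le 1$, while $2\log(1+a) \ge 2\log 2 > 1$ for $a > 1$.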