# Online Optimal Task Offloading with One-bit Feedback

###### Abstract

Task offloading is an emerging technology in fog-enabled networks. It allows users to transmit tasks to neighboring fog nodes so as to utilize the computing resources of the network. In this paper, we investigate a stochastic task offloading model and propose a multi-armed bandit framework to formulate it. We consider that different helper nodes prefer different kinds of tasks and feed back one-bit information, named the happiness of the node, to the task node. The key challenge of this problem is an exploration-exploitation tradeoff. Thus we implement a UCB-type algorithm to maximize the long-term happiness metric. Furthermore, we prove that this UCB-type algorithm is asymptotically optimal. Numerical simulations are given at the end of the paper to corroborate our strategy.


Shangshu Zhao, Zhaowei Zhu, Fuqian Yang and Xiliang Luo

ShanghaiTech University, Shanghai, China

Email: luoxl@shanghaitech.edu.cn

Index Terms— Online learning.

## 1 Introduction

With the rapid evolution of the Internet of Things (IoT), 5G wireless systems, and embedded artificial intelligence in recent years, ever-growing data processing capability is required of mobile devices [1]. To exploit the benefit of all the available computational resources, fog computing (or mobile edge computing) has been considered a potential solution to enable computation-intensive and latency-critical applications at battery-powered mobile devices [2].

Fog computing promises dramatic reductions in latency and mobile energy consumption via offloading computation tasks [3]. Recently, many works have been carried out on task offloading in fog computing [4, 5, 6, 7, 8, 9]. Among them, some considered the energy issues and formulated task offloading as deterministic optimization problems [4, 5]. When the real-time states of users and servers are taken into account, task offloading becomes a typical stochastic optimization problem. To make the problem tractable, the Lyapunov optimization method was invoked in [6, 7, 8, 9] to transform the challenging stochastic optimization problem into a sequential decision problem, i.e. a series of deterministic problems, one at each time slot.

The above works all assumed perfect knowledge of system parameters, e.g. the dedicated modeling among latency, energy consumption, and computational resources. However, the models and system parameters in practice may be too complicated for an individual user to characterize. For example, the communication delay and the computation delay were modeled as bandit feedback in [10], which was only revealed for the nodes that were queried. Without assuming dedicated system models and perfect knowledge of parameters, the tradeoff between learning the system and pursuing the empirically best offloading strategy was investigated under the bandit model in [11]. The exploration-versus-exploitation tradeoff in [11] was addressed based on the framework of the multi-armed bandit (MAB), which has been extensively studied in statistics [12, 13, 14].

In practice, it is impossible for each fog node to provide its preference over tasks as numerical values or concrete functions. Hence, whether a fog node is willing to execute an offloaded task is hard to predict. In this paper, we assume that the performance of offloading each task is quantified by one-bit information delivered as the feedback of task offloading. The value of the feedback is a random variable, which is correlated with particular features of the task itself and of the node processing it. We endeavor to make online decisions to maximize the long-term performance of task offloading with this probabilistic feedback. Our main contributions are summarized as follows. Firstly, we apply a bandit learning method to capture the unknown effect of the concerned features, and model the node uncertainty by a logit model, which is more practical than the previous model-based approaches, e.g. [4, 5, 6, 7]. Secondly, we extend the algorithm proposed in [15] to make it suitable for learning the feature weights of different fog nodes. We further analyze our proposed algorithm and derive the corresponding performance guarantees for our extension.

The rest of this paper is organized as follows. Section 2 introduces the system model and assumptions. Section 3 formulates the problem. Our algorithm and the related guarantees are introduced in Section 4. We then present and analyze numerical results in Section 5 and conclude in Section 6.

Notations: $\mathbf{A}^{\mathsf{T}}$, $|\mathcal{A}|$, $\|\mathbf{x}\|_{\mathbf{M}}$, $\Pr[E]$, and $\mathbb{E}[X]$ stand for the transpose of a matrix $\mathbf{A}$, the cardinality of a set $\mathcal{A}$, the norm of a vector $\mathbf{x}$ defined as $\sqrt{\mathbf{x}^{\mathsf{T}}\mathbf{M}\mathbf{x}}$ for a positive definite matrix $\mathbf{M}$, the probability of an event $E$, and the expectation of a random variable $X$. The indicator function $\mathbb{1}\{\cdot\}$ takes the value of $1$ ($0$) when the specified condition is met (otherwise).

## 2 System Model

We consider the task offloading problem in a network including $N+1$ fog nodes, i.e. one task node and $N$ helper nodes. See Fig. 1 for an example. Define the set of fog nodes as

$$\mathcal{N} \triangleq \{0, 1, \ldots, N\}, \tag{1}$$

where node-$0$ denotes the task node.

In each time slot, the task node generates one task and intelligently chooses one fog node to execute it. The helper nodes can also generate tasks occasionally. In this paper, we focus on the offloading decisions of the task node, so the tasks generated by helper nodes are assumed to be processed locally. Besides, all the tasks are cached and executed in a first-in first-out (FIFO) queue.

In time slot-$t$, the task node offloads a task to a particular node-$k_t$ and receives a one-bit feedback $y_t \in \{1, -1\}$. This one-bit feedback indicates whether the node feels optimistic about the current task. Without loss of generality, we use $y_t = 1$ ($y_t = -1$) to denote that the node is happy (unhappy). Note the feedback is required to be delivered immediately after receiving the offloaded task. As shown in Fig. 1, the feedback is determined jointly by the task status and the node status. To represent the factors affecting the feedback $y_t$, we combine both statuses and model them as the feature vector $\mathbf{x}_t \in \mathbb{R}^d$. The features may include many attributes, e.g. queue length, data length, task complexity, central processing unit (CPU) frequency, and channel quality information (CQI). The feature vector is normalized such that $\|\mathbf{x}_t\|_2 \le 1$. Note that, in real life, different kinds of computing nodes certainly have different preferences over tasks, which are reflected in different weights on the same feature. To capture these preferences, each node-$k$ is associated with a weight vector $\mathbf{w}_k \in \mathbb{R}^d$, each element of which evaluates the weight of the corresponding feature in $\mathbf{x}_t$. Similar to $\mathbf{x}_t$, we also normalize $\mathbf{w}_k$ such that $\|\mathbf{w}_k\|_2 \le 1$.

We assume the evaluation scheme of each node follows the logit model, a commonly used binary classifier [15]. Relying on this model, the probability of feeding back $y_t = 1$ or $y_t = -1$ is given as follows:

$$\Pr\left[y_t = y \mid \mathbf{x}_t\right] = \frac{1}{1 + \exp\left(-y\,\mathbf{w}_{k}^{\mathsf{T}}\mathbf{x}_t\right)}, \quad y \in \{1, -1\}, \tag{2}$$

where the pair $(k, \mathbf{x}_t)$ is chosen from the set

$$\mathcal{D}_t \triangleq \left\{(k, \mathbf{x}_{k,t}) \mid k \in \{1, 2, \ldots, N\}\right\}. \tag{3}$$
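As a concrete illustration, the feedback mechanism above can be sketched in a few lines of Python. This is a minimal sketch of the logit model in (2); the weight and feature vectors are hypothetical, and `happy_probability`/`one_bit_feedback` are illustrative names rather than anything from the paper.

```python
import numpy as np

def happy_probability(w, x, y=1):
    """Probability that a node with weight vector w returns feedback y
    (y = +1: happy, y = -1: unhappy) for a task with feature vector x,
    under the logit model of Eq. (2)."""
    return 1.0 / (1.0 + np.exp(-y * np.dot(w, x)))

def one_bit_feedback(w, x, rng):
    """Draw the one-bit feedback y in {+1, -1}."""
    p = happy_probability(w, x, y=1)
    return 1 if rng.random() < p else -1

rng = np.random.default_rng(0)
w = np.array([0.6, -0.3, 0.5])   # hypothetical normalized weight vector
x = np.array([0.4, 0.2, 0.8])    # hypothetical normalized feature vector
p = happy_probability(w, x)
```

Note that the two outcomes are complementary: the probabilities of $y=1$ and $y=-1$ always sum to one, as required by (2).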

## 3 Problem Formulation

Our goal is to maximize the long-term happiness metric. Consider a time range $t \in \{1, 2, \ldots, T\}$. The maximization of the long-term happiness metric is formulated as follows:

$$\max_{\{k_t\}_{t=1}^{T}} \ \sum_{t=1}^{T} y_t. \tag{4}$$

There are two difficulties arising from (4). To begin with, the weight vector $\mathbf{w}_k$ of each node is unknown to the task node. Furthermore, the offloading decision is required at the beginning of each time slot and cannot be altered afterwards. Thus it is necessary to learn the weight vectors along with task offloading. To deal with the latter difficulty, we turn to solving the following problem as an alternative:

$$\max_{\{k_t\}_{t=1}^{T}} \ \sum_{t=1}^{T} \mathbb{E}[y_t]. \tag{5}$$

Although this problem is not exactly the same as the original one in (4), this substitution is a common approach, as indicated in [7, 8, 9, 11]. Meanwhile, under the stochastic framework [14], it is more natural to focus on the expectation, i.e. $\mathbb{E}[y_t]$. Note the expected happiness metric of each arm has to be estimated based on the historical feedback. Thus there is an exploration-exploitation tradeoff in (5). On the one hand, the task node tends to choose the best node according to the historical information. On the other hand, more tests on unfamiliar nodes may bring the task node extra rewards. Plenty of works have dealt with this kind of exploration-exploitation tradeoff under the MAB framework [11, 12, 13, 14, 15]. In the rest of the paper, we endeavor to address this tradeoff through bandit methods.

## 4 Online Task Offloading with Guarantees

### 4.1 Task Offloading with One-bit Feedback

This exploration-exploitation tradeoff can be handled by a stationary multi-armed bandit (MAB) model, where each node can be seen as one arm. Delivering one task is analogous to pulling one arm, and the task node makes decisions based on all the feedback it has received.

Recall that the feedback of each node follows the logit model in (2). Thus, the loss function in time slot-$t$ is defined as

$$\ell_t(\mathbf{w}) \triangleq \log\left(1 + \exp\left(-y_t\,\mathbf{w}^{\mathsf{T}}\mathbf{x}_t\right)\right). \tag{6}$$

Given the first $t$ observations of feedback, i.e. $\{(\mathbf{x}_\tau, y_\tau)\}_{\tau=1}^{t}$, the weight vector of each fog node-$k$ can be approximated by its maximum likelihood estimate (MLE) as follows:

$$\hat{\mathbf{w}}_{k,t} = \operatorname*{arg\,min}_{\mathbf{w}} \ \sum_{\tau \le t:\ k_\tau = k} \ell_\tau(\mathbf{w}). \tag{7}$$

Clearly, this approach needs to optimize over all the historical feedback, which is not scalable. To make it capable of online updating, we refer to [15] and propose an approximate sequential MLE solution as

$$\hat{\mathbf{w}}_{k,t+1} = \operatorname*{arg\,min}_{\mathbf{w}:\ \|\mathbf{w}\|_2 \le 1} \ \left\|\mathbf{w} - \left(\hat{\mathbf{w}}_{k,t} - \mathbf{Z}_{k,t+1}^{-1}\nabla\ell_t(\hat{\mathbf{w}}_{k,t})\right)\right\|_{\mathbf{Z}_{k,t+1}}, \tag{8}$$

where

$$\mathbf{Z}_{k,t+1} = \mathbf{Z}_{k,t} + \frac{\beta}{2}\,\mathbf{x}_t\mathbf{x}_t^{\mathsf{T}}, \quad \mathbf{Z}_{k,1} = \lambda\mathbf{I}. \tag{9}$$
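The sequential update in (8)-(9) can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: the curvature constant `beta`, the regularizer `lam`, and the Euclidean (rather than $\mathbf{Z}$-norm) projection onto the unit ball are assumptions made for brevity.

```python
import numpy as np

def sequential_update(w_hat, Z, x, y, beta=0.5):
    """One sequential MLE step in the spirit of (8)-(9): a rank-one update
    of Z followed by a Newton-style correction of the weight estimate."""
    Z = Z + 0.5 * beta * np.outer(x, x)
    # gradient of the logistic loss log(1 + exp(-y w^T x)) at w_hat
    grad = -y * x / (1.0 + np.exp(y * np.dot(w_hat, x)))
    w_hat = w_hat - np.linalg.solve(Z, grad)
    # keep the estimate feasible: project onto the unit ball
    n = np.linalg.norm(w_hat)
    if n > 1.0:
        w_hat = w_hat / n
    return w_hat, Z

d, lam = 3, 1.0
Z = lam * np.eye(d)          # Z_{k,1} = lambda * I keeps Z invertible
w_hat = np.zeros(d)
w_hat, Z = sequential_update(w_hat, Z, np.array([0.4, 0.2, 0.8]), y=1)
```

Each step costs only one rank-one update and one linear solve, which is what makes the scheme scalable compared with re-solving (7) over the full history.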

As indicated in (5), our goal is to maximize the expectation of the instantaneous happiness metric, which is positively correlated with the probability of $y_t = 1$. According to (2), this probability is monotonically increasing in $\mathbf{w}_k^{\mathsf{T}}\mathbf{x}_t$. Then in time slot-$t$, the task node chooses one fog node based on the features by solving the following optimization problem:

$$\left(k_t, \mathbf{x}_t, \tilde{\mathbf{w}}_t\right) = \operatorname*{arg\,max}_{(k, \mathbf{x}, \mathbf{w}) \in \tilde{\mathcal{D}}_t} \ \mathbf{w}^{\mathsf{T}}\mathbf{x}. \tag{10}$$

Note $\tilde{\mathbf{w}}_t$ is just a temporary variable that does not engage in the updates of any other variables. Essentially, we are only interested in the index of the chosen node, i.e. $k_t$. The domain is

$$\tilde{\mathcal{D}}_t \triangleq \left\{(k, \mathbf{x}, \mathbf{w}) \mid (k, \mathbf{x}) \in \mathcal{D}_t,\ \mathbf{w} \in \mathcal{W}_{k,t}\right\}, \tag{11}$$

where $\mathcal{W}_{k,t}$ is the feasible region for the weight estimate of node-$k$. To highlight the benefit of exploration, the domain should be a ball centered on $\hat{\mathbf{w}}_{k,t}$. Specifically, the ball is characterized as

$$\mathcal{W}_{k,t} \triangleq \left\{\mathbf{w} : \left\|\mathbf{w} - \hat{\mathbf{w}}_{k,t}\right\|_{\mathbf{Z}_{k,t}}^2 \le \gamma_{k,t}\right\}. \tag{12}$$

Note $\gamma_{k,t}$ is an important parameter, the value of which determines the performance of our proposed algorithm. Details about $\gamma_{k,t}$ and the corresponding theoretical guarantees will be introduced later in Section 4.2. Based on the feasible region of $\mathbf{w}$ defined in (12), we can find the node index revealed in (10) as follows:

$$k_t = \operatorname*{arg\,max}_{k \in \{1, \ldots, N\}} \ \max_{\mathbf{w} \in \mathcal{W}_{k,t}} \ \mathbf{w}^{\mathsf{T}}\mathbf{x}_{k,t}. \tag{13}$$
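Because the inner maximization in (13) is over a ball in the $\mathbf{Z}_{k,t}$-norm, it admits (ignoring the norm constraint on $\mathbf{w}$) the closed form $\hat{\mathbf{w}}_{k,t}^{\mathsf{T}}\mathbf{x} + \sqrt{\gamma_{k,t}}\,\|\mathbf{x}\|_{\mathbf{Z}_{k,t}^{-1}}$. A minimal sketch of the resulting selection rule, with illustrative function and variable names:

```python
import numpy as np

def toof_select(w_hats, Zs, xs, gamma):
    """Pick the node maximizing the optimistic score of (13).
    The inner maximum of w^T x over the ball (12) has the closed form
    w_hat^T x + sqrt(gamma) * ||x||_{Z^{-1}} (norm constraint on w ignored)."""
    scores = []
    for w_hat, Z, x in zip(w_hats, Zs, xs):
        width = np.sqrt(x @ np.linalg.solve(Z, x))  # ||x||_{Z^{-1}}
        scores.append(w_hat @ x + np.sqrt(gamma) * width)
    return int(np.argmax(scores))

# two nodes with identical weight estimates but different exploration levels
x = np.array([1.0, 0.0])
w_hats = [np.zeros(2), np.zeros(2)]
Zs = [10.0 * np.eye(2), np.eye(2)]   # node 0 has been queried more often
k = toof_select(w_hats, Zs, [x, x], gamma=1.0)
```

With equal estimates, the less-explored node receives the larger bonus and is picked, which is exactly the optimism that drives exploration in (13).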

The proposed strategy, i.e. Task Offloading with One-bit Feedback (TOOF), is summarized in Algorithm 1.

### 4.2 Theoretical Guarantees

We first analyze the convergence of $\hat{\mathbf{w}}_{k,t}$ in Proposition 1. The proof can be completed by following the proof of Theorem 1 in [15].

###### Proposition 1.

With a probability of at least $1 - \delta$, we have

$$\left\|\hat{\mathbf{w}}_{k,t} - \mathbf{w}_k\right\|_{\mathbf{Z}_{k,t}}^2 \le \gamma_{k,t}, \tag{14}$$

where

(15) |

(16) |

Proposition 1 indicates that the width of the confidence region, i.e. $2\sqrt{\gamma_{k,t}}$, is in the order of $\mathcal{O}(\sqrt{\log t})$ up to a particular constant. If the weight vector of each node were perfectly known, the task node could pick the node with the maximal probability of positive feedback. Thus, we define the optimal node in time slot-$t$ as node-$k_t^*$ such that

$$\left(k_t^*, \mathbf{x}_t^*\right) = \operatorname*{arg\,max}_{(k, \mathbf{x}) \in \mathcal{D}_t} \ \mathbf{w}_k^{\mathsf{T}}\mathbf{x}, \tag{17}$$

where the domain $\mathcal{D}_t$ is defined in (3). Accordingly, the instantaneous regret is written as follows:

$$r_t \triangleq \mathbf{w}_{k_t^*}^{\mathsf{T}}\mathbf{x}_t^* - \mathbf{w}_{k_t}^{\mathsf{T}}\mathbf{x}_{k_t,t}. \tag{18}$$
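The regret in (18) can be computed directly whenever the true weights are known, e.g. in simulations. A minimal sketch with hypothetical weights and features:

```python
import numpy as np

def instantaneous_regret(W, X, k_chosen):
    """Regret of (18): the gap between the best achievable score over all
    nodes, as in (17), and the score of the node actually chosen."""
    scores = np.einsum('kd,kd->k', W, X)   # w_k^T x_{k,t} for every node k
    return float(scores.max() - scores[k_chosen])

W = np.array([[0.6, 0.8], [1.0, 0.0]])   # hypothetical true weights
X = np.array([[0.6, 0.8], [0.0, 1.0]])   # hypothetical features in one slot
```

By construction the regret is zero exactly when the chosen node attains the maximal score, and nonnegative otherwise.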

The upper bound on the regret is given in Proposition 2.

###### Proposition 2.

With a probability of at least $1 - \delta$, the average regret is upper-bounded as

(19) |

where .

This proposition implies that the average regret approaches zero as time goes to infinity with overwhelming probability. The proof is given in the Appendix.

## 5 Numerical Results

In this section, we justify the performance of our algorithm by testing tasks and comparing with other algorithms. The tasks are allocated to fog nodes on demand. Besides, we assume that the data length (in KB) follows a given distribution. For each task, $\mathbf{x}_t$ is set to consist of five features. The specific features and the corresponding correlations to happiness are shown in Table 1. The parameter $\lambda$ is introduced to make sure that $\mathbf{Z}_{k,t}$ is invertible and barely affects the performance of our algorithm. Hence, we simply choose it according to [15]. The parameter $\gamma_{k,t}$ is tuned according to (15). It is worth noting that the value of $\gamma_{k,t}$ used here has the same order as that in (15) rather than the exact value. This is due to the fact that (15) only provides an upper bound on the estimation error of $\hat{\mathbf{w}}_{k,t}$, which may not be tight enough in terms of the constant mentioned in Proposition 1.

Table 1: Features and their correlations with the happiness metric.

| Feature | Task Length | Task Complexity | Queue Length | CPU Frequency | CQI |
|---|---|---|---|---|---|
| Correlation | | | | | |

In Fig. 2, we compare the performance of the TOOF algorithm with the Round-Robin and Greedy algorithms. In the round-robin algorithm, nodes are chosen in a cyclic sequence regardless of their current states. In the greedy algorithm, the task node chooses a helper node in each time slot under the same rule as TOOF, but its exploration parameter stays the same over time. It means that each element of the estimated weight vector is updated at the same pace. Clearly, Fig. 2 indicates that the regret of our proposed TOOF algorithm shows a tendency of converging to zero. Besides, the TOOF algorithm achieves a much lower regret than the other two algorithms.
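A self-contained toy simulation in the spirit of this comparison can be put together from the pieces above. It is not the paper's setup: the number of nodes, the random features, and the parameters $\lambda$, $\beta$, $\gamma$ are arbitrary choices, and the greedy baseline is modeled simply as TOOF with a zero exploration bonus.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, T = 4, 3, 2000
lam, beta, gamma = 1.0, 0.5, 1.0
# hypothetical true preference weights (unknown to the learner)
W = rng.normal(size=(N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def run(policy):
    """Simulate T offloading rounds under one policy; count happy feedback."""
    w_hat = np.zeros((N, d))
    Z = np.stack([lam * np.eye(d) for _ in range(N)])
    happy = 0
    for t in range(T):
        # fresh normalized feature vector for each node in this slot
        X = rng.normal(size=(N, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        if policy == "round_robin":
            k = t % N
        else:
            bonus = gamma if policy == "toof" else 0.0  # greedy: no bonus
            scores = [w_hat[j] @ X[j]
                      + np.sqrt(bonus * (X[j] @ np.linalg.solve(Z[j], X[j])))
                      for j in range(N)]
            k = int(np.argmax(scores))
        # one-bit feedback drawn from the logit model (2)
        p = 1.0 / (1.0 + np.exp(-W[k] @ X[k]))
        y = 1 if rng.random() < p else -1
        happy += (y == 1)
        # sequential update in the spirit of (8)-(9)
        Z[k] += 0.5 * beta * np.outer(X[k], X[k])
        g = -y * X[k] / (1.0 + np.exp(y * (w_hat[k] @ X[k])))
        w_hat[k] -= np.linalg.solve(Z[k], g)
        n = np.linalg.norm(w_hat[k])
        if n > 1.0:
            w_hat[k] /= n
    return happy

happy_toof = run("toof")
happy_rr = run("round_robin")
```

In runs of this toy model, the learned, optimistic policy accrues noticeably more happy feedback than the state-oblivious round-robin baseline, mirroring the qualitative behavior reported in Fig. 2.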

The performance of our proposed scheme is also justified by the comparison among rewards in Fig. 3. Comparing with (4), we can find that the reward defined in Fig. 3 is also a happiness metric, obtained by denoting happy (unhappy) feedback by $1$ ($-1$). As a reference, Optimal is employed to show the performance in the case of perfect knowledge, where the node is chosen as in (17). Our algorithm soon begins to show its superiority to the greedy algorithm and keeps widening the gap. Fig. 3 also illustrates that the reward obtained via the TOOF algorithm approaches the optimal one. This phenomenon demonstrates that, as the number of tasks grows, our TOOF algorithm is capable of handling the tradeoff between learning the system parameters and reaping a high immediate reward.

## 6 Conclusions

In this paper, an efficient task offloading strategy with one-bit feedback and the corresponding performance guarantee have been investigated. With the unknown weight vectors of the helper nodes and probabilistic feedback, a multi-armed bandit framework has been proposed. We implemented an efficient TOOF algorithm derived from a UCB-type algorithm. We have also proven that the average regret vanishes as the time horizon grows. What's more, we carried out numerical simulations and demonstrated the superiority of our TOOF algorithm.

## References

- [1] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet Things J., vol. 3, no. 6, pp. 854–864, Dec. 2016.
- [2] T. Q. Dinh, J. Tang, Q. D. La, and T. Q. S. Quek, “Offloading in mobile edge computing: Task allocation and computational frequency scaling,” IEEE Trans. Commun., vol. 65, no. 8, pp. 3571–3584, Aug. 2017.
- [3] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, Aug. 2017.
- [4] Y. Yang, K. Wang, G. Zhang, X. Chen, X. Luo, and M. Zhou, “MEETS: Maximal energy efficient task scheduling in homogeneous fog networks,” submitted to IEEE Internet Things J., 2017.
- [5] C. You, K. Huang, H. Chae, and B.-H. Kim, “Energy-efficient resource allocation for mobile-edge computation offloading,” IEEE Trans. Wireless Commun., vol. 16, no. 3, pp. 1397–1411, Mar. 2017.
- [6] J. Kwak, Y. Kim, J. Lee, and S. Chong, “DREAM: Dynamic resource and task allocation for energy minimization in mobile cloud systems,” IEEE J. Sel. Areas Commun, vol. 33, no. 12, pp. 2510–2523, Dec. 2015.
- [7] Y. Mao, J. Zhang, S. H. Song, and K. B. Letaief, “Stochastic joint radio and computational resource management for multi-user mobile-edge computing systems,” IEEE Trans. Wireless Commun., vol. 16, no. 9, pp. 5994–6009, Sept. 2017.
- [8] Y. Yang, S. Zhao, W. Zhang, Y. Chen, X. Luo, and J. Wang, “DEBTS: Delay energy balanced task scheduling in homogeneous fog networks,” IEEE Internet Things J., in press.
- [9] L. Pu, X. Chen, J. Xu, and X. Fu, “D2D fogging: An energy-efficient and incentive-aware task offloading framework via network-assisted D2D collaboration,” IEEE J. Sel. Areas Commun., vol. 34, no.12, pp. 3887–3901, Dec. 2016.
- [10] T. Chen and G. B. Giannakis, “Bandit convex optimization for scalable and dynamic IoT management”, IEEE Internet Things J., in press.
- [11] Z. Zhu, T. Liu, S. Jin, and X. Luo, “Learn and pick right nodes to offload”, arXiv preprint arXiv:1804.08416, 2018.
- [12] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Mach. Learn., vol. 47, no. 2, pp. 235–256, May 2002.
- [13] D. A. Berry and B. Fristedt, Bandit Problems: Sequential Allocation of Experiments. London, U.K.: Chapman & Hall, 1985.
- [14] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Found. Trends Mach. Learn., vol. 5, no. 1, pp. 1–122, 2012.
- [15] L. Zhang, T. Yang, R. Jin, Y. Xiao, and Z. Zhou, “Online stochastic linear optimization under one-bit feedback,” Proc. ICML, New York, NY, USA, Jun. 2016, pp. 392–401.
- [16] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for linear stochastic bandits,” Advances in Neural Information Processing Systems, Granada, Spain, Dec. 2011.

## Appendix

With a probability of at least $1 - \delta$, the instantaneous regret can be upper-bounded as follows:

$$r_t \overset{(a)}{\le} \left(\tilde{\mathbf{w}}_t - \mathbf{w}_{k_t}\right)^{\mathsf{T}}\mathbf{x}_{k_t,t} \overset{(b)}{\le} \left\|\tilde{\mathbf{w}}_t - \mathbf{w}_{k_t}\right\|_{\mathbf{Z}_{k_t,t}} \left\|\mathbf{x}_{k_t,t}\right\|_{\mathbf{Z}_{k_t,t}^{-1}} \overset{(c)}{\le} 2\sqrt{\gamma_{k_t,t}}\,\left\|\mathbf{x}_{k_t,t}\right\|_{\mathbf{Z}_{k_t,t}^{-1}}, \tag{20}$$

where $(b)$ holds due to the Cauchy–Schwarz inequality. The inequality $(a)$ holds with a probability of at least $1 - \delta$ according to Proposition 1: when $\mathbf{w}_{k_t^*} \in \mathcal{W}_{k_t^*,t}$, the optimistic score in (10) satisfies $\tilde{\mathbf{w}}_t^{\mathsf{T}}\mathbf{x}_{k_t,t} \ge \mathbf{w}_{k_t^*}^{\mathsf{T}}\mathbf{x}_t^*$. The inequality $(c)$ follows since both $\tilde{\mathbf{w}}_t$ and $\mathbf{w}_{k_t}$ lie in the ball $\mathcal{W}_{k_t,t}$ of radius $\sqrt{\gamma_{k_t,t}}$.

On the other hand, the following inequality always holds:

$$r_t = \mathbf{w}_{k_t^*}^{\mathsf{T}}\mathbf{x}_t^* - \mathbf{w}_{k_t}^{\mathsf{T}}\mathbf{x}_{k_t,t} \le 2, \tag{21}$$

which follows since $\mathbf{w}_k$ and $\mathbf{x}_t$ have been normalized, so that each inner product lies in $[-1, 1]$.

Thus the total regret can be upper-bounded by

$$\sum_{t=1}^{T} r_t \le \sum_{t=1}^{T} \min\left\{2,\ 2\sqrt{\gamma_{k_t,t}}\left\|\mathbf{x}_{k_t,t}\right\|_{\mathbf{Z}_{k_t,t}^{-1}}\right\}. \tag{22}$$