Semantic Compression for Edge-Assisted Systems
Abstract
A novel semantic approach to data selection and compression is presented for the dynamic adaptation of IoT data processing and transmission within “wireless islands”, where a set of sensing devices (sensors) are interconnected through one-hop wireless links to a computational resource via a local access point. The core of the proposed technique is a cooperative framework where local classifiers at the mobile nodes are dynamically crafted and updated based on the current state of the observed system, the global processing objective and the characteristics of the sensors and data streams. The edge processor plays a key role by establishing a link between content and operations within the distributed system. The local classifiers are designed to filter the data streams and provide only the needed information to the global classifier at the edge processor, thus minimizing bandwidth usage. However, the better the accuracy of these local classifiers, the larger the energy necessary to run them at the individual sensors. A formulation of the optimization problem for the dynamic construction of the classifiers under bandwidth and energy constraints is proposed and demonstrated on a synthetic example.
I Introduction
The Internet of Things (IoT) paradigm [1] envisions a scenario where machines remotely interact to provide services and perform monitoring and control tasks. To this aim, the IoT realizes a network of data sources, mobile devices, and processing centers interconnected through wireless and wireline links, where local and global algorithms cooperate in a distributed fashion.
Sophisticated large-scale application scenarios such as Smart City systems [2] and intelligent (or autonomous) vehicular networks [3, 4] push the limits of IoT systems in sensing, communication and processing capabilities. To address the need for tight control loops, timely coordination and computation-intensive processing, Fog and Edge Computing architectures [5, 6] place computation resources at the edge of the wireless access infrastructure. In these architectures, mobile devices can offload computational tasks to edge data processors through one-hop low-latency links. The co-location of sensing and processing within a star topology allows reliable local coordination of remote devices informed by global resources, such as databases and data centers in the cloud. However, the limited and time-varying bandwidth available in wireless environments makes the design of edge-based architectures challenging. This especially applies in those scenarios where IoT data streams coexist with other services on the same channel and network resource.
In this paper, we propose a framework for the dynamic adaptation of IoT data processing and transmission within “wireless islands”, where a set of sensing devices (sensors) are interconnected with one-hop wireless links to a computational resource through a local access point (e.g., a cellular base station or a Wi-Fi access point). We specifically address an application scenario where the sensors and the edge processor cooperatively perform a real-time data acquisition and processing task, such as classification or detection based on environmental observations (see Fig. 1). The challenge, then, is to accomplish such a task under the bandwidth, computational power and energy constraints imposed by the limited resources available at the device and network levels.
The core of the framework is a novel “semantic” approach to data selection and compression, where local classifiers at the mobile nodes are dynamically crafted and updated based on the current state of the observed system and its processing objective, together forming a continuously evolving context. The edge processor plays a key role by establishing a link between content and operations within the distributed system. The local classifiers are designed to filter the data streams and provide only the needed information to the global classifier at the edge processor, thus minimizing bandwidth usage. However, the better the accuracy of these local classifiers, the larger the energy necessary to run them at the individual sensors. Our framework builds on recent results [7, 8], where classifier simplifications are applied to the problem of explaining the outcome of black box machine learning algorithms.
An interesting connection can be made to traditional multimedia compression techniques, where the components imperceptible to humans are removed. Thus, distortion of the original signal is accepted in those regions that are not needed by the final application. This research extends this principle to data consumed by machines for general computational purposes. Additionally, we expand the traditional focus on bandwidth compression alone with the notion of energy awareness.
The rest of the paper is organized as follows. Section II introduces the general scenario and describes the problem addressed herein. In Section III, we present the semantic compression framework, and illustrate its key components on an example problem in Section IV. Section V concludes the paper.
II Problem Formulation
Recent advances in machine learning have resulted in sophisticated models, which provide highly capable detectors of interest to IoT applications, particularly for image and video processing. Instead of working only for niche or synthetic settings, these classifiers are able to handle real-world input from a large variety of environments. As a consequence, the resulting classifiers often tend to be too complex in structure, and can only reside on devices capable of handling computationally intensive tasks. However, mobile sensors collecting the data for processing have only limited observational power, computational capabilities, and energy availability. Hence, due to constraints in these resources, they often cannot support such complex classifiers. Fog and Edge architectures offer a solution to this issue by introducing computational resources within the local wireless island. However, bandwidth constraints, often imposed by other competing services, limit the data that can be transferred from the sensors to the computational resources. In these circumstances, pre-filtering the data at the sensors becomes necessary to avoid delay, data loss, or undesirable disruption of other wireless services.
A sketch of the architecture at the center of our studies is shown in Fig. 1, where a set of sensors acquire observations in some dynamic environment. The sensors are wirelessly interconnected through a local access point (e.g., base station) to an edge processor. The edge processor is assigned a computational task (possibly changing in time), such as the identification of human activities in public parks or traffic dangers in autonomous vehicle networks. This task corresponds to one or more classifiers taking the data streams from the sensing devices as their inputs. The goal of the global classifiers is to achieve an average accuracy , measured in terms of classification errors.
For ease of explanation, we introduce the notion of temporal period, where time is discretized and indexed with . The sensors are connected to the edge processor through wireless links of capacity , , in the period . A constraint on the overall capacity available to the sensors, where , can be introduced to capture channel sharing. The signal acquired by a sensor in the time period is . Each sensor has an energy storage for processing and transmission, where the amount of energy available at sensor in period is equal to . The energy storage can be refilled through charging or energy harvesting, modeled as a random arrival process. The goal of the system is to guarantee the desired accuracy at the edge processor using the available bandwidth and energy. Fig. 2 illustrates the components of the system for an individual sensor .
The sensors implement local classifiers which serve the purpose of filtering out unusable data, defined as the data that are not needed for maintaining the target accuracy at the edge processor. While the amount of data transferred from the sensors to the edge is bounded by the time-varying capacity of the channel, the efficacy in locally removing unnecessary data is bounded by the processing power and energy availability at the sensors. On the one hand, the transmission of unfiltered data may violate the bandwidth constraint, thus causing data loss and disruption of existing wireless services. On the other hand, running a complex local classifier may demand excessive computational effort and energy from the mobile devices.
We formulate an optimization problem capturing the tension between these two extremes for the purposes of dynamic adaptation of filters deployed at the sensors. Based on the input from the sensors, the edge processor periodically produces a new filter with controlled complexity for each sensor, subject to bandwidth and energy usage constraints following from high-level operational objectives. Herein, we focus on building customized classifiers possessing the following characteristics:

Locality. The sensor-specific classifiers will be trained to achieve a certain accuracy level for the kinds of inputs the sensor is likely to receive. For instance, the local classifiers will be built to provide low-error predictions for indoor images if the sensor is placed inside.

Bandwidth-Awareness. The local classifiers are designed to be used as bandwidth-preserving filters, thus optimizing for the false-negative rate to meet the bandwidth constraints imposed by the link to the global edge processor.

Complexity and Energy-Awareness. The design of the local classifiers will satisfy complexity and energy requirements of the sensor as determined by a stochastic energy-arrival process.
Given the complex, accurate classifier at the edge, our objective is to build a sensor-specific classifier tailored to the distribution of samples in the current period, and satisfying the bounded complexity and bandwidth usage. More formally, we are provided with a pre-trained binary classifier, e.g., one detecting whether a person is visible by the sensor, denoted by , where is the space of possible inputs. We treat this classifier as a black-box function in order to support as wide a variety of machine learning algorithms as possible.
For a sensor during period , the goal is to identify a local classifier , , that meets the specifications of the sensor, where is the family of machine learning classifiers we want the sensor to use (for instance, linear classifiers). In particular, we are provided with the following requirements corresponding to the aforementioned characteristics:

Locality : The expected distribution of the sensor inputs for period is denoted by . We want to be as accurate as as possible on inputs from this distribution.

Bandwidth : The average amount of data allowed to be transmitted by for the period should be less than . (It is also possible to consider a generalization where only the total capacity for all sensors is provided.)

Energy : Average energy used by for the period should be less than .
In this work we assume that the customized classifier will be built on the edge, not the sensor, and thus the computational efficiency of estimating is not restricted.
III Semantic Compression
In this section, we outline our proposed approach to constructing a classifier that meets the sensor’s requirements on energy, bandwidth, and locality for the period , while still being faithful to the complex, global classifier .
Energy Efficiency
The primary obstacle with using at the sensor level is its computational complexity. For instance, each prediction by a neural network can often take hundreds to thousands of floating-point computations, resulting in heavy power consumption. Instead, we are concerned with learning an energy-efficient classifier , for being limited to a simpler model family, such as SVMs, decision trees, linear classifiers, etc. We define the energy consumed by for an input as ; the average energy used by the sensor for period will be . We also define a penalty on the classifier for violating an energy constraint as , such that if meets the energy requirement , and otherwise. Since directly estimating the energy consumption of a classifier is challenging, we use the number of computational operations as a proxy, and thus penalizes the more operations it requires for a prediction.
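As a minimal sketch of the operation-count proxy described above, the following Python fragment counts the arithmetic operations of a linear classifier and applies a hinge-style penalty on the average count. The function names, the hinge form of the penalty, and the budget values are illustrative assumptions, not the definitions used in the paper.

```python
def linear_op_count(n_features: int) -> int:
    """Operations for one prediction of a linear classifier sign(w.x + b).

    n multiplications, n additions (n - 1 to sum the products, 1 for the
    bias) and 1 comparison against the threshold.
    """
    return 2 * n_features + 1


def energy_penalty(op_counts, budget):
    """Hypothetical penalty: zero if the average operation count per input
    fits the budget, and the excess over the budget otherwise."""
    average = sum(op_counts) / len(op_counts)
    return max(0.0, average - budget)


# Five predictions on 10-dimensional inputs, checked against two budgets.
ops = [linear_op_count(10)] * 5
within_budget = energy_penalty(ops, budget=32)   # no violation
over_budget = energy_penalty(ops, budget=16)     # violation by 5 operations
```

Any penalty that is zero when the requirement is met and positive otherwise would fit the description; the hinge is only the simplest such choice.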
Locality
Obviously, an energy-efficient classifier , by using a simpler structure, cannot have the same general representation capabilities as the global classifier for the complete range of inputs. However, in any given time period, most sensors do not receive the full variety of inputs that the global classifier is designed to support, and thus it is possible to have focus its representation on the inputs expected at the sensor. In order to identify such a , we use the expected distribution of inputs, , to compute how similar is to . In particular, given a loss function between ’s and ’s predictions on an instance , e.g., the squared loss or the logistic loss , we evaluate the similarity between and as . Fig. 3 illustrates the intuition, where a complex, power-consuming global classifier (solid gray curve) can be approximated quite well locally by a simple, and thus energy-efficient, classifier (dashed bold line).
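The locality argument can be illustrated with a small Monte Carlo estimate of the expected loss between a complex classifier and a simple one. In this sketch, the circular decision boundary, the tangent-line local classifier, and the two sampling boxes are all hypothetical choices made for illustration; only the idea of averaging a loss over the expected input distribution comes from the text.

```python
import random


def squared_loss(a, b):
    return (a - b) ** 2


def expected_loss(f, g, sample_input, loss=squared_loss, n=2000, seed=0):
    """Monte Carlo estimate of the average loss between f and g on inputs
    drawn by sample_input(rng)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_input(rng)
        total += loss(f(x), g(x))
    return total / n


def f(x):
    """Stand-in for the global classifier: positive inside the unit circle."""
    return 1 if x[0] ** 2 + x[1] ** 2 < 1.0 else 0


def g(x):
    """Linear local classifier: the tangent line to the circle at (1, 0)."""
    return 1 if x[0] < 1.0 else 0


def local_box(rng):
    """Inputs expected at the sensor: a small box around (1, 0)."""
    return (rng.uniform(0.9, 1.1), rng.uniform(-0.1, 0.1))


def global_box(rng):
    """The complete range of inputs."""
    return (rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0))


local_loss = expected_loss(f, g, local_box)
global_loss = expected_loss(f, g, global_box)
```

The linear classifier disagrees with the circular one on most of the full input range, while on the small local box the two nearly coincide, which is the effect sketched in Fig. 3.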
Bandwidth Awareness
Every automated detector is accompanied by a certain level of expected error, often measured as the rate of false positives and false negatives. Due to the energy constraints on the desired classifier , it may not be able to maintain the same low error levels as the global classifier , even on the local distribution of inputs. In such situations, we can treat as the sensor-level filtering of the inputs, with running at the edge level to achieve the same low error levels. Thus there is a tradeoff between how much bandwidth is used to transmit false positives and how often a relevant input is missed in order to conserve bandwidth. We define the amount of data will use for an input as ; the average data transmitted by the sensor for period will be . We further define the penalty on the classifier for violating the bandwidth as , such that if uses less than bandwidth, and otherwise. Fig. 3 shows an example where a classifier that is not aware of its use as a filter (the leftmost example) may transmit less but have a high error rate, while a bandwidth-aware classifier (in the middle) will obtain a lower false-negative rate.
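The tradeoff between transmitted data and missed detections can be sketched with a score-thresholding filter. Everything in this fragment (the threshold rule, the unit data cost per transmitted input, the hinge-style penalty, and the toy scores) is an assumption made for illustration rather than the formulation used in the paper.

```python
def filter_stats(scores, labels, threshold, data_per_input=1):
    """Average data sent per input and false-negative rate of a filter
    that transmits an input when its score reaches the threshold."""
    sent = [s >= threshold for s in scores]
    avg_data = data_per_input * sum(sent) / len(sent)
    positives = sum(labels)
    missed = sum(1 for s, y in zip(sent, labels) if y and not s)
    fn_rate = missed / positives if positives else 0.0
    return avg_data, fn_rate


def bandwidth_penalty(avg_data, capacity):
    """Hypothetical penalty: zero when the average transmitted data stays
    within the capacity, and the excess otherwise."""
    return max(0.0, avg_data - capacity)


# Toy local-classifier scores and the labels of the global classifier.
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [0, 0, 1, 1, 1, 0]

strict = filter_stats(scores, labels, threshold=0.5)   # sends less, misses one positive
lenient = filter_stats(scores, labels, threshold=0.3)  # sends more, misses none
```

Lowering the threshold removes the false negative at the cost of one extra transmitted false positive, so the choice of threshold is where the bandwidth constraint enters.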
Semantic Compression
From the sensor specifications, namely local distribution , energy consumption constraint and penalty function , bandwidth constraint and penalty function , and the global classifier , we can frame the search for the sensor-specific classifier as the following optimization problem to be solved periodically over time:
(1)  
s.t.  (2)  
(3) 
Here and have the meaning of the tolerances on the expected penalties and for random observations following a given locality distribution .
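One way to write a problem of this shape, under assumed notation, is given below. Here $f$ stands for the global classifier, $g$ for the local classifier drawn from a family $\mathcal{G}$, $\mathcal{D}$ for the locality distribution, $L$ for the loss between the two predictions, $P_E$ and $P_B$ for the energy and bandwidth penalties, and $\varepsilon_E$ and $\varepsilon_B$ for their tolerances. All of these symbols are illustrative choices, and the sensor and period indices are omitted for brevity.

```latex
\begin{align*}
\min_{g \in \mathcal{G}} \quad
  & \mathbb{E}_{x \sim \mathcal{D}} \big[ L\big(f(x), g(x)\big) \big] \\
\text{s.t.} \quad
  & \mathbb{E}_{x \sim \mathcal{D}} \big[ P_E(g, x) \big] \le \varepsilon_E, \\
  & \mathbb{E}_{x \sim \mathcal{D}} \big[ P_B(g, x) \big] \le \varepsilon_B .
\end{align*}
```

The objective expresses faithfulness to the global classifier on the expected inputs, while the two constraints bound the expected energy and bandwidth penalties.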
The distribution serves a proxy role conveying to the edge processor a local description of expected observations at the sensor, without wasting the bandwidth for transmitting the observations themselves. The edge processor, in turn, replies to the sensor with a classifier , locally tuned to according to the problem in Eq. (1)–(3). For each particular sensor and time period, the distribution is fixed, so the efficacy of this semantic compression scheme is determined by whether the family of local classifiers is flexible enough for the distribution of positive and negative samples in . However, at a larger scope, the locality may vary and is subject to negotiation between the sensor and the edge processor.
With the shape of the locality controllable, the quality of corresponding classifiers may be improved additionally through locality tuning. This brings the option to view the optimization in Eq. (1)–(3) as a subproblem for a higher-level control task, maintaining a desired aptitude of classifiers on a sequence of observations generated by the sensor. This way, the problem of finding optimal local may be extended to the broader adaptive-control problem of maintaining a desired accuracy of filtration by adjusting the locality-capturing procedure delivering distributions , such that
(4) 
The penalty function stands for the losses we bear from any inadequacies of the local classifier to the particular choice of locality , which we would like to keep bounded by a tolerance . Here, the quality is monitored for inputs from some control distribution chosen by the edge processor using the empirical data arriving from the sensor and the a priori strategic objectives for the ultimate outcomes of the sensor-edge system as a whole. In practice, may coincide with the global observatory distribution , the locality distribution , or can be derived from the sequence of empirical observations obtained by the sensor . In Eq. (4), the locality is made an argument of the penalty to highlight its potential role as the control “variable”. One simple example giving an idea of how localities may be parametrized and controlled will be given in the following section.
IV Simulation Results
In order to illustrate the feasibility of the proposed approach, let us consider a motivating example of a binary classification problem, in the context of a single sensor-edge pair (for this reason we omit the index below, for the sake of brevity).
As customary, input observations subject to classification come as feature vectors in a multidimensional vector space . The two classes correspond to the sets of observations that are to be registered by the sensor-edge system (positives), versus the rest (negatives). In this case, probability distributions of both classes are set to be Gaussian mixtures (and so is, therefore, the joint distribution ). Both mixtures consist of the same number of symmetric normally distributed components centered equidistantly on a number of lines parallel to the main diagonal of the unit hypercube.
For simplicity, we assume that both and belong to the same class of Support Vector Machine classifiers (SVMs) working in the space . To satisfy the requirement of having a lower complexity than , the class is limited to SVMs with linear kernels, while the reference global classifier is trained for the kernel of Gaussian radial basis functions (and can be replaced with an even more computationally intensive classifier). Each locality distribution guiding the selection of training samples for the on-sensor classifiers is set to be a uniform distribution in a sphere described by its center and radius . By the nature of the distribution , the local and global accuracy of the classifier is expected to not differ significantly, while the accuracy of localized classifiers shall be sensitive to the localities and their sizes .
In these circumstances, the applicability of the problem statements given in Section II to this detection task requires a study of two aspects of the system: (i) The accuracy of the localized classifiers for different spheres as a function of radii and the update frequency . (ii) Realization of actual distributions of consecutive observations in the data for a desired update frequency, and the procedure for adaptively choosing the radii reacting to the accuracy-complexity tradeoff.
To this end, both in this specific example and in general, we need to be in possession of two samples. First, a labeled training dataset of pairs is necessary, where points are drawn from the joint distribution of observations , and signify the corresponding labels. We can assume the availability of this sample without any loss of generality, as the very problem setting given in Section II starts with a classifier that has to be trained on some sample, which we can reuse here for . In the unsupervised case, for the purposes of the following discussion, labels can be defined by the outcomes of the global classifier .
Second, it is necessary to have a sample of one or more trajectories , representative of the sequential process generating observations on the sensor. In practice, this sample can be obtained from previous, non-adaptive runs of the sensor-edge system in question, where all sensor observations eventually reach and get accumulated at the edge processor. In this example problem, we assume that the trajectory distribution follows the general distribution (which would likely also be the case in general, unless the nature of the observation process dictates otherwise). Adhering to this assumption, we generate a sample as a Markov chain starting from a randomly chosen point and continuing by applying the Metropolis-Hastings sampling algorithm to the distribution .
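A trajectory of this kind can be sketched with a random-walk Metropolis-Hastings sampler over a Gaussian mixture. The mixture centers, component width, proposal step size, and starting point below are hypothetical values chosen for illustration, and the symmetric Gaussian proposal is an assumption of this sketch.

```python
import math
import random


def mixture_density(x, centers, sigma):
    """Unnormalized density of an equal-weight isotropic Gaussian mixture."""
    return sum(
        math.exp(-sum((a - c) ** 2 for a, c in zip(x, center)) / (2 * sigma ** 2))
        for center in centers
    )


def metropolis_hastings(density, x0, n_steps, step=0.05, seed=0):
    """Random-walk Metropolis-Hastings: propose a Gaussian perturbation and
    accept it with probability min(1, density(candidate) / density(current)).
    The proposal is symmetric, so no proposal-ratio correction is needed."""
    rng = random.Random(seed)
    x, px = list(x0), density(x0)
    if px <= 0.0:
        raise ValueError("the starting point must have positive density")
    trajectory = [tuple(x)]
    for _ in range(n_steps):
        candidate = [a + rng.gauss(0.0, step) for a in x]
        pc = density(candidate)
        if rng.random() < pc / px:
            x, px = candidate, pc
        trajectory.append(tuple(x))  # a rejected proposal repeats the current point
    return trajectory


# Three components on the main diagonal of the unit square (illustrative).
centers = [(0.2, 0.2), (0.5, 0.5), (0.8, 0.8)]
trajectory = metropolis_hastings(
    lambda x: mixture_density(x, centers, sigma=0.1), (0.5, 0.5), n_steps=500
)
```

Because each point is a small perturbation of its predecessor, consecutive observations stay close together, which is what makes short subsequences of the trajectory spatially compact.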
The two aforementioned aspects of the system, then, can be studied through the following duplex sampling procedure (schematically depicted in Fig. 4).

For each update frequency (or, equivalently, the length of the update period in the number of observations), draw a sample of subsequences of consecutive observations along the trajectory .

For each subsequence :

Find the minimal sphere containing all (or a given percentage of) points .

Sample points from the general training sample uniformly inside of the sphere .

Using the points in as a training sample, fit a classifier to a desired quality.

Apply the classifier to the points in the subsequence , comparing the verdicts of to the corresponding verdicts of the reference classifier for those same points in .

Store the radius of the sphere and the resulting accuracy of the localized classifier on the points in .

With the accumulated statistics of radii and accuracies , it is then possible for us to compute the empirical averages of both of these features over trajectory’s subsequences as functions of update period .
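The procedure above can be sketched in pure Python as follows. Several substitutions are made to keep the fragment short and self-contained, and all of them are assumptions of this sketch: a perceptron stands in for the linear-kernel SVM, a circular decision rule stands in for the reference classifier, a centroid-centered covering sphere replaces the minimal sphere, and the trajectory is a deterministic curve rather than a Markov chain.

```python
import math
import random


def bounding_sphere(points):
    """Centroid-centered sphere covering all points (not the minimal one)."""
    dim = len(points[0])
    center = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    return center, max(math.dist(center, p) for p in points)


def fit_perceptron(samples, epochs=50):
    """Fit a linear classifier (w.x + b > 0) with the perceptron rule on
    (point, label) pairs with 0/1 labels."""
    w, b = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:
                sign = 1 if y == 1 else -1
                w = [wi + sign * xi for wi, xi in zip(w, x)]
                b += sign
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def duplex_sampling(trajectory, train_points, reference, period, n_subseq, rng):
    """Average sphere radius and local-classifier accuracy over random
    subsequences of `period` consecutive trajectory points."""
    radii, accuracies = [], []
    for _ in range(n_subseq):
        start = rng.randrange(len(trajectory) - period + 1)
        subseq = trajectory[start:start + period]
        center, radius = bounding_sphere(subseq)
        # Training points inside the sphere, labeled by the reference classifier.
        local = [(x, reference(x)) for x in train_points
                 if math.dist(center, x) <= radius]
        if not local:
            continue  # no training data fell inside this sphere
        g = fit_perceptron(local)
        agree = sum(g(x) == reference(x) for x in subseq)
        radii.append(radius)
        accuracies.append(agree / period)
    if not radii:
        raise ValueError("no subsequence had training points in its sphere")
    return sum(radii) / len(radii), sum(accuracies) / len(accuracies)


rng = random.Random(0)
reference = lambda x: 1 if x[0] ** 2 + x[1] ** 2 < 1.0 else 0
train_points = [(rng.uniform(-1.5, 1.5), rng.uniform(-1.5, 1.5)) for _ in range(8000)]
# A smooth curve that repeatedly crosses the reference decision boundary.
trajectory = [((1 + 0.2 * math.sin(0.3 * t)) * math.cos(0.02 * t),
               (1 + 0.2 * math.sin(0.3 * t)) * math.sin(0.02 * t)) for t in range(600)]

r_short, acc_short = duplex_sampling(trajectory, train_points, reference, 5, 20, rng)
r_long, acc_long = duplex_sampling(trajectory, train_points, reference, 50, 20, rng)
```

Running the function for a range of period lengths yields the two curves discussed next: average radius and average accuracy as functions of the update period.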
Figs. 5 and 6 demonstrate these functional relations in the case of our motivating example for a multidimensional Gaussian sample. The former figure depicts the average radius of spheres containing of the points in subsequences for different values of . As we can see, the average radius quickly grows as the update period increases. The latter figure highlights the opposite trend: the accuracy of locally fit classifiers almost monotonically decreases with increasing period of updates. For comparison, the accuracy of the global classifier when it is implemented as an RBF-kernel SVM fluctuates insignificantly around independent of the update frequency .
Here both and were trained to treat false positives and false negatives equally; in cases where it is intolerable to miss detections due to localized approximation, the same trends will be present for respectively adjusted . The choice of update frequency can be guided by the penalty taken by the accuracy when the classifier trained for a locality is kept for use in the subsequent localities without an update. For our example this relation is summarized in Fig. 7, showing the change in mean accuracy of a local classifier as a function of the delay between its training and its usage. The x-axis measures the delay relative to the update period length . The y-axis measures the ratio between the mean accuracy for the trajectory subsequence corresponding to the moment a local classifier is used and the mean accuracy for the trajectory subsequence corresponding to the moment locality was captured.
All three of these relations confirm the feasibility of the assumptions underlying the problem formulation: while simpler local classifiers have poor accuracy globally, with frequent locality updates their quality reaches a satisfactory level comparable to that of the global classifier . The ultimate quality of the resulting system will, of course, depend significantly on the mutual compatibility of the data distribution (governing the complexity of the global classifier ), the family of local classifiers , the form of locality distributions , and the constraints on the desired accuracy. For instance, when sensor sampling trajectories do not exhibit enough compactness as measured by the form and , it might be problematic or even impossible to achieve very high levels of accuracy with the localized substitution classifiers . In each particular case, the limits of the achievable results should be studied separately, e.g., using the above trajectory-sampling procedure.
For problems where, like in our example here, the locality of the space can be exploited well for a given and , it opens the possibility for an efficient adaptation of the locality as a function of some control parameters . For instance, here the update period can serve the role of the parameter , with the control objective consisting in keeping it smaller than some guaranteeing a desired accuracy (according to Fig. 6).
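As a minimal sketch of using the update period as the control parameter, the fragment below picks the largest period for which the measured accuracy, and the accuracy at every shorter measured period, stays at or above a target. The function name and the accuracy figures are made up for illustration and are not results from the paper.

```python
def largest_safe_period(accuracy_by_period, target):
    """Largest measured update period such that it and all shorter measured
    periods reach the target accuracy; None if even the shortest fails."""
    best = None
    for period in sorted(accuracy_by_period):
        if accuracy_by_period[period] < target:
            break
        best = period
    return best


# Hypothetical accuracy measurements, indexed by update period length.
measured = {5: 0.97, 10: 0.95, 20: 0.93, 40: 0.88}
chosen = largest_safe_period(measured, target=0.92)
```

Stopping at the first failing period, rather than taking the maximum over all passing periods, guards against the accuracy curve being only almost monotone.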
V Conclusions
Sophisticated IoT systems often involve combining sensing, communication, and processing capabilities. Recent architectures for such IoT systems often perform expensive computation at the edge level, in order for the mobile devices to utilize their limited energy for sensing and transmission. However, such architectures often cannot meet the tight constraints of a time-varying or limited bandwidth availability, as is common in real-world applications, due to their need to communicate all of the data from the sensor-level devices to the edge.
In this paper, we proposed an alternative architecture where the edge and the devices perform the computation cooperatively. The core of our proposed approach is to provide a “semantic” strategy for carrying out this sharing of the computation: we dynamically craft customized classifiers for each sensor that define what the sensor device will communicate to the edge processor, thus offloading the majority of the computation to these devices. This proposed design of sensor-specific classifiers takes into account the various properties of the current context such as the sensor-specific distribution of inputs that the device is likely to observe, the energy resources and constraints on the device, and the time-varying limitations on the shared bandwidth to the edge.
We showed the feasibility of our semantic approach using simulated experiments. We demonstrated that simple, energy-efficient classifiers can be as accurate in classification as complex classifiers if we utilize the distribution of inputs that the sensing device is likely to receive when constructing them. We further showed that the approach is fairly robust to changes in this distribution of inputs over time. Although the classifiers need to be updated as the current context of the sensors and the edge changes over time, we also demonstrated that the sensor-specific classifiers still maintain accuracy even if they are not updated very frequently. With these encouraging results, we are interested in deploying such an architecture on real-world IoT testbeds in the future.
References
[1] L. Atzori, A. Iera, and G. Morabito, “The Internet of Things: A survey,” Computer Networks, vol. 54, no. 15, pp. 2787–2805, 2010.
[2] P. Neirotti, A. D. Marco, A. Cagliano, G. Mangano, and F. Scorrano, “Current trends in smart city initiatives: Some stylised facts,” Cities, vol. 38, pp. 25–36, 2014.
[3] C. T. Barba, M. A. Mateos, P. R. Soto, A. M. Mezher, and M. A. Igartua, “Smart city for VANETs using warning messages, traffic statistics and intelligent traffic lights,” in Intelligent Vehicles Symposium (IV), 2012 IEEE. IEEE, 2012, pp. 902–907.
[4] F. J. Martinez, C.-K. Toh, J.-C. Cano, C. T. Calafate, and P. Manzoni, “Emergency services in future intelligent transportation systems based on vehicular communication networks,” IEEE Intelligent Transportation Systems Magazine, vol. 2, no. 2, pp. 6–20, 2010.
[5] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli, “Fog computing and its role in the internet of things,” in Proceedings of the First Edition of the MCC Workshop on Mobile Cloud Computing, ser. MCC ’12, 2012, pp. 13–16.
[6] M. Satyanarayanan, P. Simoens, Y. Xiao, P. Pillai, Z. Chen, K. Ha, W. Hu, and B. Amos, “Edge analytics in the internet of things,” IEEE Pervasive Computing, vol. 14, no. 2, pp. 24–31, 2015.
[7] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’: Explaining the predictions of any classifier,” in Knowledge Discovery and Data Mining (KDD), 2016.
[8] M. T. Ribeiro, S. Singh, and C. Guestrin, “Model-agnostic interpretability of machine learning,” in ICML Workshop on Human Interpretability in Machine Learning (WHI), 2016.