# User Preference Aware Lossless Data Compression at the Edge

###### Abstract

Data compression is an efficient technique for reducing data storage and transmission costs. However, traditional data compression methods ignore the impact of user preferences on the statistical distributions of the symbols transmitted over the links. The development of big data technologies and the popularization of smart devices now enable analyses of user preferences based on data collected from personal handsets. This paper presents a user preference aware lossless data compression method, termed edge source coding, to compress data at the network edge. An optimization problem is formulated to minimize the expected number of bits needed to represent a requested content item in edge source coding. For edge source coding under discrete user preferences, DCA (difference of convex functions programming algorithm) based and K-means++ based algorithms are proposed to give codebook designs. For edge source coding under continuous user preferences, a sampling method is applied to give codebook designs. In addition, edge source coding is extended to the two-user case, and codebooks are elaborately designed to exploit multicasting opportunities. Both theoretical analysis and simulations demonstrate that the optimal codebook design should take user preferences into account.

## I Introduction

With the popularization of smart devices and the rise of mobile multimedia applications, network traffic has undergone explosive growth in the past decades. By using fewer bits to encode information than the original representation, data compression is an efficient technique to reduce the cost of storing or transmitting data. Traditional data compression methods are typically source-based: the major design criterion is the average compression ratio for content generated by the information source. In the era of 5G, a large number of diverse content items are generated in the network. Because each user might only be interested in a few kinds of content items, a source-based data compression algorithm probably cannot achieve a satisfactory compression ratio for these specific kinds of content items. Recently, the development of big data technologies and the popularization of smart devices have enabled us to analyze user preferences and predict user requests based on private data collected from personal handsets. It thus becomes possible to improve the efficiency of data compression according to user preferences.

Since C. E. Shannon published his famous source coding theorem [1], data compression has been widely studied and various data compression schemes have been proposed. The LZ77 algorithm was presented in [2]; it is the basis of several ubiquitous compression schemes, including ZIP and GIF. To improve the compression ratio, lossy compression algorithms for various types of content have also been developed, such as MPEG-4 for videos [3] and JPEG for images [4]. To solve the scalability problem resulting from packet classification in network services, a lossy compression based classifier was designed to reduce the classifier size [5]. Recently, considerable attention has been paid to data compression in sensor networks [6]-[8]. A lossy compression algorithm was proposed to cope with the large volumes of data generated by meters in smart distribution systems [6]. To reduce the power consumption of wireless sensors in Internet of Things applications, [7] proposed a hybrid data compression scheme in which both lossless and lossy techniques were used. For wireless sensor networks with correlated sparse sources, a complexity-constrained distributed variable-rate quantized compressed sensing method was developed in [8]. Machine learning techniques have also been applied to data compression and transmission [9]-[10]. A joint source-channel coding design for text was obtained by deep learning in [9]. End-to-end communications were implemented by neural networks in [10].

With the advent of the Internet of Things and 5G, the network edge is attracting more and more attention [11], because it can address latency requirements, bandwidth savings, and data security [12]. Edge computing has the potential to support “smart city” applications [13], improve vehicle services [14], and implement task offloading [15]. As edge nodes are closer to users, caching at the edge is capable of reducing delivery latency and network congestion. Various content delivery and caching schemes have been developed for edge networks [16]-[22]. The energy efficiency of edge caching was revealed in [19]. In [20], a learning-based method was proposed for edge caching. Furthermore, many papers have focused on improving the performance of edge caching via user preferences [23]-[26]. In edge networks, user preferences also contribute to improving streaming video services [27], the Quality-of-Experience [28], wireless resource allocations [29], and device-to-device content deliveries [30].

In this paper, we are interested in data compression at the edge under user preferences. More specifically, we consider the situation in which users are located at the network edge and are connected to content providers through a service provider. In traditional communication systems, data compression is executed at the content providers when content items are generated. In these compression processes, more common symbols should be assigned shorter codewords to minimize the expected codeword length, according to information theory [31]. However, because user interests vary across content items, a symbol that is common in the whole set of content items might not be common in the requested content items. In other words, due to user preferences, the statistical distribution of symbols on the transmission link is not likely to be identical to that of the information source. As a result, it is necessary to re-compress content items at the service provider according to user preferences. In this paper, a user preference aware lossless data compression scheme, termed edge source coding, is proposed to compress data according to finitely many codebooks at the network edge.

To obtain the optimal codebooks, we formulate an optimization problem for edge source coding, which is however nonconvex in the general case. For edge source coding with a single codebook, we solve the optimization problem via the method of Lagrange multipliers and further present an optimality condition. For edge source coding under discrete user preferences, the optimization problem reduces to a clustering problem, and DCA (difference of convex functions programming algorithm) based and K-means++ based algorithms are proposed to give codebook designs [32]. For edge source coding under continuous user preferences, a sampling based method is applied to give codebook designs. We further extend edge source coding to the two-user case and present two algorithms that reduce transmission costs by exploiting multicasting opportunities. Finally, simulation results are presented to demonstrate the potential of edge source coding and the performance of our algorithms.

The rest of the paper is organized as follows. Section II presents the content distribution system model and the concept of edge source coding. The formulation of edge source coding is then given in Section III. Section IV presents codebook designs for edge source coding. In Section V, edge source coding for two users with preference correlation is studied. Simulation results are presented in Section VI. Finally, Section VII concludes the paper and lists some directions for future research.

## II System Model

In this section, we present our system model, based on which the key idea of edge source coding is introduced.

### II-A Content Distribution System

Consider a content distribution system as shown in Fig. 1. A user is interested in content items generated by content providers (CPs) scattered around the Web. A base station (BS) serves as the service provider to satisfy the user's requests. (In this paper, the terms base station and service provider are used interchangeably.) Assume that each CP produces its content item by choosing symbols from the same alphabet and that each content item consists of symbols. The symbols of a content item are generated independently according to the same discrete distribution, denoted by , where denotes the probability that a symbol is . The vector is an attribute of that content item and will be referred to as the symbol probability vector (SPV). Note that a content item with SPV has entropy , where denotes the entropy and represents the binary logarithm. All the feasible SPVs form a set .

In the content distribution system, the user issues requests for content items. Let describe the interest of the user in the content item with SPV . Then is a probability density function with support set , and will also be referred to as the user preference. To give further insight, we present a tripartite graph model, as shown in Fig. 1. As stated before, each content item has its own probability distribution of symbols, i.e., its SPV, and the symbol distributions of two content items can be totally different. We assume that each user has an individual preference that can be characterized by the request probabilities for the various content items. In practice, the empirical request probabilities can be learned from a user's historical requests. Further, we assume that a user's preference is fixed over the considered timescale. In other words, a non-ergodic preference selection determines the preference of a user newly accessing the service provider, or more particularly, a BS. Consequently, the random symbols of the equivalent source at the edge are generated according to a simple probabilistic graphical model, as shown at the bottom of Fig. 1.

The BS responds to a user request and initiates an end-to-end transmission to satisfy it. To improve transmission efficiency, data compression techniques can be used to eliminate the statistical redundancy of the original symbol sequences. Traditionally, data compression is executed only in the application (APP) layer at the content providers. Content items are compressed according to the statistical distributions of their symbols, i.e., their SPVs, and are associated with codebooks for decoding. Edge information, including user preferences, is always ignored in traditional data compression methods, even though it has a significant impact on the statistical distributions of the symbols transmitted at the edge, as illustrated in Fig. 1. To reduce the transmission cost at the network edge, we present a user preference aware lossless data compression method, termed edge source coding, to re-compress the original symbol sequences at the service provider. As shown in Fig. 1, edge source coding exploits edge information in physical (PHY) layer transmissions.

### II-B Edge Source Coding

In edge source coding, the BS compresses the content items according to finitely many binary codebooks to satisfy the user requests. These binary codebooks are cached at both the BS and the user, so the encoded symbol sequences do not need to be accompanied by a whole codebook for decoding. Let be the total number of codebooks used in edge source coding. In the -th codebook, the symbol is represented by bits. If the BS applies the -th codebook to encode a content item with SPV , this content item can be represented by bits, where and . (Here denotes the Kullback-Leibler divergence.) To satisfy the request for this content item, the BS only needs to transmit these bits. The user can decode the received bits by trying all the cached codebooks. In the rest of this paper, we also use to represent the -th codebook. According to Kraft's inequality [31], should satisfy

(1)

to ensure decodability.
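As a concrete sketch of these quantities, the following Python snippet uses an illustrative four-symbol alphabet with hypothetical codeword lengths, checks Kraft's inequality, and verifies that the expected bits per symbol decompose into entropy plus Kullback-Leibler divergence:

```python
import math

# Hypothetical four-symbol alphabet with illustrative codeword lengths l(x).
lengths = {"a": 1, "b": 2, "c": 3, "d": 3}

# Kraft's inequality: sum_x 2^{-l(x)} <= 1 guarantees decodability.
kraft_sum = sum(2 ** -l for l in lengths.values())
assert kraft_sum <= 1.0

# For a content item with SPV p encoded by this codebook, the expected
# number of bits per symbol equals H(p) + D(p || q), where q(x) = 2^{-l(x)}.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
bits_per_symbol = sum(p[x] * lengths[x] for x in p)

entropy = -sum(px * math.log2(px) for px in p.values())
q = {x: 2 ** -l for x, l in lengths.items()}
kl = sum(p[x] * math.log2(p[x] / q[x]) for x in p)
assert abs(bits_per_symbol - (entropy + kl)) < 1e-9
```

Here the SPV happens to match the codebook's implied distribution exactly, so the divergence term vanishes and the code achieves the entropy bound; any mismatch would add the divergence as overhead.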

As there are codebooks, the minimum cost to satisfy a request for the content item with SPV is given by . Because the SPV of the requested content item obeys a probability density function , the transmission cost for a requested content item is given by

(2)

In this paper, we aim to design the codebooks to minimize the transmission cost, i.e., the expected number of bits needed to represent a requested content item.

## III User Preference Aware Compression: A Problem Formulation

In this section, we formulate an optimization problem for edge source coding and solve it for the single-codebook case. In addition, an optimality condition for the optimization problem is presented.

As stated in Subsection II-B, the codebooks can be described by . Kraft's inequality is a necessary and sufficient condition for the existence of a uniquely decodable code. We only require to obey Kraft's inequality and relax the constraint that each component of should be a negative integer power of two. Note that Eq. (2) can be rewritten as

(3)

To minimize the transmission cost, we only need to minimize in Eq. (3), which indicates the additional transmission cost per symbol under user preference due to the mismatch between SPVs and codebooks.

Let be the set of SPVs that have the smallest Kullback-Leibler divergence with codebook among all the codebooks, i.e.,

(4)

One can see that the sets are disjoint, , and for . (If a point attains the minimum for two codebooks, it can be assigned to either set arbitrarily so that the sets remain disjoint.) To minimize the transmission cost per symbol, a content item with SPV belonging to should be encoded according to codebook .
The following optimization problem gives the optimal codebook design:

(5)

In the objective function, the integral over is calculated by summing up the integrals over its subsets . Note that swapping the values of different does not change the objective value of problem (5). Thus, problem (5) has more than one optimal solution, which further implies that problem (5) is nonconvex.

Because each set depends on the codebooks, it is nontrivial to solve problem (5) in the general case. We first solve problem (5) for the single-codebook case to gain greater insight; in this case there is only one codebook, which we denote as for simplicity. Theorem 1 presents the optimal codebook design.

###### Theorem 1.

For the single-codebook case, the optimal codebook for edge source coding is given by (where represents the expectation of a random variable)

(6)

###### Proof.

Let us consider the Lagrange function

(7)

where is the Lagrange multiplier and is an -dimensional vector all of whose components are equal to 1. The partial derivative of with respect to is given by

(8)

By the optimality conditions and , the optimal codebook can be derived as in Eq. (6). ∎

In the optimal codebook, the codeword of consists of bits. Eq. (6) implies that the more frequently the requested content items contain a symbol, the shorter the codeword corresponding to this symbol will be. To improve the data compression efficiency at the network edge, user preferences must be taken into account in designing the codebook.
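To make Theorem 1 concrete, the following sketch uses a hypothetical three-symbol alphabet and a discrete preference over three SPVs (all numbers illustrative), builds the preference-weighted average SPV, and checks numerically that no other distribution yields a lower expected cost:

```python
import math

# Hypothetical discrete preference over three content items (SPVs) on a
# three-symbol alphabet; the numbers are illustrative.
spvs = [[0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1],
        [0.3, 0.3, 0.4]]
pref = [0.5, 0.3, 0.2]          # request probabilities, summing to 1

# Theorem 1: the optimal single codebook assigns symbol x a codeword of
# -log2(p_bar(x)) bits, where p_bar is the preference-weighted average SPV.
p_bar = [sum(w * p[i] for w, p in zip(pref, spvs)) for i in range(3)]
lengths = [-math.log2(px) for px in p_bar]

def expected_cost(q):
    # expected bits per symbol if every item is encoded with codebook q
    return sum(w * sum(-p[i] * math.log2(q[i]) for i in range(3))
               for w, p in zip(pref, spvs))

# No other distribution beats p_bar (Gibbs' inequality / cross-entropy bound).
best = expected_cost(p_bar)
for q in ([1/3, 1/3, 1/3], [0.5, 0.25, 0.25]):
    assert expected_cost(q) >= best - 1e-12
```

The design choice mirrors the theorem: the codebook is matched to the average symbol statistics of the *requested* items, not of the source as a whole.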

Based on Theorem 1, the following theorem gives an optimality condition for any value of the number of codebooks.

###### Theorem 2.

###### Proof.

Theorem 2 reveals the coupling between the optimal and . Note that the inequality can be expanded as

(10)

which is a linear inequality in the vector . Thus is a convex polytope characterized by the hyperplanes defined by . The vector can be given by conditional expectations over . To some extent, can be viewed as a central point of .

## IV Codebook Design for User Preference Aware Compression

In this section, we investigate codebook design for edge source coding under user preference . If the user is interested in only finitely many content items, reduces to a probability mass function and is referred to as a discrete user preference. If is a continuous probability density function over , it is referred to as a continuous user preference. Codebook designs for both discrete and continuous user preferences are considered in this section. It will be shown that the codebook designs rely on user preferences in both cases.

### IV-A DCA based Codebook Design under Discrete User Preferences

In this subsection, edge source coding under discrete user preferences is studied. Mathematically, is nonzero at only finitely many points. As a result, the edge source coding problem becomes a clustering problem. A DCA (difference of convex functions programming algorithm) based algorithm is proposed to give a codebook design.

Denote the set of nonzero points of as . We have and . Let be the probability that the content item with SPV is requested. Then, we have and . In this case, problem (5) becomes the following form:

(11)

where

(12)

Similarly, a content item with SPV in should be encoded by codebook in order to minimize the transmission cost.

Problem (11) can be viewed as a clustering problem. The points in are clustered into subsets. In the discrete case, Eq. (9) becomes

(13)

Again, Eq. (13) implies that the best codebook design should take the discrete user preferences into account. Once the optimal clustering scheme is obtained, the optimal codebooks can be derived from Eq. (13). Solving problem (11) thus amounts to finding the optimal clustering scheme. However, there are roughly clustering schemes, so it is computationally prohibitive to traverse all of them. To give a suboptimal clustering scheme within affordable space and time costs, we transform problem (11) into a DC (difference of convex functions) programming problem and present a DCA based method to obtain an appropriate clustering scheme. The core idea behind the construction of the DC program is probabilistic clustering, in which the codebook used to encode each content item is randomly selected.
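The clustering view can be sketched as follows; the instance (SPVs, request probabilities, candidate codebooks) is purely illustrative:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative instance: three SPVs with request probabilities and two
# candidate codebooks, each represented by its implied distribution q_k.
spvs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
probs = [0.4, 0.4, 0.2]
codebooks = [[0.65, 0.25, 0.10], [0.10, 0.20, 0.70]]

# Clustering view of problem (11): each SPV joins the codebook with the
# smallest KL divergence; the objective is the request-weighted sum.
assign = [min(range(len(codebooks)), key=lambda k: kl(p, codebooks[k]))
          for p in spvs]
objective = sum(w * kl(p, codebooks[k])
                for w, p, k in zip(probs, spvs, assign))
```

The two nearly identical SPVs share one codebook while the third gets the other, which is exactly the grouping behavior the clustering formulation is after.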

###### Lemma 1.

Problem (11) is equivalent to the following problem:

(14)

###### Proof.

The key to proving the equivalence between two optimization problems is to show that the optimal solution of one can easily be derived from the optimal solution of the other, and vice versa. We first show the transformation from the optimal solution of problem (11) to that of problem (14). Let us consider a probabilistic clustering scheme. For the content item with SPV , let be the probability that codebook is selected to encode it. Then problem (11) can be rewritten as

(15)

where the constraint Eq. () corresponds to the probability normalization.

The method of Lagrange multipliers can be used to simplify problem (15). Let us consider the Lagrange function

(16)

The partial derivative of with respect to is given by

(17)

Again, we have

(18)

according to the optimality conditions and . It can be seen that Eqs. (18) and (13) are very similar. If we require each to equal 0 or 1, Eq. (18) reduces to Eq. (13). This is because deterministic clustering is a special case of probabilistic clustering. Substituting Eq. (18) into problem (15) yields problem (14).

Suppose is the optimal solution of problem (11). We set

(19)

Then and form the optimal solution of problem (15), and Eq. (18) holds for and . As a result, is the optimal solution of problem (14). The transformation from the optimal solution of problem (14) to that of problem (11) is presented in the Appendix. ∎

Note that represents the probability that the content item with SPV is encoded by codebook . Lemma 1 indicates that the best probabilistic clustering is exactly a deterministic clustering. Although the constraints of problem (14) are linear, problem (14) is still intractable due to its nonconvex objective function. Notice that the objective function of problem (14) contains the logarithms of fractions. We can therefore transform the original problem (11) into a DC programming problem and apply DCA to solve it.

###### Theorem 3.

Problem (11) is equivalent to a DC programming problem having the following form:

(20)

where , for , for other values of , for , and for other values of .

###### Proof.

Note that is a weighted average of SPVs. The weights are related to the probabilities in the probabilistic clustering scheme. In problem (20), the linear equality constraint results from Eqs. (), (21), and (22). Denote and , which are two 0-1 vectors. Algorithm 1 provides a codebook design based on DCA for problem (20). Since Lemma 1 implies that each is 0-1 in the optimal solution, Algorithm 1 resets the values of in Steps 4-7 after obtaining from Algorithm 2. Algorithm 2 provides a suboptimal solution for problem (20) by iteratively solving the following convex problem:

(23)

In each iteration, is calculated in Steps 6-8. It can be seen that is the gradient of function at the point . Then is a local approximation of function ( is a constant). This is why we use to replace in problem (23). It should be pointed out that it is easy to initialize satisfying . We only need to initialize satisfying Eqs. () and () and then generate and according to Eqs. (21) and (22).

### IV-B K-means++ based Codebook Design under Discrete User Preferences

Recall that the edge source coding problem reduces to a clustering problem under discrete user preferences. Considering that K-means++ is a typical heuristic algorithm for clustering problems [32], we present a codebook design for edge source coding under discrete user preferences based on the K-means++ approach in this subsection, as detailed in Algorithm 3. In Steps 1-5, each codebook is initialized according to the probabilities of the SPVs and the Kullback-Leibler divergences with the codebooks already determined. In Steps 6-10, a variant of the K-means approach is employed to update the clustering centers . Compared with the traditional K-means++ approach, Algorithm 3 uses the Kullback-Leibler divergence instead of the Euclidean distance to reassign the points to different clusters. In addition, the center of a cluster is derived from conditional expectations in Algorithm 3, and is usually not the arithmetic mean of the points in the cluster.
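A minimal reimplementation of the K-means++ idea in KL geometry might look as follows; this is an illustrative sketch under the stated seeding and update rules, not the paper's exact Algorithm 3:

```python
import math, random

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_kmeanspp(spvs, probs, K, iters=20, seed=0):
    """Illustrative K-means++-style codebook design in KL geometry
    (a sketch, not the paper's exact Algorithm 3)."""
    rng = random.Random(seed)
    # Seeding: first center drawn by request probability, subsequent ones
    # with probability proportional to prob * KL to the nearest center.
    centers = [spvs[rng.choices(range(len(spvs)), weights=probs)[0]]]
    while len(centers) < K:
        d = [w * min(kl(p, c) for c in centers)
             for w, p in zip(probs, spvs)]
        centers.append(spvs[rng.choices(range(len(spvs)), weights=d)[0]])
    assign = [0] * len(spvs)
    for _ in range(iters):
        # Assignment step: nearest center in KL divergence.
        assign = [min(range(K), key=lambda k: kl(p, centers[k]))
                  for p in spvs]
        # Update step: each center becomes the preference-weighted average
        # of its cluster's SPVs (the conditional expectation in Eq. (13)).
        for k in range(K):
            idx = [j for j, a in enumerate(assign) if a == k]
            if not idx:
                continue
            tot = sum(probs[j] for j in idx)
            centers[k] = [sum(probs[j] * spvs[j][i] for j in idx) / tot
                          for i in range(len(spvs[0]))]
    return centers, assign
```

Note the two departures from textbook K-means++ that the text highlights: distances are KL divergences, and centers are weighted conditional expectations rather than plain arithmetic means.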

### IV-C K-means++ based Codebook Design under Continuous User Preferences

In this subsection, we extend the K-means++ based algorithm proposed in the previous subsection to continuous user preferences. Based on the coupling relationship between and revealed in Eqs. (4) and (9), Algorithm 4 presents an iterative method to obtain a suboptimal solution of problem (5); it is a continuous version of Algorithm 3. In Algorithm 4, the codebooks are also initialized based on user preferences. The parameter in Step 7 is a sufficiently small positive number. Instead of summations, Algorithm 4 calculates several integrals to update . The convergence of Algorithm 4 is guaranteed by the fact that each update of or achieves a lower objective value of problem (5). Thus, Algorithm 4 at least reaches a locally optimal point.

We illustrate the iterative process of Algorithm 4 for the case , , with uniform user preferences, i.e., . In this case, the set is formed by the points in an equilateral triangle with vertices and , as shown in Fig. 2. The codebooks and are two three-dimensional vectors, corresponding to and respectively after the -th iteration. Because there are only two codebooks, we need just a single hyperplane to split . The hyperplane is a line in this case, denoted as . Note that is not the perpendicular bisector of but is instead given by the equality . In the -th iteration, the points corresponding to the two codebooks are updated to and , which happen to be the centroids of the triangle and the quadrilateral . When the user preferences are uniform, can be obtained by

(24)

can also be obtained by a short calculation [33]. After obtaining and , the new split line can be calculated, and then and are updated.

In the general case, namely when is an arbitrary probability density function and , it is intractable to calculate integrals over , because is a convex polytope bounded by high-dimensional hyperplanes. As a result, Algorithm 4 has high computational complexity for arbitrary and large . More specifically, the integrals over the sets induce the majority of the computational complexity. In the following subsection, we overcome this difficulty with a sampling method.

### IV-D Sampling based Codebook Design under Continuous User Preferences

Algorithm 5 presents a sampling based method to give a codebook design for edge source coding under continuous user preferences. The core idea behind Algorithm 5 is that integrals over a set with a probability measure can be estimated by summations over sample points. In Algorithm 5, is the number of sample points, which are sampled according to the function . In Step 2, every sample point is associated with an identical probability . Step 3 calls Algorithm 3 or Algorithm 1 to give a suboptimal codebook design.

The number of sample points is a key parameter in Algorithm 5. On the one hand, too large a number will result in high computing time. On the other hand, too small a number will reduce the estimation accuracy of the sampling method. There are two versions of Algorithm 5, according to the algorithm called in Step 3. Simulations will demonstrate that these two versions have similar performance.
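The sampling step can be sketched as follows; the flat Dirichlet density stands in for an assumed uniform preference over the simplex, and the final clustering call is left as a hypothetical placeholder:

```python
import random

def sample_spvs(n_samples, alphabet_size, seed=0):
    """Draw SPVs from an assumed uniform preference over the simplex
    (a flat Dirichlet), as the sampling step of Algorithm 5 might do."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # A flat Dirichlet draw: normalize independent Exp(1) variates.
        g = [rng.expovariate(1.0) for _ in range(alphabet_size)]
        s = sum(g)
        samples.append([gi / s for gi in g])
    return samples

spvs = sample_spvs(1000, 3)
probs = [1.0 / len(spvs)] * len(spvs)      # identical weights (Step 2)
# Step 3 would now hand (spvs, probs) to a discrete codebook design,
# e.g. a hypothetical solver: codebooks = discrete_design(spvs, probs, K)
```

Once the continuous preference is replaced by this weighted point cloud, the problem is exactly the discrete clustering instance of Subsection IV-A, and the accuracy/complexity trade-off is controlled by the sample count.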

## V Edge Source Coding for Two Users with Common Interests

In this section, we extend edge source coding to the two-user case; the two users will be referred to as user 1 and user 2. In contrast with the scenario considered in the previous sections, two-user edge source coding is capable of reducing the total transmission cost by taking advantage of multicasting opportunities. The preferences of the two users will be described by a matrix.

Recall that represents the set of feasible SPVs. Let and denote the SPVs of the content items requested by user 1 and user 2, respectively. Then and are two random variables with support set . We denote the joint probability density function of and as . Let and be the total numbers of codebooks used by user 1 and user 2, respectively. As the two users may request the same content item, we assume there are codebooks that the two users have in common, denoted by . In other words, user 1 and user 2 have and exclusive codebooks, respectively. Let and denote the sets of exclusive codebooks for user 1 and user 2, respectively.

In two-user edge source coding, multicasting opportunities can be utilized when the two users request the same content item. In this case, the requested content item is encoded by one of the common codebooks and then transmitted to the two users simultaneously. If the two users request different content items, the BS has to compress the content item for each user separately. The total transmission cost to satisfy the user requests is given by

(25)

The minimum operations in Eq. (25) imply that each content item is encoded by the codebook with the smallest Kullback-Leibler divergence.
The first term in Eq. (25) corresponds to the transmission cost when the two users' requests are identical and the multicasting technique is used. The second and third terms correspond to the transmission costs of user 1 and user 2 when the two users' requests are different.
In this section, we aim to minimize Eq. (25) by elaborately designing the codebooks . It should be noted that if the probability that the two users request the same content item is zero, i.e., , the first term in Eq. (25) becomes 0 (here denotes the probability of an event), and multicasting opportunities cannot be created. The optimal coding scheme is then to set and design codebooks for the two users separately, which reduces to the problem considered in the previous sections. Thus, we focus in this section on the situation in which .
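The structure of the total cost in Eq. (25) can be sketched numerically; the instance below (two content items, one common and one exclusive codebook per user) is hypothetical, and codebooks are represented by their implied distributions so that cost is measured as cross-entropy per symbol:

```python
import math

def cross_entropy(p, q):
    # bits per symbol when an item with SPV p is encoded by the codebook
    # implied by distribution q; equals H(p) + D(p || q)
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def item_cost(p, books):
    # cheapest available codebook for this item (the min operations in (25))
    return min(cross_entropy(p, b) for b in books)

# Hypothetical instance: two content items and a joint preference matrix
# Q with Q[i][j] = Pr(user 1 requests item i, user 2 requests item j).
spvs = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
Q = [[0.3, 0.1],
     [0.1, 0.5]]
common = [[0.4, 0.2, 0.4]]       # shared codebooks
excl1 = [[0.7, 0.2, 0.1]]        # user 1's exclusive codebooks
excl2 = [[0.1, 0.2, 0.7]]        # user 2's exclusive codebooks

# Expected cost per symbol in the spirit of Eq. (25): one multicast with a
# common codebook when the requests coincide, two unicasts otherwise.
cost = 0.0
for i, pi in enumerate(spvs):
    for j, pj in enumerate(spvs):
        if i == j:
            cost += Q[i][j] * item_cost(pi, common)
        else:
            cost += Q[i][j] * (item_cost(pi, common + excl1)
                               + item_cost(pj, common + excl2))
```

The sketch makes the trade-off visible: common codebooks must serve both users' coinciding requests, while each exclusive codebook can specialize to one user's typical SPVs.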

To ensure , cannot be a continuous probability density function, because otherwise the Lebesgue measure of the set is 0. As discussed in Subsection IV-A, we consider to be a probability mass function. More specifically, user 1 and user 2 are interested only in different content items with SPVs . The support of reduces to , which allows us to describe the user preferences by a matrix where . The matrix will be referred to as the joint preference matrix. The trace of reflects the two users' common interests. Hence, our task becomes minimizing Eq. (25) given and . To this end, we formulate an optimization problem as follows:

(26)

which is a mixed integer program (MIP) with linear constraints. The objective of problem (26) is derived from Eq. (25) by removing a constant factor and a constant addend. Owing to its nonconvex objective, problem (26) is intractable. Two low-complexity algorithms are therefore proposed to give codebook designs.

### V-a DCA based Codebook Design for Edge Source Coding with Two Users

In this subsection, a DCA based algorithm is proposed to give a codebook design for edge source coding with two users. The major obstacle to solving problem (26) results from the minimum operations, which impose that each content item is encoded by the codebook yielding the smallest transmission cost. Again, we consider a probabilistic clustering scheme, in which content items are encoded by randomly chosen codebooks.

Let be the probability that codebook is used when the two users request the content item with SPV simultaneously. In the case where only user 1 requests the content item with SPV , we let and be the probabilities that codebooks and are used, respectively. Similarly, we define and . In the probabilistic clustering scheme, problem (26) becomes

(27)

Similar to the analysis in Subsection IV-A, problem (27) is equivalent to problem (26). The probabilistic clustering method eliminates the minimum operations without any loss of optimality.

If we fix , problem (27) becomes an optimization problem without integer variables. The method of Lagrange multipliers can then give us greater insight into problem (27). We have the following theorem.

###### Theorem 4.

The optimal probabilistic clustering satisfies

(28)

(29)