# Deep Learning for Distributed Optimization: Applications to Wireless Resource Management

###### Abstract

This paper studies a deep learning (DL) framework to solve distributed non-convex constrained optimizations in wireless networks where multiple computing nodes, interconnected via backhaul links, desire to determine an efficient assignment of their states based on local observations. Two different configurations are considered: First, an infinite-capacity backhaul enables nodes to communicate in a lossless way, thereby obtaining the solution by centralized computations. Second, a practical finite-capacity backhaul leads to the deployment of distributed solvers equipped with quantizers for communication through the capacity-limited backhaul. The distributed nature and the non-convexity of the optimizations render the identification of the solution unwieldy. To handle them, deep neural networks (DNNs) are introduced to approximate an unknown computation for the solution accurately. In consequence, the original problems are transformed to training tasks of the DNNs subject to non-convex constraints, to which existing DL libraries do not extend straightforwardly. A constrained training strategy is developed based on the primal-dual method. For distributed implementation, a novel binarization technique at the output layer is developed for quantization at each node. Our proposed distributed DL framework is examined in various network configurations of wireless resource management. Numerical results verify the effectiveness of our proposed approach over existing optimization techniques.

## I Introduction

Non-convex optimization has been a fundamental challenge in designing wireless networks owing to the distributed nature of computation and the limited cooperation capability among wireless nodes. For several decades, there have been significant efforts in non-convex programming as well as its distributed implementation [2, 3, 4, 5, 6]. Although the convergence and the optimality of these techniques have been rigorously examined, the underlying assumptions on network configurations and special mathematical structure of the objective and constraint functions are mostly ideal. For instance, a successive approximation framework [2] for a non-convex problem requires a proper convex approximation, which usually lacks a generic design technique. Moreover, distributed optimization techniques, such as the dual decomposition, the alternating direction method of multipliers (ADMM) [4], and the message-passing (MP) algorithm [5], are suitable only for separable objective and constraint functions, and the exchange of continuous-valued messages. Furthermore, the assumption of ideal cooperation among the nodes can be challenging in practical wireless networks, where any coordination among wireless nodes is usually limited by bandwidth and power constraints.

### I-A Related Works

To overcome the drawbacks of traditional optimization techniques, deep learning (DL) frameworks have been recently investigated in wireless resource management [7, 8, 9, 10, 11, 12] and end-to-end communication system design [13, 14, 15, 16, 17]. In particular, “learning to optimize” approaches in [7, 8, 9, 10, 11, 12] have received considerable attention for their potential to replace traditional optimization algorithms with neural network computations. Power control problems are addressed in multi-user interference channels (IFCs) via deep neural networks (DNNs) to maximize the sum rate performance [7]. A supervised learning technique is used to learn a locally optimal solution produced by the weighted minimum-mean-square-error (WMMSE) algorithm [18]. Its real-time computational complexity is shown to be much smaller than the original WMMSE, albeit involving intensive computations with numerous samples in training. However, the supervised learning task needs the generation of numerous training labels, i.e., the power control solution of the WMMSE algorithm, which would be a bottleneck for the DNN training step. In addition, the average sum rate performance achieved by the DNN in [7] is normally lower than the WMMSE algorithm due to the nature of the supervised learning framework.

Unsupervised learning strategies, which do not require a label set in training, have been recently applied for power control solutions in the IFC [8, 9], cognitive radio (CR) networks [10], and device-to-device communication systems [11, 12]. A recent work studies a generic resource allocation formulation in wireless networks and develops a model-free training strategy [19]. DNNs are trained to maximize the network utility directly, e.g., the sum rate performance, rather than to memorize training labels. Unlike the supervised counterpart developed in [7], training labels are not necessary. In [8], the sum rate maximization over the IFC is pursued by the use of a convolutional neural network (CNN) which accepts channel state matrices of all users as DNN inputs. The designed CNN improves a locally optimal WMMSE algorithm with much reduced computational complexity. Constrained power control problems in a CR network are addressed using a DNN so that unlicensed transmitters satisfy an interference temperature (IT) constraint for a licensed receiver [10, 11]. However, training DNNs with the IT constraint is not straightforward since existing DL optimizers are not suitable for handling constrained problems. To resolve this issue, a penalizing strategy is applied to transform the original training task to an unconstrained one by augmenting a penalty term associated with the IT constraint into the network utility function [10, 11]. Nevertheless, the feasibility of the trained DNN is, in general, not guaranteed since it is sensitive to the choice of penalty parameters. Additional optimization of penalty parameters is required for fine-tuning the contribution of the penalizing function in the DNN cost function. The identification of such hyperparameters incurs a computationally expensive search in typical DL applications. Therefore, the method would not be able to obtain the solution of general constrained optimizations in practical wireless communication systems.

Most studies on the DL approach to wireless network optimization require global network information, such as perfect knowledge of channel state information (CSI) at other nodes, and thus the centralized computation process is essential for the operation of the DNN [7, 10, 11, 12, 19]. This is not practical owing to the limited cooperation capacity among nodes. A simple distributed implementation is investigated in [8] for the IFC. Unknown other-cell CSI inputs are replaced with zeros in testing the DNN, which is trained in a centralized manner using the global CSI of all transmitters. Owing to its heuristic operation and the lack of other-cell CSI knowledge, this simple distributed DNN technique shows a noticeable performance degradation as compared to existing optimization algorithms [8]. An on-off power control problem in the IFC is addressed with individual transmitters equipped with their own DNN for the distributed computation in [9]. To enhance the distributed computational capability, the DNN is constructed to accept the estimated CSI for other cells and to yield the on-off transmission strategy of each transmitter. A dedicated DNN is distributed to an individual transmitter after combined training of multiple DNNs. However, it lacks the optimization of the CSI estimation and exchange process, which is a key feature of the distributed network construction. Furthermore, constrained optimization cannot be handled by the methods developed in [8] and [9]. Therefore, the distributed DL-based design for wireless networks remains open.

### I-B Contributions and Organization

This paper investigates a DL-based approach that solves generic non-convex constrained optimization problems in distributed wireless networks. The developed technique establishes a general framework which can include the works in [7, 8, 9, 10, 11, 12] as special cases. Compared to a supervised learning technique [7], the proposed DL framework is based on an unsupervised learning strategy that does not require training labels, i.e., the predetermined optimal solutions. There are multiple nodes which desire to minimize the overall network cost by optimizing their networking strategy subject to several constraints. An individual node observes only a local measurement, e.g., the local CSI, and quantizes it for forwarding to neighbors through capacity-limited backhaul links. The quantized information of other nodes can further improve the distributed computing capability of individual nodes. To achieve the network cost minimization, a distributed solver as well as a quantizer of the local information are necessary. This involves a highly complicated formulation with non-convex binary constraints.

To handle this issue, a DNN-based optimization framework is proposed for two different cases according to the capacity of the backhaul links. An ideal infinite-capacity backhaul is first taken into account where lossless data sharing is allowed among the nodes. In this configuration, individual nodes can determine their solutions by themselves using the perfect global information collected from other nodes. Thus, a single central DNN unit is designed to produce the overall network solution by collecting the global information as an aggregated input. However, state-of-the-art DL algorithms are mostly intended for unconstrained problems and lack the feature to include constraints in the DNN training. The Lagrange dual formulation is employed to accommodate constrained DNN training problems where the strong duality [6] is verified. To train the DNN efficiently under generic network constraints, this work bridges a gap between the primal-dual method in traditional optimization theory and the stochastic gradient descent (SGD) algorithm in the DL technique [20]. In consequence, a constrained training strategy is developed such that it performs iterative updates of the DNN and the dual variables via state-of-the-art SGD algorithms. Unlike the unconstrained DL works in [7, 8, 9, 10, 11, 12], the proposed constrained training algorithm ensures that an efficient feasible solution is produced for arbitrarily given constraints.

In a more realistic setup of a finite-capacity backhaul, a distributed deployment of the DL framework is addressed. To be specific, an individual node is equipped with two different DNN units: a quantizer unit and an optimizer unit. The quantizer unit generates a quantized version of the local information so that a capacity-limited backhaul can afford the communication cost for the transfer of the data to a neighboring node. Meanwhile, the optimizer unit evaluates a distributed solution of an individual node based on the local information along with the quantized data received from other nodes. This distributed DL approach involves additional binary constraints for the quantization unit, resulting in a so-called vanishing gradient issue during the DNN training step. Such an impairment has been dealt with by a stochastic operation and a gradient estimation in image processing [21, 22]. As an extension of these studies, we apply a binarization technique to the quantizer unit to address a combinatorial optimization problem with binary constraints. An unbiased gradient estimation of a stochastic binarization layer is developed for the quantizer unit to generate a bipolar message. As a result, the quantizer and optimizer units of all nodes can be jointly trained via the proposed constrained training algorithm. Then, the real-time computation of the trained DNN units can be implemented in a distributed manner, which is not possible for the DL techniques in [7, 10, 11, 12, 19], since they all require the perfect knowledge of other cells’ CSI. Finally, the proposed DL framework is verified with several numerical examples in cognitive multiple access channel (C-MAC) and IFC applications.

This paper is organized as follows: Section II explains the system model for generic distributed wireless networks. The DL framework for the centralized approach is proposed in Section III, and the distributed DNN implementation is presented in Section IV. In Section V, application scenarios of the proposed DL framework are examined and its numerical results are provided. Finally, Section VI concludes the paper.

Notations: We employ uppercase boldface letters, lowercase boldface letters, and normal letters for matrices, vectors, and scalar quantities, respectively. A set of -by- real-valued matrices and a set of length- bipolar symbol vectors are represented as and , respectively. Let denote the expectation over random variable . In addition, and account for the all zero and all one column vectors of length , respectively. In the sequel, we use subscripts , , and to denote quantities regarding centralized, distributed, and quantization operations, respectively. Finally, and stand for the gradient and the subgradient of with respect to evaluated at , respectively.

## II General Network Setup and Formulation

Fig. 1 illustrates a generic distributed network where nodes desire to minimize the overall system cost in a distributed manner. Node observes its own local information vector of length (e.g., channel coefficient) and computes the optimal state (or the solution) vector of length (e.g., the resource allocation strategy). The nodes are interconnected among one another via the backhaul realized by wired or wireless links so that the local observation at node is shared with neighboring node (). The capacity of the backhaul link from node to node is limited by in bits/sec/Hz. In practice, a wireless backhaul link is implemented with a reliable control channel orthogonal to wireless access links and dedicated for sharing the control information over the network. Thus, the capacity is fixed and known in advance. The corresponding configuration is formulated in a cost minimization as

(P): | ||||

(1) | ||||

(2) |

where and stand for the collection of the local information and the solution of all nodes, respectively. In addition, is a network cost function for given configuration . Here, (1) indicates a long-term design constraint expressed by an inequality with the upper bound . A class of a long-term design constraint includes the average transmit power budget and the average quality-of-service constraints at users evaluated over fast fading channel coefficients [23, 24]. A convex set in (2) denotes the set of feasible assignments for node . The cost function as well as constraint are assumed to be differentiable but are possibly non-convex.
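The structure described above can be written generically as follows. This is only a sketch in assumed notation: $\mathbf{a}$ for the stacked local observations, $\mathbf{x}$ for the stacked solution with per-node blocks $\mathbf{x}_i$, $f$ for the network cost, $g_k$ and $G_k$ for the $k$-th long-term constraint function and its upper bound, $\mathcal{X}_i$ for the convex feasible set of node $i$, and $K$ and $N$ for the numbers of constraints and nodes. These symbols are illustrative and may differ from the original typesetting.

```latex
\begin{align}
\text{(P):}\quad \min_{\mathbf{x}}\ \ & \mathbb{E}_{\mathbf{a}}\big[f(\mathbf{a},\mathbf{x})\big] \notag\\
\text{s.t.}\ \ & \mathbb{E}_{\mathbf{a}}\big[g_k(\mathbf{a},\mathbf{x})\big] \le G_k,\quad k=1,\dots,K, \tag{1}\\
& \mathbf{x}_i \in \mathcal{X}_i,\quad i=1,\dots,N. \tag{2}
\end{align}
```

In this reading, (1) constrains averages over the observation distribution, whereas (2) must hold for every realization.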

We desire to optimize the network performance averaged over the observation subject to both long-term and instantaneous regulations. The long-term performance metrics are important in optimizing a wireless network under a fast fading environment where the quality of communication service is typically measured by the average performance, e.g., the ergodic capacity and the average power consumption [25]. Special cases of (P) have been widely studied in various network configurations such as multi-antenna multi-user systems [23], CR networks [24], wireless power transfer communications [26], and proactive eavesdropping applications [27]. Although these studies have successfully addressed the global optimality in their formulations, no generalized framework has been developed for solving (P) with guaranteed optimality. Furthermore, existing distributed optimization techniques, such as ADMM and MP algorithms, typically rely on iterative computations which may require sufficiently large capacity backhaul links for exchanging high-precision continuous-valued messages among nodes. This work proposes a DL framework for solving (P) in two individual network configurations. One ideal case is first addressed where backhaul links among the nodes have infinite capacity. The other case corresponds to a more realistic scenario with finite-capacity backhaul links where full cooperation among nodes is not possible.

### II-A Formulation

In the infinite-capacity backhaul link case, node forwards its local observation to other nodes without loss of the information. Thus, the global observation vector is available to all nodes. With the perfect global information at hand, each node can directly obtain the global network solution of (P) from a centralized computation rule given by

(3) |

i.e., the solution is a function of global information since a node exploits all available information for computing it. The associated optimization (P) is rewritten into (P1) as

(P1): | |||

where is a column vector with ones from -th entry to -th entry and the remaining entries equal to zero for masking the state associated with node from the global solution.

In the finite-capacity backhaul case with , node first discretizes its local measurement to obtain a bipolar quantization , which is transferred to node . Without loss of generality, it is assumed that is an integer. (Footnote 1: If is not an integer, -bit quantization is applied.) The quantization at node is represented by

(4) |

where and . By collecting the information of obtained locally and transferred from neighbors, node calculates its state via a certain computation rule as

(5) |

where is the concatenation of and . Since (5) only requires the knowledge of the local observation and the distributed computation, it is referred to as a distributed approach. The distributed realization consists in jointly designing bipolar quantization (4) and distributed optimization (5) for all nodes. The optimization (P) can be refined as

(P2): | ||||

(6) |

Note that the binary constraint in (6) incurs a computationally demanding search for an efficient quantization associated with sample . Thus, solving (P2) by traditional optimization techniques is not straightforward in general.

### II-B Learning to Optimize

To address constrained optimization problems (P1) and (P2), a DL framework that identifies unknown functions in (3)-(5) is developed. The basics of a feedforward DNN with fully-connected layers are introduced briefly. Let be a DNN with hidden layers that maps input vector to output with a set of parameters . Subsequently, denotes -dimensional output of layer , for , and can be expressed by

(7) |

where and account for the weight matrix and the bias vector at layer , respectively. An element-wise function is the activation at layer . The set of parameters is defined by . The input-output relationship is specified by the successive evaluation of layers in (7). The training step determines the DNN parameter such that the cost function of the DNN is minimized. The activation of the DNN computation in (7) typically involves non-convex operations, and thus a closed-form expression is not available for the optimal parameter set . State-of-the-art DL libraries employ a gradient descent (GD) method and its variants for stochastic optimization [20]. The details of the proposed DNN construction and the corresponding training methods are described in the following sections.
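The successive layer evaluation in (7) can be illustrated with a minimal NumPy sketch. The helper names `init_params` and `forward`, the layer widths, and the choice of ReLU hidden activations with a sigmoid output are assumptions made for illustration, not the configuration used in the paper.

```python
import numpy as np

def init_params(layer_dims, rng):
    """Create one (W, b) pair per layer; layer_dims = [input, hidden..., output]."""
    return [(rng.standard_normal((n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
            for n_in, n_out in zip(layer_dims[:-1], layer_dims[1:])]

def forward(params, x, activations):
    """Evaluate each layer in turn: h_l = sigma_l(W_l h_{l-1} + b_l), with h_0 = x."""
    h = x
    for (W, b), sigma in zip(params, activations):
        h = sigma(W @ h + b)
    return h

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
params = init_params([4, 16, 16, 2], rng)          # two hidden layers of width 16 (assumed)
y = forward(params, rng.standard_normal(4), [relu, relu, sigmoid])
```

The parameter set of the network is the list of all weight matrices and bias vectors, which is what a training routine would update.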

DL techniques have been intensively investigated to solve optimization challenges in wireless networks such as power control and scheduling problems over the IFC [8, 9, 7], CR networks [10], and device-to-device communications [11, 12]. However, the long-term constraint (1) has not been properly considered in the DNN training. In addition, the existing techniques are mostly intended for a centralized solution, which is not necessarily feasible in practice. Several heuristics for the distributed realization have been provided for the IFC in [8] and [9]. Those techniques, however, lack an optimized quantization, which is a crucial feature of the distributed network for communication cost saving. It is still not straightforward to obtain efficient solutions of (P1) and (P2) with the existing DL techniques. In the following sections, a constrained training algorithm is first developed for solving (P1) with the long-term constraints based on the primal-dual method [28, 6]. Subsequently, a binarization operation addresses the binary constraint (6) in the distributed approach.

## III Centralized Approach

This section develops a DL framework that solves the constrained optimization in (P1) with the DNN that evaluates centralized computation (3). To this end, unknown function is replaced by a DNN as

(8) |

The following proposition assesses the quality of an approximation for by the DNN in (8) based on the universal approximation theorem [29], which investigates the existence of the DNN with an arbitrary worst-case approximation error over the training set .

###### Proposition 1.

Let be a continuous function defined over a compact set . Suppose that the DNN with hidden layers is constructed with sigmoid activations. Then, for any , there exists a sufficiently large number such that

(9) |

Proposition 1, which is based on the universal approximation theorem [29], implies that for a given set , there exists a parameter set such that the associated DNN of the structure in (8) can approximate any continuous function with an arbitrarily small positive error . This holds both for supervised and unsupervised learning problems, regardless of the convexity of cost and constraint functions. Therefore, the unknown optimal computation process for the non-convex problem (P1) can be characterized by a well-designed DNN . Proposition 1 only states the existence of satisfying (9) rather than a specific identification method. To determine , the DNN is trained such that it yields an efficient solution for (P1).

###### Remark 1.

Recently, it was shown in [30] that, for any and Lebesgue-integrable function , i.e., , there exists a fully-connected DNN with rectified linear unit (ReLU) activations that satisfies The width of , i.e., the maximum number of nodes in hidden layers, is bounded by , and the upper bound for the number of hidden layers is given by . From (4), it can be concluded that the DNN can act as a universal approximator for a non-continuous and Lebesgue-integrable function, e.g., an indicator function. This allows a DNN to learn discrete mappings in wireless resource management including the quantization operation in (4).

Next, we construct the DNN (8) for efficiently solving (P1) under the deterministic constraint . Suppose that the DNN consists of hidden layers and parameter set . The dimension, the activation functions, and the number of hidden layers of the DNN are hyperparameters to be optimized via validation. By contrast, the dimension of the output layer is fixed to for obtaining the solution . To satisfy the masked constraint , the activation is set to a projection operation onto the convex set

(10) |

where is the output vector of hidden layer of the DNN . Since a convex projection can be realized with linear operations, the gradient of the activation in (10) that solves the convex projection problem is well-defined [31], and the backpropagation works well with this activation. The feasibility of such a projection layer has been verified in [15]. Plugging (8) and (10) into (P1) leads to

(P1.1): | ||||

(11) |

where the constraint is lifted by the convex projection activation (10). In (P1.1), the optimization variable is now characterized by the DNN parameter set , which can be more efficiently handled by state-of-the-art DL libraries including TensorFlow as compared to the direct determination of unknown function . One major challenge in (P1.1) is the integration of the constraints in (11) into the training task of the DNN parameter set . It is not straightforward to address this issue by existing DL training algorithms which originally focus on unconstrained training applications. Recent DL applications in wireless communications mostly resort to a penalizing technique which transforms the problem into an unconstrained training task by adding an appropriate regularization term [10, 11, 16]. However, the feasibility of the trained DNN is not ensured analytically with an arbitrary choice of a hyperparameter on the regularization. Therefore, the hyperparameter must be carefully adjusted via the evaluation of the validation performance during the DL training. The hyperparameter optimization is typically carried out through a trial-and-error search which often incurs computationally demanding calculations.
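To make the projection activation in (10) concrete, the sketch below shows Euclidean projections onto two convex sets that often arise in resource allocation. Both sets (a box, e.g., a per-entry power range, and an L2 ball) and the function names are assumptions chosen for illustration; the source only requires the feasible set of each node to be convex.

```python
import numpy as np

def project_box(z, lower, upper):
    """Euclidean projection onto the box {x : lower <= x <= upper} (element-wise clip)."""
    return np.clip(z, lower, upper)

def project_l2_ball(z, radius):
    """Euclidean projection onto the ball {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(z)
    if norm <= radius:
        return z                      # already feasible, left unchanged
    return z * (radius / norm)        # rescale onto the boundary

z = np.array([-1.0, 0.5, 3.0])        # hypothetical output of the last hidden layer
x_box = project_box(z, 0.0, 1.0)
x_ball = project_l2_ball(np.array([3.0, 4.0]), 1.0)
```

Used as the output activation, either map returns a point that satisfies the instantaneous constraint by construction, so only the long-term constraints in (11) remain to be handled during training.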

To overcome this issue, a Lagrange duality method [6] is applied to include the constraints of (P1.1) in the training step. The non-convexity of cost and constraint functions with respect to may result in a positive duality gap. Nevertheless, the following proposition verifies that (P1.1) fulfills the time-sharing condition [32], which guarantees the zero duality gap.

###### Proposition 2.

Let and be the minimum and the maximum achievable upper bounds of the constraint (11), respectively. Suppose that there exists an arbitrarily small number . Consider two distinct achievable upper bounds and such that . We define and as the optimal solutions of (P1.1) with the upper bound being replaced by and , , respectively. Then, for an arbitrary constant , we can find a feasible DNN parameter set which satisfies

(12) | ||||

(13) |

###### Proof:

The proof proceeds similarly to that of [32, Theorem 2]. Let be a parameter corresponding to the time-sharing point between and , i.e., by setting for fraction and for the remaining fraction [32]. It is obvious that the equality in (12) holds for such a configuration and the constraints in (13) are also satisfied. As a result, it is concluded that (P1.1) fulfills the time-sharing condition. ∎

Proposition 2 addresses the existence of a feasible satisfying the time-sharing condition in (12) and (13), for , . Let be the optimal value of (P1.1) with a nontrivial upper bound , . Then, Slater’s condition holds for (P1.1), i.e., for any with arbitrarily small , there is a strictly feasible such that , . The analysis in [32] and [33], which combines the time-sharing property in Proposition 2 with Slater’s condition, indicates that the optimal objective value is convex in the nontrivial regime and ensures the strong duality for the non-convex problem (P1.1) [6]. Therefore, the Lagrange duality method can be employed to solve (P1.1). The Lagrangian of (P1.1) is formulated as

where a nonnegative corresponds to the dual variable associated with each constraint in (11), and their collection is denoted by . The dual function is then defined by

and the dual problem is written as

(14) |
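The Lagrangian, the dual function, and the dual problem described above take the following generic form. This is a sketch in assumed notation: $\theta$ for the DNN parameter set, $\phi(\mathbf{a};\theta)$ for the DNN output given the global observation $\mathbf{a}$, $f$ for the cost, $g_k$ and $G_k$ for the $k$-th long-term constraint function and its bound, and $\boldsymbol{\lambda}=\{\lambda_k\}$ for the dual variables. The symbols are illustrative and may differ from the original typesetting.

```latex
\begin{align}
\mathcal{L}(\theta,\boldsymbol{\lambda})
  &= \mathbb{E}_{\mathbf{a}}\big[f(\mathbf{a},\phi(\mathbf{a};\theta))\big]
   + \sum_{k=1}^{K}\lambda_k\Big(\mathbb{E}_{\mathbf{a}}\big[g_k(\mathbf{a},\phi(\mathbf{a};\theta))\big]-G_k\Big), \notag\\
\mathcal{G}(\boldsymbol{\lambda})
  &= \min_{\theta}\ \mathcal{L}(\theta,\boldsymbol{\lambda}), \notag\\
  &\ \max_{\boldsymbol{\lambda}\,\ge\,\mathbf{0}}\ \mathcal{G}(\boldsymbol{\lambda}). \tag{14}
\end{align}
```

Under strong duality, the optimal value of the dual problem coincides with that of (P1.1), which is what justifies optimizing the DNN parameters and the dual variables together.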

To solve the problem in (14), the primal-dual method is employed to perform iterative updates between primal variable and dual variable [28]. (Footnote 2: A recent work studies a primal-dual method for the constrained training technique based on independent analysis for the strong duality [19]. However, the algorithm in [19] requires additional gradient computation to update auxiliary variables.) For convenience of representation, let denote the value of evaluated at the -th iteration of the update. The primal update for can be calculated by the GD method as

(15) |

where a positive represents the learning rate. The gradient for a function in is given by the chain rule as

The dual update for of the dual problem (14) is determined by utilizing the projected subgradient method [28] as

(16) |

Proposition 1 leads to a DNN structure which can compute the dual variable via the update rule in (16). Based on (15) and (16), the DNN parameter and the dual solution can be jointly optimized by a single GD training computation. To evaluate the expectations in (15) and (16) in practice, we adopt the mini-batch SGD algorithm, which is a powerful stochastic optimization tool for DL [20], to exploit the parallel computing capability of general-purpose graphics processing units (GPUs). At each iteration, mini-batch set of size is either sampled from training set or generated from the probability distribution of global observation vector , if available. The update rules in (15) and (16) are replaced by

(17) | ||||

(18) |

Note in (17) and (18) that the average cost and constraint functions are approximated by sample means over the mini-batch set . The number of mini-batch samples is chosen to be sufficiently large for accurate primal and dual updates. Its impact is discussed via numerical results in Section V.
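The alternating mini-batch updates can be sketched on a toy problem. Everything specific here is an assumption made for illustration: a scalar observation drawn uniformly from [0.5, 1.5], a two-parameter sigmoid model standing in for the DNN, the cost -log(1 + a x), a single average-power constraint E[x] <= G, and the step sizes. The sketch also reuses one forward pass for both the primal and the dual step, which may differ from the exact ordering in (17) and (18).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_primal_dual(G=0.3, lr_primal=0.1, lr_dual=0.05,
                      batch=256, iters=8000, seed=0):
    """Toy primal-dual training: minimize E[-log(1 + a x)] s.t. E[x] <= G."""
    rng = np.random.default_rng(seed)
    w, b, lam = 0.0, 0.0, 0.0                     # model parameters and dual variable
    for _ in range(iters):
        a = rng.uniform(0.5, 1.5, size=batch)     # mini-batch of observations
        x = sigmoid(w * a + b)                    # model output, always inside (0, 1)
        # Derivative of the per-sample Lagrangian f + lam * (g - G) w.r.t. x,
        # with f = -log(1 + a x) and g = x.
        dL_dx = -a / (1.0 + a * x) + lam
        dL_dz = dL_dx * x * (1.0 - x)             # chain rule through the sigmoid
        # Primal step: gradient descent on the sample-mean Lagrangian.
        w -= lr_primal * np.mean(dL_dz * a)
        b -= lr_primal * np.mean(dL_dz)
        # Dual step: projected ascent on the sample-mean constraint violation.
        lam = max(0.0, lam + lr_dual * (np.mean(x) - G))
    return w, b, lam

w, b, lam = train_primal_dual()
a_test = np.random.default_rng(1).uniform(0.5, 1.5, size=10000)
avg_x = sigmoid(w * a_test + b).mean()            # average constraint value on unseen samples
```

No penalty weight is tuned by hand in this sketch: the dual variable grows while the sample-mean constraint is violated and shrinks otherwise, which plays the role that the penalty hyperparameter has in the penalizing approach.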

Algorithm 1 summarizes the overall constrained training algorithm. Unlike the supervised learning method in [7] which trains the DNN to memorize given labels obtained by solving the optimization explicitly, the proposed algorithm employs an unsupervised learning strategy where the DNN learns how to identify an efficient solution to (P1.1) without any prior knowledge. In addition to the DNN parameter , the dual variables in (14) are optimized together, and thus the primal feasibility for (11) is always guaranteed upon convergence [28, 6]. Once the DNN is trained from Algorithm 1, the parameter is configured at nodes. Subsequently, node computes its solution using with the collection of local observations received from other node . The performance of the trained DNN is subsequently evaluated with testing samples which are unseen during the training.

## IV Distributed Approach

The DL method proposed in Section III relies on the perfect knowledge of observations from other nodes, which is realized by the infinite-capacity backhaul links among nodes. This section presents a DL-based distributed approach for (P2) by jointly optimizing the quantization (4) and the distributed optimization (5). Similar to the centralized approach, DNNs produce and that approximate (4) and (5) with , respectively. The universal approximation property in Proposition 1 secures the accurate description of the optimal solution and for (P2) with DNNs and , respectively. As illustrated in Fig. 2(a), a node is equipped with two different DNNs, i.e., the quantizer unit and the distributed optimizer unit . This results in the joint training of a total of DNNs for distributed implementation. For notational convenience, the collection of the optimizer units from all nodes is denoted by , where and . Subsequently, the distributed optimization task (P2) can be transformed as

(P2.1): | ||||

(19) | ||||

(20) |

where the deterministic convex constraint is lifted by the convex projection activation in (10) at the output layer of each optimizer unit . Note that it is very difficult to enforce the non-convex binary constraint in (20) with a projection activation, such as the signum function , since a well-known vanishing gradient issue occurs in the DNN training step [20]. To see this more carefully, can be chosen for the activation at the output layer of the quantizer DNN to yield binary output vector . However, it has a null gradient over the entire input range. Consequently, the quantizer DNN parameter is not trained at all by the SGD algorithm. Thus, such a deterministic binarization activation does not allow the DNNs to be trained effectively by using existing DL libraries.

### IV-A Stochastic Binarization Layer

Fig. 2(b) presents the proposed stochastic binarization layer that enables an SGD-based training algorithm while satisfying the binary constraint in (20). For binarization, the hyperbolic tangent function is used as the activation at the output layer of the quantizer DNN to restrict the output within a -dimensional square box, i.e., . Since the output is still a continuous-valued vector, the final bipolar output is produced via a stochastic operation on continuous-valued vector as

(21) |

where the -th element of is a random variable given by

(22) |

where corresponds to the -th element of and determines the probability that variable is mapped to one at the output. Note that (21) and (22) guarantee a bipolar output . The stochastic operation in (21) can be regarded as a binarization operation of the unquantized vector , and corresponds to the quantization noise. In (22), any function of nonzero gradient can be chosen for a candidate of . This work employs an affine function for leading to the mean of noise equal to zero as

(23) |

where denotes the element-wise multiplication. The vanishing mean in (23) facilitates the back propagation implementation of the stochastic binarization layer (21).
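The forward operation of the stochastic binarization layer can be sketched as follows. The labels `u` (unquantized hyperbolic tangent output), `v` (bipolar output), and `q` (quantization noise) are illustrative names, and the affine probability `(1 + u) / 2` is an assumption chosen because it is the affine map that yields zero-mean noise on the interval from -1 to 1.

```python
import math
import random

def stochastic_binarize(u, rng):
    """Map each u_l in [-1, 1] to +1 with probability (1 + u_l) / 2, else -1.

    Equivalently v = u + q, where the noise q_l takes the value 1 - u_l or
    -1 - u_l. Its mean is
        (1 - u_l)(1 + u_l)/2 + (-1 - u_l)(1 - u_l)/2 = 0.
    """
    return [1.0 if rng.random() < (1.0 + ul) / 2.0 else -1.0 for ul in u]

rng = random.Random(0)
pre_activation = [-1.2, 0.0, 0.4, 2.0]      # hypothetical quantizer pre-activations
u = [math.tanh(z) for z in pre_activation]  # unquantized vector inside the box

v = stochastic_binarize(u, rng)             # one bipolar realization

# Averaging many realizations should recover u, since the noise is zero-mean.
num_draws = 20000
mean_v = [0.0] * len(u)
for _ in range(num_draws):
    sample = stochastic_binarize(u, rng)
    mean_v = [m + s / num_draws for m, s in zip(mean_v, sample)]

print(v)
print([round(m, 3) for m in mean_v], [round(ul, 3) for ul in u])
```

Every realization is bipolar, while the sample average approaches the unquantized vector as the number of draws grows.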

It remains to find the gradient of the stochastic binarization in (21), i.e., . No closed-form formula is available due to the stochastic noise . To resolve this, a gradient estimation idea [21, 22] is adopted such that the gradient of the stochastic operation in (21) is approximated by the gradient of its expectation over the quantization noise distribution, i.e., . The gradient is derived from (23) as

(24) |

Here, the zero-mean quantization noise replaces the gradient of with a deterministic gradient of the unquantized vector , which corresponds to the output of the hyperbolic tangent activation with a non-vanishing gradient. In addition, (24) reveals that the bipolar vector is an unbiased estimator of the unquantized vector , i.e., . Note from (21) and (24) that, during the training step, the operation of the stochastic binarization layer differs between the forward pass and the backward pass. In the forward pass, where the DNN calculates the output from the input and, subsequently, the cost function, the binarization layer of the quantizer unit passes the actual quantization vector in (21) to the optimizer unit, so that the exact values of the cost and constraint functions in (P2.1) are evaluated under the binary constraint (20). On the other hand, in the backward pass, the unquantized value of the binary vector is delivered from the output layer to the input layer of , since the gradient of is replaced with the gradient of its expectation in (24). As depicted in Fig. 2(b), the backward pass can be simply implemented with the hyperbolic tangent activation at the output layer.
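The asymmetry between the two passes can be illustrated with a scalar example. In this minimal sketch, the forward pass evaluates a loss on the sampled bipolar value, while the backward pass treats the binarization as an identity and differentiates only through the hyperbolic tangent. The squared-error loss, the `target` value, and the function names are assumptions made purely for illustration.

```python
import math
import random

def forward(z, target, rng):
    """Forward pass: tanh, stochastic binarization, then a toy loss."""
    u = math.tanh(z)                                       # unquantized value
    v = 1.0 if rng.random() < (1.0 + u) / 2.0 else -1.0    # bipolar value
    loss = (v - target) ** 2                               # loss sees the binary value
    return u, v, loss

def backward(u, v, target):
    """Backward pass: the gradient of v is replaced by that of E[v] = u."""
    dloss_dv = 2.0 * (v - target)   # evaluated at the actual binary output
    dv_du = 1.0                     # d E[v] / d u, since E[v] = u
    du_dz = 1.0 - u ** 2            # derivative of tanh
    return dloss_dv * dv_du * du_dz

rng = random.Random(1)
z, target = 0.3, 0.5                # hypothetical pre-activation and target
u, v, loss = forward(z, target, rng)
grad_z = backward(u, v, target)

print(u, v, loss, grad_z)
```

The resulting gradient is nonzero for any finite pre-activation, in contrast to a deterministic signum activation.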

### IV-B Centralized Training and Distributed Implementation

The joint training strategy of the optimizer and quantizer units is investigated here. The stochastic binarization layer properly handles the constraint in (20) of (P2.1) without loss of optimality. Since the time sharing condition in Proposition 2 still holds for (P2.1), the primal-dual method in Section III is applied to train the DNNs for (P2.1). At the -th iteration of the SGD training algorithm, the mini-batch updates for the DNN parameter and the dual variable associated with (19) can be written, respectively, as

(25) | ||||

(26) |

where

for a function . By (25) and (26), Algorithm 1 can also be used for the joint training task of (P2.1). Note that the training of (P2.1) is an offline task performed in a centralized manner. The real-time computations at node are carried out by the individual DNNs and using the trained parameters stored in memory units. Therefore, during operation, node can calculate its solution in a distributed manner. In practice, a backhaul link is realized by a reliable control channel, and the set of candidates for is determined by service providers. In this setup, multiple DNNs are trained in advance, one for each candidate of . This allows each wireless node to select a suitable DNN from its memory unit for the given backhaul capacity .
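The structure of the mini-batch primal-dual updates can be sketched on a toy scalar problem. The sketch assumes a minimization form with a single constraint: minimize the expected value of `(theta - a)^2` subject to `theta <= G`, where `a` has mean 2 and `G = 1`. The sign conventions, the step sizes, and the toy cost are assumptions for illustration; they are not taken from (25) and (26), and a maximization form would flip the sign of the primal step.

```python
import random

def primal_dual_train(num_iters=5000, batch_size=32,
                      lr_theta=0.01, lr_lam=0.01, seed=0):
    """Stochastic primal-dual updates for a toy constrained problem.

    Lagrangian: E[(theta - a)^2] + lam * (theta - G), with lam >= 0.
    """
    rng = random.Random(seed)
    theta, lam = 0.0, 0.0   # primal parameter and dual variable
    G = 1.0                 # constraint threshold
    for _ in range(num_iters):
        # Fresh mini-batch of observations with mean 2.
        batch = [1.0 + rng.expovariate(1.0) for _ in range(batch_size)]
        # Mini-batch gradient of the cost with respect to theta.
        grad_f = sum(2.0 * (theta - a) for a in batch) / batch_size
        grad_g = 1.0        # gradient of the constraint function g = theta
        # Primal step: gradient descent on the Lagrangian.
        theta -= lr_theta * (grad_f + lam * grad_g)
        # Dual step: projected gradient ascent keeps lam non-negative.
        lam = max(0.0, lam + lr_lam * (theta - G))
    return theta, lam

theta, lam = primal_dual_train()
print(theta, lam)
```

For this toy problem, the constrained optimum is `theta = 1` with dual variable `lam = 2`, and the iterates settle near these values up to mini-batch noise.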

###### Remark 2.

One can consider a distributed training strategy for (P2.1) where the DNNs deployed at individual wireless nodes are trained in a decentralized manner by exchanging relevant information for the training. To this end, individual DNNs forward their computation results with training samples, such as the cost function and the gradients of the weights and biases, to both the preceding and the succeeding layers. This results in a huge communication burden at each training iteration. Therefore, the proposed distributed deployment of jointly trained DNNs is practical for wireless nodes interconnected via capacity-limited backhaul links.

## V Applications to Wireless Resource Allocation

In this section, numerical results are presented to demonstrate the efficacy of the proposed DL approaches for solving (P) in various wireless network applications. We consider two networking configurations: the C-MAC and the IFC systems. Such configurations are considered key enablers of next-generation communication networks such as LTE unlicensed and multi-cell systems. For each configuration, power control is managed by the proposed DL strategies. In the C-MAC system, the optimality of the proposed DL methods, which is established analytically in Propositions 1 and 2, is verified numerically against the optimal solution provided in [24] in both the primal and dual domains. Next, the sum-rate maximization in the IFC, known to be non-convex, is addressed. The performance of the trained DNNs is tested and compared with a locally optimal solution obtained by the WMMSE algorithm [18]. Finally, the proposed DL framework tackles the maximization of the minimum-rate performance in the IFC system. These results demonstrate the viability of the DL technique in handling optimization problems with non-smooth objectives and constraints without additional reformulation.

### V-A Application Scenarios

#### V-A1 Cognitive multiple access channels

A multi-user uplink CR network [24] is considered. A group of secondary users (SUs), which share time and frequency resources with a licensed primary user (PU), desire to transmit their data simultaneously to a secondary base station (SBS). It is assumed that the PU, the SBS, and the SUs are all equipped with only a single antenna. Let and be the channel gain from SU to the SBS and the PU, respectively. If the transmit power at SU is denoted by , the average sum capacity maximization problem can be formulated as [24]

(P3): | ||||

(27) | ||||

(28) | ||||

(29) |

where , , and stand for the long-term transmit power budget at SU and the IT constraint for the PU, respectively. SU is responsible for computing the optimal transmit power . The optimal solution for (P3) is investigated in [24] using the Lagrange duality formulation and the ellipsoid algorithm. A centralized computation is necessary for the optimal solution in [24] under the assumption that the global channel gains and are perfectly known to SU .

Note that (P3) is regarded as a special case of (P) by setting and with . In the centralized approach, SU forwards the local information to neighboring SU . The deterministic constraint in (29) can be simply handled by the ReLU activation at the output layer, which yields a projection onto a non-negative feasible space with . The overall network solution is obtained using . For the distributed implementation, SU first obtains the quantization for transfer to other SUs and, in turn, calculates its solution using .
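The quantities entering (P3) can be evaluated over a mini-batch as in the following sketch. It assumes unit noise power, base-2 logarithms, and the labels `h` (SU-to-SBS gains), `g` (SU-to-PU gains), and `p` (transmit powers); these normalizations and names are illustrative assumptions rather than the exact formulation of [24].

```python
import math

def relu(x):
    """Project a raw DNN output onto the non-negative orthant."""
    return [max(0.0, xi) for xi in x]

def cmac_batch_metrics(h_batch, g_batch, p_batch):
    """Mini-batch estimates of the sum capacity and the two averaged constraints.

    Assumes unit noise power and base-2 logarithms.
    """
    n = len(p_batch)
    num_su = len(p_batch[0])
    # Average sum capacity at the SBS.
    avg_capacity = sum(
        math.log2(1.0 + sum(h * p for h, p in zip(h_s, p_s)))
        for h_s, p_s in zip(h_batch, p_batch)
    ) / n
    # Long-term transmit power of each SU.
    avg_power = [sum(p_s[i] for p_s in p_batch) / n for i in range(num_su)]
    # Average interference received at the PU.
    avg_interference = sum(
        sum(g * p for g, p in zip(g_s, p_s))
        for g_s, p_s in zip(g_batch, p_batch)
    ) / n
    return avg_capacity, avg_power, avg_interference

# Two hypothetical channel samples for a network with two SUs.
h_batch = [[1.0, 2.0], [3.0, 1.0]]
g_batch = [[0.5, 0.5], [1.0, 2.0]]
raw_outputs = [[1.0, -3.0], [-0.2, 1.0]]       # hypothetical pre-ReLU outputs
p_batch = [relu(x) for x in raw_outputs]        # non-negative transmit powers

capacity, power, interference = cmac_batch_metrics(h_batch, g_batch, p_batch)
print(capacity, power, interference)
```

The averaged power and interference values are the quantities that would be compared against the power budget and the IT threshold in the dual update.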

#### V-A2 Interference channels

In an IFC scenario, there are transmitters which send their respective messages to the corresponding receivers at the same time over the same frequency band. The channel state information between a pair of transmitter and receiver is denoted by . We consider two different power control challenges: the sum rate maximization and the minimum rate maximization. First, the sum rate maximization is formulated as

(P4): | ||||

(30) | ||||

(31) |

With a slight abuse of notation, and can be defined as the collection of the channel gains and the transmit power at transmitter , respectively. Two different transmit power constraints are taken into account in (P4): the average power constraint in (30) and the peak power constraint in (31). It is shown in [7] that (P4) is non-convex and NP-hard.
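A per-sample evaluation of the IFC sum rate can be sketched as follows. The indexing convention (`h[i][j]` is the gain from transmitter `j` to receiver `i`), the unit noise power, the base-2 logarithm, and the use of clipping to enforce the peak power constraint are all assumptions made for this illustration.

```python
import math

def clip_power(p, peak):
    """Enforce the peak power constraint by clipping to [0, peak]."""
    return [min(max(pi, 0.0), peak) for pi in p]

def link_rates(h, p, noise=1.0):
    """Achievable rate of each transmitter-receiver pair.

    h[i][j] is assumed to be the gain from transmitter j to receiver i.
    """
    rates = []
    for i in range(len(p)):
        interference = sum(h[i][j] * p[j] for j in range(len(p)) if j != i)
        sinr = h[i][i] * p[i] / (noise + interference)
        rates.append(math.log2(1.0 + sinr))
    return rates

# Hypothetical two-user channel sample and raw power outputs.
h = [[1.0, 0.5],
     [0.5, 1.0]]
raw_p = [2.0, -0.3]
p = clip_power(raw_p, peak=1.0)       # yields [1.0, 0.0]

rates = link_rates(h, p)
sum_rate = sum(rates)
print(p, rates, sum_rate)
```

Averaging `sum_rate` and the clipped powers over a mini-batch gives sample estimates of the objective of (P4) and of the average power in (30).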

For a special case of , the average power constraint in (30) is ignored. In such a case, a locally optimal solution for (P4) is obtained by the WMMSE algorithm [18]. DL techniques have recently been applied to solve (P4) for this special case [7, 8]. Those methods, however, depend on the global CSI , thereby involving centralized computations for obtaining the power control solution. A naive distributed scheme has been proposed in [8], where a zero input vector is applied to the trained DNN in place of the unknown information, but its performance is insufficient in practice. Furthermore, in the case of , there is no generic algorithm for (P4) even with centralized computations.

Next, the minimum rate maximization is expressed as

(P5): | |||

This, in general, turns out to be a non-convex and non-smooth optimization problem. If , the globally optimal solution can be obtained from an iterative algorithm in [34]. However, it requires sharing the power control solutions of all other cell transmitters until convergence, resulting in the use of infinite-capacity backhaul links. Furthermore, in a general case of , no efficient solution for (P5) is available. In optimization theory, additional transformations, such as the epigraph form [6] and Perron-Frobenius theory [34], are necessary to reformulate the analytically intractable objective function of (P5). Rigorous analytical processing of equivalent reformulations for such a purpose typically results in additional auxiliary optimization variables, i.e., an increase in the dimension of the solution space. Successfully addressing (P5) with the proposed DL approaches shows the power of DNNs for tackling non-smooth optimization problems without traditional reformulation techniques.
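The non-smoothness of the minimum-rate objective can be seen in a toy example. The sketch below assumes two links without cross interference, unit direct gains, and unit noise power, which is a deliberate simplification of the IFC model; it compares the one-sided derivatives of the minimum rate with respect to the first transmit power at a point where both links achieve the same rate.

```python
import math

def min_rate(p, h_direct, noise=1.0):
    """Minimum rate over links, ignoring cross interference (toy assumption)."""
    return min(math.log2(1.0 + hi * pi / noise) for hi, pi in zip(h_direct, p))

h_direct = [1.0, 1.0]   # hypothetical direct channel gains
eps = 1e-6

# Both links achieve the same rate at p = [1, 1], so the minimum has a kink.
base = min_rate([1.0, 1.0], h_direct)

# Increasing p_1 does not help: link 2 remains the bottleneck.
right_derivative = (min_rate([1.0 + eps, 1.0], h_direct) - base) / eps

# Decreasing p_1 makes link 1 the bottleneck, so the objective drops.
left_derivative = (base - min_rate([1.0 - eps, 1.0], h_direct)) / eps

print(right_derivative, left_derivative)
```

The one-sided derivatives disagree at this point, which is why classical approaches resort to reformulations such as the epigraph form, whereas SGD-based training only needs a subgradient.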

The proposed DL framework is applied to solve (P4) and (P5). Transmitter now becomes a computing node that yields its power control strategy from the local CSI . (The channel gain can be estimated at receiver using standard channel estimation processes. Receiver can then report it to its corresponding transmitter through reliable uplink feedback channels [8].) Similar to the C-MAC problem (P3), both the centralized and the decentralized DL approaches are readily extended to (P4) and (P5).

### V-B Numerical Results

We test the proposed DL framework for solving the power control problems (P3)-(P5). The details of the implementation and simulation setup are described first. In the centralized approach, the DNN is constructed with 4 hidden layers, each of output dimension . In the distributed approach, the optimizer DNN consists of 3 hidden layers, each of output dimension , while the quantizer DNN is a single-hidden-layer neural network of dimension . The ReLU activation is employed for all hidden layers. Since, in both approaches, each node is equipped with 4 hidden layers with the same ReLU activation, they are implemented with comparable computational complexity at each node. For efficient training, the batch normalization technique is applied at each layer [35]. (Although a more sophisticated DNN structure could be constructed for the applications given in Section V-A, the optimization of the DNN structure is out of the scope of this work.) In the proposed training strategy, the updates of the DNN parameters and the dual variables are carried out by the state-of-the-art SGD algorithm based on the Adam optimizer [36].
The learning rate and the mini-batch size are and , respectively. At each iteration, mini-batch samples, i.e., the channel gains, are independently generated from the exponential distribution with unit mean. (The proposed DL approaches were also found to work well under practical distance-based channel models, with trends similar to those in the simple Rayleigh fading setup.) The number of iterations for the Adam optimizer is set to , and thus a total of samples are applied during the training. The DNN parameters are initialized according to the Xavier initialization [37]. The initial weight matrices are randomly generated as zero-mean Gaussian random matrices with the variance normalized by the input dimension, while all elements of the initial bias vectors are fixed to . The initial value of each dual variable is . During the training, the performance of the DNNs is assessed over the validation set of size samples randomly generated from training mini-batch sets. Finally, the performance of the trained DNN is examined over the testing set with samples which are not known to the DNNs during the training and the validation. The proposed training algorithms are implemented in Python 3.6 with TensorFlow 1.4.0.
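The data generation and initialization steps described above can be sketched with the Python standard library. The batch size, layer dimensions, and the `bias_init` value used below are hypothetical placeholders, and the weight standard deviation `sqrt(1 / fan_in)` is one reading of "variance normalized by the input dimension"; the actual TensorFlow implementation may differ.

```python
import math
import random

def sample_channel_batch(batch_size, num_gains, rng):
    """Draw channel gains independently from a unit-mean exponential distribution."""
    return [[rng.expovariate(1.0) for _ in range(num_gains)]
            for _ in range(batch_size)]

def init_layer(fan_in, fan_out, rng, bias_init=0.0):
    """Zero-mean Gaussian weights with variance 1 / fan_in and constant biases.

    bias_init is a hypothetical placeholder value.
    """
    std = math.sqrt(1.0 / fan_in)
    weights = [[rng.gauss(0.0, std) for _ in range(fan_in)]
               for _ in range(fan_out)]
    biases = [bias_init] * fan_out
    return weights, biases

rng = random.Random(0)

# Hypothetical sizes, chosen only for illustration.
batch = sample_channel_batch(batch_size=2000, num_gains=4, rng=rng)
weights, biases = init_layer(fan_in=64, fan_out=100, rng=rng)

# Sample statistics for a quick sanity check.
flat_gains = [x for row in batch for x in row]
mean_gain = sum(flat_gains) / len(flat_gains)
flat_w = [w for row in weights for w in row]
var_w = sum(w * w for w in flat_w) / len(flat_w)

print(mean_gain, var_w)
```

Because a fresh mini-batch is drawn at every iteration, the number of training samples equals the mini-batch size multiplied by the number of iterations.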