# Online Learning Algorithms for Stochastic Water-Filling

## Abstract

Water-filling is the term for the classic solution to the problem of
allocating constrained power to a set of parallel channels to
maximize the total data-rate. It is used widely in practice, for
example, for power allocation to sub-carriers in multi-user OFDM
systems such as WiMax. The classic water-filling algorithm is
deterministic and requires perfect knowledge of the channel gain to
noise ratios. In this paper we consider how to do power allocation
over stochastically time-varying (i.i.d.) channels with unknown gain
to noise ratio distributions. We adopt an online learning framework
based on stochastic multi-armed bandits. We consider two variations
of the problem, one in which the goal is to find a power allocation
to maximize $\sum_i \mathbb{E}[\log(1 + \mathrm{SNR}_i)]$, and another
in which the goal is to find a power allocation to maximize
$\sum_i \log(1 + \mathbb{E}[\mathrm{SNR}_i])$. For the first problem,
we propose a *cognitive water-filling* algorithm that we call
CWF1. We show that CWF1 obtains a regret (defined as the cumulative
gap over time between the sum-rate obtained by a distribution-aware
genie and this policy) that grows polynomially in the number of
channels and logarithmically in time, implying that it
asymptotically achieves the optimal time-averaged rate that can be
obtained when the gain distributions are known. For the second
problem, we present an algorithm called CWF2, which is, to our
knowledge, the first algorithm in the literature on stochastic
multi-armed bandits to exploit non-linear dependencies between the
arms. We prove that the number of times CWF2 picks the incorrect
power allocation is bounded by a function that is polynomial in the
number of channels and logarithmic in time, implying that its
frequency of incorrect allocation tends to zero.

## 1 Introduction

A fundamental resource allocation problem that arises in many settings in communication networks is to allocate a constrained amount of power across many parallel channels in order to maximize the sum-rate. Assuming that the power-rate function for each channel is proportional to $\log(1 + \mathrm{SNR})$, as per Shannon's capacity theorem for AWGN channels, it is well known that the optimal power allocation can be determined by a water-filling strategy [Cover:1991]. The classic water-filling solution is a deterministic algorithm and requires perfect knowledge of all channel gain-to-noise ratios.
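For reference, the classic deterministic solution can be sketched as follows. This is a minimal illustration for the continuous-power case with per-channel rate $\log(1 + g_i p_i)$, assuming known, strictly positive gain-to-noise ratios $g_i$; the function name `water_filling` and its arguments are illustrative and do not come from the paper.

```python
def water_filling(gains, p_total):
    """Maximize sum(log(1 + g_i * p_i)) subject to sum(p_i) = p_total, p_i >= 0.

    gains: known, strictly positive gain-to-noise ratios (assumed inputs).
    Returns the per-channel powers p_i = max(0, level - 1/g_i).
    """
    # "Floor heights" 1/g_i, sorted so the best channels come first.
    inv = sorted(1.0 / g for g in gains)
    level = inv[0]
    # Try to fill the k best channels; shrink k until the water level
    # rises above the floor of every channel that is being filled.
    for k in range(len(inv), 0, -1):
        level = (p_total + sum(inv[:k])) / k
        if level > inv[k - 1]:
            break
    return [max(0.0, level - 1.0 / g) for g in gains]


powers = water_filling([2.0, 1.0], 1.0)
```

With these inputs both channels are filled to a common water level, and the stronger channel receives more power. The stochastic setting studied in this paper differs in that the gains are unknown and the power levels are discrete.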

In practice, however, channel gain-to-noise ratios are stochastic
quantities. To handle this randomness, we consider an alternative
approach, based on online learning, specifically stochastic
multi-armed bandits. We formulate the problem of stochastic
water-filling as follows: time is discretized into slots; each
channel’s gain-to-noise ratio is modeled as an i.i.d. random
variable with an unknown distribution. In our general formulation,
the power-to-rate function for each channel is allowed to be any
sub-additive function.^{1}

In the classical multi-armed bandit (MAB) problem, a player repeatedly chooses among a set of arms, each of which yields stochastic rewards that are i.i.d. over time with an unknown mean. The player seeks a policy that maximizes its total expected reward over time. The performance metric of interest in such problems is regret, defined as the cumulative difference in expected reward between a model-aware genie and the given learning policy. It is of interest to show that the regret grows sub-linearly with time, so that the time-averaged regret asymptotically goes to zero, implying that the learning policy asymptotically obtains the time-averaged reward of the model-aware genie.

We show that it is possible to map the problem of stochastic water-filling to an MAB formulation by treating each possible power allocation as an arm. We consider discrete power levels in this paper; if there are $P$ possible power levels for each of $N$ channels, there are $P^N$ total arms. We present a novel combinatorial policy for this problem, which we call CWF1, that yields regret growing polynomially in $N$ and logarithmically over time. Despite the exponentially growing set of arms, CWF1 observes and maintains information for only $P \cdot N$ variables, one corresponding to each power level and channel, and exploits linear dependencies between the arms based on these variables.

Typically, randomness in the channel gain-to-noise ratios is handled by first estimating the mean channel gain-to-noise ratios, averaging over a finite set of training observations, and then using the estimated gains in a deterministic water-filling procedure. Essentially, this approach tries to identify the power allocation that maximizes a pseudo-sum-rate, which is obtained by applying the power-rate equation to the mean channel gain-to-noise ratios (i.e., an optimization of the form $\max \sum_i \log(1 + \mathbb{E}[\mathrm{SNR}_i])$). We also present a different stochastic water-filling algorithm that we call CWF2, which learns to do this in an online fashion. This algorithm observes and maintains information for $N$ variables, one corresponding to each channel, and exploits non-linear dependencies between the arms based on these variables. To our knowledge, CWF2 is the first MAB algorithm to exploit non-linear dependencies between the arms. We show that the number of times CWF2 plays a non-optimal combination of powers is uniformly bounded by a function that is logarithmic in time. Under some restrictive conditions, CWF2 may also solve the first problem more efficiently.

## 2 Related Work

The classic water-filling strategy is described in
[Cover:1991]. There are a few other stochastic variations of
water-filling that have been covered in the literature that are
different in spirit from our formulation. When a fading distribution
over the gains is known *a priori*, the power constraint is
expressed over time, and *the instantaneous gains are also
known*, then a deterministic joint frequency-time water-filling
strategy can be used [GoldsmithVaraiya, GoldsmithBook].
In [Wang:2010], a stochastic gradient approach based on
Lagrange duality is proposed to solve this problem when the fading
distribution is unknown but still instantaneous gains are available.
By contrast, in our work we do not assume that the instantaneous
gains are known, and focus on keeping the same power constraint at
each time while considering unknown gain distributions.

Another work [Zaidi:2005] considers water-filling over stochastic non-stationary fading channels, and proposes an adaptive learning algorithm that tracks the time-varying optimal power allocation by incorporating a forgetting factor. However, the focus of their algorithm is on minimizing the maximum mean squared error assuming imperfect channel estimates, and they prove only that their algorithm would converge in a stationary setting. Although their algorithm can be viewed as a learning mechanism, they do not treat stochastic water-filling from the perspective of multi-armed bandits, which is a novel contribution of our work. In our work, we focus on a stationary setting with perfect channel estimates, but prove stronger results, showing that our learning algorithm not only converges to the optimal allocation but does so with sub-linear regret.

There has been a long line of work on stochastic multi-armed bandits involving playing arms yielding stochastically time varying rewards with unknown distributions. Several authors [Gai:LLR, Anantharam, Agrawal:1995, Auer:2002] present learning policies that yield regret growing logarithmically over time (asymptotically, in the case of [Gai:LLR, Anantharam, Agrawal:1995] and uniformly over time in the case of [Auer:2002]). Our algorithms build on the UCB1 algorithm proposed in [Auer:2002] but make significant modifications to handle the combinatorial nature of the arms in this problem. CWF1 has some commonalities with the LLR algorithm we recently developed for a completely different problem, that of stochastic combinatorial bipartite matching for channel allocation [Gai:2010], but is modified to account for the non-linear power-rate function in this paper. Other recent work on stochastic MAB has considered decentralized settings [Anandkumar:Infocom:2010, Anandkumar:JSAC, Liu:zhao:2010, Gai:decentralized:globecom], and non-i.i.d. reward processes [Tekin:2010, Tekin:restless:infocom, Qing:ita, Dai:icassp, Gai:rested:globecom]. With respect to this literature, the problem setting for stochastic water-filling is novel in that it involves a non-linear function of the action and unknown variables. In particular, as far as we are aware, our CWF2 policy is the first to exploit the non-linear dependencies between arms to provably improve the regret performance.

## 3 Problem Formulation

We define the stochastic version of the classic communication theory problem of power allocation for maximizing rate over parallel channels (water-filling) as follows.

We consider a system with $N$ channels, where the channel gain-to-noise ratios are unknown random processes $X_i(n)$, $1 \leq i \leq N$. Time is slotted and indexed by $n$. We assume that $X_i(n)$ evolves as an i.i.d. random process over time (i.e., we consider block fading), with the only restriction that its distribution has finite support. Without loss of generality, we normalize $X_i(n) \in [0, 1]$. We do not require that $X_i(n)$ be independent across $i$. This random process is assumed to have a mean $\theta_i = \mathbb{E}[X_i]$ that is unknown to the users. We denote the set of all these means by $\Theta = \{\theta_i\}$.

At each decision period $n$ (also referred to interchangeably as a time slot), an $N$-dimensional action vector $\mathbf{a}(n)$, representing a power allocation on these $N$ channels, is selected under a policy $\pi(n)$. We assume that the power levels are discrete, and we allow any constraint on the selection of power allocations such that they come from a finite set $\mathcal{F}$ (e.g., a maximum total power constraint, or an upper bound on the maximum allowed power per subcarrier). We assume $a_i(n) \geq 0$ for all $1 \leq i \leq N$. When a particular power allocation $\mathbf{a}(n)$ is selected, the channel gain-to-noise ratios corresponding to nonzero components of $\mathbf{a}(n)$ are revealed, i.e., the value of $X_i(n)$ is observed for all $i$ such that $a_i(n) \neq 0$. We denote by $\mathcal{A}_{\mathbf{a}(n)} = \{i : a_i(n) \neq 0,\ 1 \leq i \leq N\}$ the index set of all $a_i(n) \neq 0$ for an allocation $\mathbf{a}$.

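We adopt a general formulation for water-filling, where the sum rate^{2}
obtained at time $n$ under the power allocation $\mathbf{a}(n)$ is defined as

$$R_{\mathbf{a}(n)}(n) = \sum_{i \in \mathcal{A}_{\mathbf{a}(n)}} f_i(a_i(n), X_i(n)), \tag{1}$$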

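where for all $i$, $f_i(a_i(n), X_i(n))$ is a nonlinear continuous increasing sub-additive function in $X_i(n)$, and $f_i(a_i(n), 0) = 0$ for any $a_i(n)$. We assume $f_i$ is defined on $\mathbb{R}^+ \times \mathbb{R}^+$.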

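Our formulation is general enough to include, as a special case, the rate function obtained from Shannon's capacity theorem for AWGN, which is widely used in communication networks:

$$R_{\mathbf{a}(n)}(n) = \sum_{i=1}^{N} \log(1 + a_i(n) X_i(n)).$$

In the typical formulation there is a total power constraint together with individual power constraints, and the corresponding constraint set is

$$\mathcal{F} = \Big\{ \mathbf{a} : \sum_{i=1}^{N} a_i \leq P_{\mathrm{total}},\ 0 \leq a_i \leq P_i \ \ \forall i \Big\},$$

where $P_{\mathrm{total}}$ is the total power constraint and $P_i$ is the maximum allowed power per channel.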
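To make the discrete action set concrete, the following sketch enumerates every allocation that satisfies a total power budget and a per-channel cap, and evaluates the Shannon-style sum-rate for one slot. The grid of power levels, the common per-channel cap, and the function names are assumptions chosen for illustration.

```python
import itertools
import math


def feasible_allocations(levels, n_channels, p_total, p_max):
    """Enumerate the finite set of discrete allocations meeting both constraints.

    levels: candidate per-channel power levels (hypothetical grid, includes 0).
    p_total: total power budget; p_max: per-channel cap (same for all channels here).
    """
    return [
        a
        for a in itertools.product(levels, repeat=n_channels)
        if sum(a) <= p_total and all(0 <= a_i <= p_max for a_i in a)
    ]


def shannon_sum_rate(a, x):
    """One-slot sum-rate: sum of log(1 + a_i * x_i) over channels with a_i != 0."""
    return sum(math.log(1.0 + a_i * x_i) for a_i, x_i in zip(a, x) if a_i != 0)


F = feasible_allocations(levels=[0, 1, 2], n_channels=2, p_total=3, p_max=2)
rate = shannon_sum_rate((1, 0), (0.5, 0.9))
```

Brute-force enumeration is only practical for very small instances, since the number of candidate allocations grows exponentially with the number of channels.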

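Our goal is to maximize the expected sum-rate when the distributions of all $X_i$ are unknown, as shown in (2). We refer to this objective as $O_1$.

$$\max_{\mathbf{a} \in \mathcal{F}} \ \mathbb{E}\Big[ \sum_{i \in \mathcal{A}_{\mathbf{a}}} f_i(a_i, X_i) \Big] \tag{2}$$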

Note that even when the $X_i$ have known distributions, this is a hard combinatorial non-linear stochastic optimization problem. In our setting, with unknown distributions, we can formulate it as a multi-armed bandit problem, where each power allocation $\mathbf{a}(n) \in \mathcal{F}$ is an arm and the reward function has a combinatorial non-linear form. The optimal arms are the ones with the largest expected reward, denoted as $\mathcal{O}^* = \{\mathbf{a}^*\}$. For the rest of the paper, we use $*$ as the index indicating that a parameter is for an optimal arm. If more than one optimal arm exists, $*$ refers to any one of them.

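We note that for the combinatorial multi-armed bandit problem with linear rewards, where the reward function is defined by $R_{\mathbf{a}(n)}(n) = \sum_{i \in \mathcal{A}_{\mathbf{a}(n)}} a_i(n) X_i(n)$, $\mathbf{a}^*$ is a solution to a deterministic optimization problem because $\max_{\mathbf{a} \in \mathcal{F}} \mathbb{E}\big[\sum_{i \in \mathcal{A}_{\mathbf{a}}} a_i X_i\big] = \max_{\mathbf{a} \in \mathcal{F}} \sum_{i \in \mathcal{A}_{\mathbf{a}}} a_i \mathbb{E}[X_i]$. In contrast to the combinatorial multi-armed bandit problem with linear rewards, here $\mathbf{a}^*$ is a solution to a stochastic optimization problem, i.e.,

$$\mathbf{a}^* \in \mathcal{O}^* = \Big\{ \tilde{\mathbf{a}} : \tilde{\mathbf{a}} = \arg\max_{\mathbf{a} \in \mathcal{F}} \mathbb{E}\Big[ \sum_{i \in \mathcal{A}_{\mathbf{a}}} f_i(a_i, X_i) \Big] \Big\}. \tag{3}$$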

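We evaluate policies for $O_1$ with respect to
*regret*, which is defined as the difference between the
expected reward that could be obtained by a genie that can pick an
optimal arm at each time, and that obtained by the given policy.
Note that minimizing the regret is equivalent to maximizing the
expected rewards. Regret can be expressed as:

$$\mathfrak{R}^{\pi}(n) = n R^* - \mathbb{E}\Big[ \sum_{t=1}^{n} R_{\pi(t)}(t) \Big], \tag{4}$$

where $R^* = \max_{\mathbf{a} \in \mathcal{F}} \mathbb{E}\big[ \sum_{i \in \mathcal{A}_{\mathbf{a}}} f_i(a_i, X_i) \big]$, the expected reward of an optimal arm.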

Intuitively, we would like the regret to be as small as possible. If it is sub-linear with respect to time $n$, the time-averaged regret will tend to zero and the maximum possible time-averaged reward can be achieved. Note that the number of arms $|\mathcal{F}|$ can be exponential in the number of unknown random variables $N$.
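As a small illustration of how this quantity accumulates, the sketch below tracks the cumulative gap in expected reward between the best arm and the arms a policy actually played. It assumes the true expected reward of each arm is known, which is only the case in simulation; the function name and the toy numbers are hypothetical.

```python
def cumulative_regret(expected_reward, plays):
    """Cumulative expected-reward gap to the best arm after each play.

    expected_reward: dict mapping arm -> true expected reward (known only in simulation).
    plays: sequence of arms chosen by some policy, one per time slot.
    """
    best = max(expected_reward.values())
    regret, total = [], 0.0
    for arm in plays:
        total += best - expected_reward[arm]  # zero whenever an optimal arm is played
        regret.append(total)
    return regret


trace = cumulative_regret({"a": 1.0, "b": 0.5}, ["b", "a", "b"])
```

A policy with sub-linear regret produces a trace that flattens over time, because sub-optimal arms are played less and less often.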

We also note that for the stochastic version of the water-filling problem, a typical way in practice to deal with the unknown randomness is to estimate the mean channel gain-to-noise ratios first and then find the optimized allocation based on the mean values. This approach tries to identify the power allocation that maximizes the power-rate equation applied to the mean channel gain-to-noise ratios. We refer to the resulting quantity as the sum-pseudo-rate over averaged channels. We denote this objective by $O_2$, as shown in (5).

$$\max_{\mathbf{a} \in \mathcal{F}} \ \sum_{i \in \mathcal{A}_{\mathbf{a}}} f_i(a_i, \mathbb{E}[X_i]) \tag{5}$$
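The expected sum-rate objective and the sum-pseudo-rate objective can prefer different allocations, because for a concave rate function the expectation of the rate is at most the rate at the mean. The toy example below, with made-up two-point distributions and a single power level of 10 assigned to exactly one of two channels, is a sketch of that effect and is not taken from the paper.

```python
import math


def expected_rate(a, dist):
    """E[log(1 + a * X)] for a finite-support distribution given as (value, prob) pairs."""
    return sum(p * math.log(1.0 + a * x) for x, p in dist)


def pseudo_rate(a, dist):
    """log(1 + a * E[X]): the rate function evaluated at the mean gain."""
    mean = sum(p * x for x, p in dist)
    return math.log(1.0 + a * mean)


# Hypothetical channels: channel 1 is constant, channel 2 is on/off with a higher mean.
ch1 = [(0.5, 1.0)]
ch2 = [(0.0, 0.4), (1.0, 0.6)]
power = 10

best_by_expected = max((ch1, ch2), key=lambda d: expected_rate(power, d))
best_by_pseudo = max((ch1, ch2), key=lambda d: pseudo_rate(power, d))
```

Here the expected-rate criterion selects the steady channel while the pseudo-rate criterion selects the on/off channel, which is why the two objectives are analyzed separately.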

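We would also like to develop an online learning policy for $O_2$. Note that the optimal arm $\mathbf{a}^*$ of $O_2$ is a solution to a deterministic optimization problem. So, we evaluate the policies for $O_2$ with respect to the expected total number of times that a non-optimal power allocation is selected. We denote by $T_{\mathbf{a}}(n)$ the number of times that a power allocation $\mathbf{a}$ is picked up to time $n$. We denote $r_{\mathbf{a}} = \sum_{i \in \mathcal{A}_{\mathbf{a}}} f_i(a_i, \mathbb{E}[X_i])$. Let $T^{\pi}_{\mathrm{non}}(n)$ denote the total number of times that a policy $\pi$ selects a power allocation with $r_{\mathbf{a}} < r_{\mathbf{a}^*}$. Denote by $1^{\pi}_t(\mathbf{a})$ the indicator function which is equal to $1$ if $\mathbf{a}$ is selected under policy $\pi$ at time $t$, and $0$ otherwise. Then

$$\mathbb{E}[T^{\pi}_{\mathrm{non}}(n)] = \sum_{t=1}^{n} \sum_{\mathbf{a} : r_{\mathbf{a}} < r_{\mathbf{a}^*}} \mathbb{E}[1^{\pi}_t(\mathbf{a})] = \sum_{\mathbf{a} : r_{\mathbf{a}} < r_{\mathbf{a}^*}} \mathbb{E}[T_{\mathbf{a}}(n)]. \tag{6}$$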

## 4 Online Learning for Maximizing the Sum-Rate

In this section we first present an online learning policy for stochastic water-filling under objective $O_1$.

### 4.1 Policy Design

A straightforward, naive way to solve this problem is to use the UCB1 policy proposed in [Auer:2002]. For UCB1, each power allocation is treated as an arm, and the arm $k$ that maximizes $\hat{Y}_k + \sqrt{\frac{2 \ln n}{m_k}}$ is selected at each time slot, where $\hat{Y}_k$ is the mean observed reward on arm $k$, and $m_k$ is the number of times that arm $k$ has been played. This approach essentially ignores the underlying dependencies across the different arms, requires storage that is linear in the number of arms, and yields regret growing linearly with the number of arms. Since there can be an exponential number of arms, the UCB1 algorithm performs poorly on this problem.
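The naive per-arm approach can be sketched as below, using the standard UCB1 index (sample mean plus $\sqrt{2 \ln n / m_k}$). The function name and the list-based bookkeeping are illustrative assumptions; each entry corresponds to one complete power allocation.

```python
import math


def ucb1_select(mean_reward, play_count, n):
    """Pick an arm by the UCB1 index: mean_k + sqrt(2 * ln(n) / m_k).

    mean_reward[k]: average observed reward of arm k.
    play_count[k]: number of times arm k has been played so far.
    n: current time slot (total plays so far, at least 1).
    """
    # Play every arm once before the index is well defined.
    for k, m in enumerate(play_count):
        if m == 0:
            return k
    return max(
        range(len(mean_reward)),
        key=lambda k: mean_reward[k] + math.sqrt(2.0 * math.log(n) / play_count[k]),
    )


choice = ucb1_select([0.5, 0.45], [100, 1], 101)
```

Because the two lists have one entry per allocation, both storage and the exploration cost scale with the total number of arms, which is the weakness described above.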

We note that for combinatorial optimization problems with linear reward functions, an online learning algorithm LLR has been proposed in [Gai:LLR] as an efficient solution. LLR stores the mean of observed values for every underlying unknown random variable, as well as the number of times each has been observed. So the storage of LLR is linear in the number of unknown random variables, and the analysis in [Gai:LLR] shows LLR achieves a regret that grows logarithmically in time, and polynomially in the number of unknown parameters.

However, for stochastic water-filling with objective $O_1$, where the expectation is outside the non-linear reward function, directly storing the mean observations of $X_i$ will not work.

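To deal with this challenge, we propose to store information for each $(a_i, X_i)$ combination, i.e., for all $1 \leq i \leq N$ and all $a_i$, we define a new set of random variables $Y_{i,a_i} = f_i(a_i, X_i)$. The number of random variables is now $\sum_{i=1}^{N} |\mathcal{B}_i|$, where $\mathcal{B}_i = \{a_i : a_i \neq 0\}$. Note that $\sum_{i=1}^{N} |\mathcal{B}_i| \leq P N$, where $P$ is the number of power levels per channel.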

For this redefined MAB problem with $\sum_{i=1}^{N} |\mathcal{B}_i|$ unknown random variables and linear reward function (7), we propose the following online learning policy, CWF1, for stochastic water-filling, as shown in Algorithm 1.

(8) |

To obtain a tighter regret bound than the LLR algorithm, instead of storing the number of times that each unknown random variable $Y_{i,a_i}$ has been observed, we use a $1 \times N$ vector, denoted $(m_i)_{1 \times N}$, to store the number of times that $X_i$ has been observed up to the current time slot.

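We use a $1 \times \sum_{i=1}^{N} |\mathcal{B}_i|$ vector, denoted $(\bar{Y}_{i,a_i})$, with one entry per channel $i$ and nonzero power level $a_i \in \mathcal{B}_i$, to store the information based on the observed values. This vector is updated as shown in line 12 of Algorithm 1. Each time an arm $\mathbf{a}(n)$ is played, the observed value of $X_i$ is obtained for every channel $i$ with $a_i(n) \neq 0$. For every observed value of $X_i$, $|\mathcal{B}_i|$ values are updated: for all $a_i \in \mathcal{B}_i$, $\bar{Y}_{i,a_i}$, the average of all values of $f_i(a_i, X_i)$ observed up to the current time slot, is updated. The CWF1 policy requires storage linear in $\sum_{i=1}^{N} |\mathcal{B}_i|$.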
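The bookkeeping described above can be sketched for a single channel as follows. This covers only the storage and update step, not the arm-selection rule of CWF1; the class name, the incremental-mean update, and the example rate function are assumptions made for illustration.

```python
import math


class ChannelStats:
    """Per-channel state: one observation count and one running mean per nonzero power level."""

    def __init__(self, power_levels, rate_fn):
        self.levels = [a for a in power_levels if a != 0]  # nonzero levels only
        self.rate_fn = rate_fn                             # hypothetical f(a, x)
        self.count = 0                                     # times this channel's gain was observed
        self.mean = {a: 0.0 for a in self.levels}          # running mean of f(a, X) per level

    def observe(self, x):
        """Use one observed gain x to refresh the running mean for every power level."""
        self.count += 1
        for a in self.levels:
            y = self.rate_fn(a, x)
            self.mean[a] += (y - self.mean[a]) / self.count  # incremental average


stats = ChannelStats([0, 1, 2], lambda a, x: math.log(1.0 + a * x))
stats.observe(0.5)
stats.observe(1.0)
```

A single observation of a channel's gain updates the estimates for all of that channel's power levels at once, so one count per channel suffices even though one mean is kept per channel and power level.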

### 4.2 Analysis of regret

### Footnotes

- A function $f$ is subadditive if $f(x + y) \leq f(x) + f(y)$; any concave function $g$ with $g(0) \geq 0$ (such as $\log(1 + x)$) is subadditive.
- We refer to rate and reward interchangeably in this paper.