Online Constrained Optimization over Time Varying Renewal Systems: An Empirical Method
Xiaohan Wei
Department of Electrical Engineering, University of Southern California
xiaohanw@usc.edu
Michael J. Neely
Department of Electrical Engineering, University of Southern California
mjneely@usc.edu
This paper considers constrained optimization over a renewal system. A controller observes a random event at the beginning of each renewal frame and then chooses an action that affects the duration of the frame, the amount of resources used, and a penalty metric. The goal is to make framewise decisions so as to minimize the time average penalty subject to time average resource constraints. This problem has applications to task processing and communication in data networks, as well as to certain classes of Markov decision problems. We formulate the problem as a dynamic fractional program and propose an online algorithm which adopts an empirical accumulation as a feedback parameter. Prior work considers a ratio method that needs statistical knowledge of the random events. A key feature of the proposed algorithm is that it does not require knowledge of the statistics of the random events. We prove the algorithm satisfies the desired constraints and achieves near optimality with probability 1.
Key words: renewal system, Markov decision processes, stochastic optimization, online optimization
MSC2000 subject classification: 90C15, 90C30, 90C40, 93E35
Consider a system that operates over the timeline of real numbers . The timeline is divided into back-to-back periods called renewal frames, and the start of each frame is called a renewal (see Fig. 1). The system state is refreshed at each renewal. At the start of each renewal frame the controller observes a random event and then takes an action from an action set . The pair affects: (i) the duration of that renewal frame; (ii) a vector of resource expenditures for that frame; (iii) a penalty incurred on that frame. The goal is to choose actions over time to minimize time average penalty subject to time average constraints on the resources.
This problem has applications to task processing and file downloading in computer networks, and to certain classes of Markov decision problems.

Task processing: Consider a device that processes computational tasks back-to-back. Each renewal period corresponds to the time required to complete a single task. The random event observed for task corresponds to a vector of task parameters, including the type, size, and resource requirements for that particular task. The action set consists of different processing mode options, and the specific action determines the processing time, energy expenditure, and task quality. In this case, task quality can be defined as a negative penalty, and the goal is to maximize time average quality subject to power constraints and task completion rate constraints.

File downloading: Consider a wireless device that repeatedly downloads files. The device has two states: active (wants to download a file) and idle (does not want to download a file). Here, denotes the observed wireless channel state, which affects the success probability of downloading a file (and thereby affects the transition probability from active to idle). This example is discussed further in the simulation section.

Markov decision problems: Consider a discrete time Markov decision problem over an infinite horizon and with constraints on average cost per slot (see [1] and [2] for an exposition of Markov decision theory). Assume there is a special state that is recurrent under any sequence of actions (similar assumptions are used in [2]). Renewals are defined as revisitation times to that state. A random event is observed upon each revisitation and affects the Markov properties for all time slots of that frame. This can be viewed as a two-timescale Markov decision problem [3][4]. In this case, the control action chosen on frame is in fact a policy for making decisions over slots until the next renewal. The algorithm of the current paper does not require knowledge of the statistics of , but does require knowledge of the conditional Markov transition probabilities (given ) over each frame .
In the special case when the set of all possible random events is finite, the action set is finite, and the probabilities of are known, the problem can be solved (offline) by finding a solution to a linear fractional program (see also [1] for basic renewal-reward theory). Methods for solving linear fractional programs are in [5][6][7]. The current paper seeks an online method that does not require statistical knowledge and that can handle possibly infinite random event sets and action sets.
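As an illustration of the offline setting, the following is a minimal sketch for the unconstrained special case with finite event and action sets and known probabilities. It uses a Dinkelbach-style fixed-point iteration on the ratio, which is a standard technique for fractional programs and is not claimed to be the method of [5][6][7]. The function name and the nested-list data layout are hypothetical, and resource constraints are deliberately left out.

```python
from typing import List, Tuple


def dinkelbach_ratio(
    p: List[float],          # p[w]: known probability of event w
    y: List[List[float]],    # y[w][a]: expected penalty for event w, action a
    T: List[List[float]],    # T[w][a]: expected frame length (assumed > 0)
    iters: int = 100,
) -> Tuple[float, List[int]]:
    """Minimize E[penalty] / E[frame length] over maps from events to actions."""
    theta = 0.0
    choice: List[int] = [0] * len(p)
    for _ in range(iters):
        # For the current ratio guess, pick the best action for each event.
        choice = [
            min(range(len(y[w])), key=lambda a: y[w][a] - theta * T[w][a])
            for w in range(len(p))
        ]
        num = sum(p[w] * y[w][choice[w]] for w in range(len(p)))
        den = sum(p[w] * T[w][choice[w]] for w in range(len(p)))
        new_theta = num / den
        if abs(new_theta - theta) < 1e-12:
            break  # fixed point reached
        theta = new_theta
    return theta, choice


# Made-up example data: two events, two actions each.
p = [0.5, 0.5]
y = [[1.0, 3.0], [2.0, 1.0]]
T = [[1.0, 2.0], [1.0, 2.0]]
theta_star, best_choice = dinkelbach_ratio(p, y, T)
```

The sketch needs the event probabilities to evaluate the two sums, which is exactly the statistical knowledge an online method has to do without.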
Prior work in [8] considers the same renewal optimization problem as the current paper and solves it via a drift-plus-penalty ratio method. This method requires an action to be chosen that minimizes a ratio of expectations. However, this choice requires knowledge of the statistics of . Methods for approximate minimization of the ratio are considered in [8]. A heuristic algorithm is also proposed in [8] that is easier to implement because it does not require knowledge of statistics. That algorithm is partially analyzed: it is shown that if the process converges, then it converges to a near-optimal point. However, whether or not it converges is unknown.
The current paper develops a new algorithm that is easy to implement, requires neither statistics of nor explicit estimation of them, and is fully analyzed with convergence properties that provably hold with probability 1. In particular, the feasibility of the algorithm is justified through a stability analysis of virtual queues, and the near optimality comes from a novel construction of exponential supermartingales. Simulation experiments on a time varying constrained MDP are consistent with the theoretical results.
The renewal system problem considered in this paper is a generalization of stochastic optimization over fixed time slots. Problems are categorized based on whether or not the random event is observed before the decisions are made. Cases where the random event is observed are useful in network optimization problems where max-weight ([9][10]), Lyapunov optimization ([11][12][13][14]), fluid model methods ([15][16]), and dual subgradient methods ([17][18][19]) are often used.
Cases where the random events are not observed are called online convex optimization. Various algorithms have been developed for unconstrained learning, including (but not limited to) the weighted majority algorithm ([20]), the multiplicative weighting algorithm ([21]), follow the perturbed leader ([22]), and online gradient descent ([23][24]). The resource constrained learning problem is studied in [25] and [26]. Moreover, online learning with an underlying MDP structure is also treated using modified multiplicative weighting ([27]) and improved follow the perturbed leader ([28]).
Consider a system where the timeline is divided into back-to-back time periods called frames. At the beginning of frame (), a controller observes the realization of a random variable , which is an i.i.d. copy of a random variable taking values in a compact set with distribution function unknown to the controller. After observing the random event, the controller chooses an action vector . The tuple then induces the following random variables:

The penalty received during frame : .

The length of frame : .

A vector of resource consumptions during frame : .
We assume that given and at frame , is a random vector independent of the outcomes of previous frames, with known expectations. We then denote these conditional expectations as
which are all deterministic functions of and . This notation is useful when we want to highlight the action we choose. The analysis assumes a single action in response to the observed at each frame. Nevertheless, an ergodic MDP can fit into this model by defining the action as a selection of a policy to implement over that frame so that the corresponding , and are expectations over the frame under the chosen policy.
Let
The goal is to minimize the time average penalty subject to constraints on resource consumptions. Specifically, we aim to solve the following fractional programming problem:
(1)  
s.t.  (2)  
(3) 
where are nonnegative constants, and both the minimum and constraint are taken in an almost sure sense. Finally, we use to denote the minimum that can be achieved by solving the above optimization problem. For simplicity of notation, let
(4) 
Our main result requires the following assumptions, whose importance will become clear as we proceed. We begin with the following boundedness assumption:
Assumption 1 (Exponential type)
Given and for a fixed , it holds that with probability 1 and are of exponential type, i.e. there exists a constant s.t.
where is a positive constant.
The following proposition is a simple consequence of the above assumption:
Proposition 1
Suppose Assumption 1 holds. Let be any of the three random variables , and for a fixed . Then, given any and ,
The proof follows by expanding the term in a Taylor series and bounding , using the first- and second-order terms, respectively.
Assumption 2
There exists a positive constant large enough so that the optimal objective of , denoted as , falls into with probability 1.
Remark 1
If , then we can find a constant large enough so that . Then, define a new penalty . It is then easy to see that minimizing is equivalent to minimizing and the optimal objective of the new problem is , which is nonnegative.
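The effect of the shift in Remark 1 can be checked numerically. The sketch below assumes that the new penalty is formed by adding a constant multiple of the frame length to the original penalty, so that every ratio of total penalty to total frame length moves up by that same constant. The sample values and the helper name are made up for illustration.

```python
from typing import Sequence


def ratio(penalties: Sequence[float], lengths: Sequence[float]) -> float:
    """Total penalty divided by total frame length."""
    return sum(penalties) / sum(lengths)


# Made-up per-frame penalties (some negative) and frame lengths.
ys = [-3.0, 1.0, -2.0]
ts = [1.0, 2.0, 1.0]
shift = 5.0  # assumed constant, chosen large enough to make the ratio nonnegative

# Assumed form of the new penalty: original penalty plus shift times frame length.
shifted = [y + shift * t for y, t in zip(ys, ts)]

original_ratio = ratio(ys, ts)
shifted_ratio = ratio(shifted, ts)
```

Because the shift is the same for every action sequence, the minimizer is unchanged and only the optimal value moves.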
Assumption 3
Let be the performance vector under a certain pair. Then, for any fixed , the set of achievable performance vectors over all is compact.
In order to state the next assumption, we need the notion of randomized stationary policy. We start with the definition:
Definition 1 (Randomized stationary policy)
A randomized stationary policy is an algorithm under which, at the beginning of each frame , after observing the realization , the controller chooses with a conditional probability that depends only on .
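For finite event and action sets, Definition 1 can be illustrated by a lookup table of conditional probabilities. The following is a minimal sketch; the table layout, the event and action labels, and the function name are all hypothetical.

```python
import random
from typing import Dict, Hashable, Sequence, Tuple

# Hypothetical representation: for each event, a list of (action, probability) pairs.
Policy = Dict[Hashable, Sequence[Tuple[Hashable, float]]]


def sample_action(policy: Policy, event: Hashable, rng: random.Random) -> Hashable:
    """Draw an action using only the current event (no history, no frame index)."""
    actions, probs = zip(*policy[event])
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("conditional probabilities must sum to 1")
    return rng.choices(actions, weights=probs, k=1)[0]


# Made-up example: two channel-like events and two processing-like actions.
policy: Policy = {
    "good": [("fast", 0.7), ("slow", 0.3)],
    "bad": [("slow", 1.0)],
}
rng = random.Random(0)
draws = [sample_action(policy, "bad", rng) for _ in range(5)]
```

The function signature makes the stationarity explicit: the frame index and past outcomes are not inputs, so the same event always yields the same action distribution.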
Assumption 4 (Bounded achievable region)
Let
be the one-shot average of one randomized stationary policy. Let be the set of all achievable one-shot averages . Then, is bounded.
Assumption 5 (Slackness)
There exists a randomized stationary policy such that the following holds,
where is a constant.
Remark 2 (Measurability issue)
We implicitly assume the policies for choosing in reaction to result in a measurable , so that , , are valid random variables and the expectations in Assumptions 4 and 5 are well defined. This assumption is mild. For example, when the sets and are finite, it holds for any randomized stationary policy. More generally, if and are measurable subsets of some separable metric spaces, this holds whenever the conditional probability in Definition 1 is “regular” (see [33] for an exposition of regular conditional probability and related topics), and , , are continuous functions on .
We define a vector of virtual queues which are 0 at and updated as follows:
(5) 
The intuition behind this virtual queue idea is that if the algorithm can stabilize , then the “arrival rate” is below the “service rate” and the constraint is satisfied. The proposed algorithm then proceeds as in Algorithm 1 via two fixed parameters , , and an additional process that is initialized to be . For any real number , the notation denotes the following ceiling-and-floor (truncation) function:
Note that we can rewrite (6) as the following deterministic form:
Thus, Algorithm 1 proceeds by observing on each frame and then choosing in to minimize the above deterministic function. We can now see that we only use knowledge of the current realization , not statistics of . Also, the compactness assumption (Assumption 3) guarantees that the minimum of (6) is always achievable. For the rest of the paper, we introduce several abbreviations:
The intuitive reason why we need the trimmed pseudo average is to ensure the empirical accumulation does not blow up, which plays an important role in the analysis.

Algorithm 1

At the beginning of each frame , the controller observes , , and chooses action to minimize the following function:
(6) 
Update :

Update virtual queues :
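The per-frame steps above can be sketched as follows. This is only an illustration under stated assumptions: a finite action set, a score of drift-plus-penalty form (a tradeoff weight times the expected penalty minus the feedback parameter times the expected frame length, plus queue-weighted constraint terms), and a max-with-zero virtual queue update. The names clip, choose_action, update_queues and the callable expected are hypothetical, and the exact update of the feedback parameter is not reproduced; only the truncation step is shown.

```python
from typing import Callable, List, Sequence, Tuple

# expected(action) -> (expected penalty, expected frame length, expected resources),
# all conditioned on the event observed at the start of the frame (assumed known).
Expected = Callable[[str], Tuple[float, float, Sequence[float]]]


def clip(x: float, lo: float, hi: float) -> float:
    """Truncate x to [lo, hi]; used to keep the feedback parameter bounded."""
    return max(lo, min(hi, x))


def choose_action(actions: Sequence[str], expected: Expected, V: float,
                  theta: float, queues: Sequence[float],
                  limits: Sequence[float]) -> str:
    """Pick the action minimizing the assumed drift-plus-penalty score."""
    def score(a: str) -> float:
        y_hat, t_hat, z_hat = expected(a)
        penalty = V * (y_hat - theta * t_hat)
        drift = sum(q * (z - c * t_hat) for q, z, c in zip(queues, z_hat, limits))
        return penalty + drift
    return min(actions, key=score)


def update_queues(queues: Sequence[float], z: Sequence[float],
                  frame_len: float, limits: Sequence[float]) -> List[float]:
    """Assumed update: add resource use, subtract the allowance, floor at zero."""
    return [max(q + zl - cl * frame_len, 0.0)
            for q, zl, cl in zip(queues, z, limits)]


# Made-up example: action "a" is cheap but resource hungry, "b" the opposite.
table = {"a": (1.0, 1.0, [2.0]), "b": (2.0, 1.0, [0.5])}
first = choose_action(["a", "b"], table.__getitem__, V=1.0, theta=0.0,
                      queues=[0.0], limits=[1.0])
later = choose_action(["a", "b"], table.__getitem__, V=1.0, theta=0.0,
                      queues=[10.0], limits=[1.0])
```

With an empty queue the low-penalty action is selected, while a large backlog tips the choice toward the resource-saving action. Note that only the current event enters the decision through the expected values, not the event distribution.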
In this section, we prove that the proposed algorithm gives a sequence of actions which satisfies all desired constraints with probability 1. Specifically, we show that all virtual queues are stable with probability 1, leveraging an important lemma from [30] to obtain an exponential bound on the norm of .
The start of our proof uses the drift-plus-penalty methodology (see [29] for a general introduction). We define the 2-norm function of the virtual queue vector as:
Define the Lyapunov drift as
Next, define the penalty function at frame as , where is a fixed tradeoff parameter. Then, the drift-plus-penalty methodology suggests that we can stabilize the virtual queues by choosing an action to greedily minimize the following drift-plus-penalty expression, with the observed , and :
The penalty term uses the variable, which depends on events from all previous frames. This penalty does not fit the rubric of [29] and convergence of the algorithm does not follow from prior work. A significant thrust of the current paper is convergence analysis via the trimmed pseudo averages defined in the previous subsection.
In order to obtain an upper bound on , we square both sides of (5) and use the fact that ,
(7) 
Then we have
(8) 
where the last inequality follows from Proposition 1. Thus, as we have already seen in Algorithm 1, the proposed algorithm observes the vector , the random event and the trimmed pseudo average at frame , and minimizes the right hand side of (8).
In this section, we show how the bound (8) leads to the feasibility of the proposed algorithm. Define as the system history information up until frame . Formally, is a filtration where each is the σ-algebra generated by all the random variables before frame . Notice that since and depend only on the events before frame , contains both and . The following important lemma gives a stability criterion for any given real random process with a certain negative drift property:
Lemma 1 (Theorem 2.3 of [30])
Let be a real random process over satisfying the following two conditions for a fixed :

For any , , for some .

Given , , with some .
Suppose further that is given and finite, then, at every , the following bound holds:
Thus, in order to show the stability of the virtual queue process, it is enough to test the above two conditions with . The following lemma shows that satisfies these two conditions:
Lemma 2 (Drift condition)
The central idea of the proof is to plug the slackness policy specified in Assumption 5 into the right hand side of (8). A similar idea has been presented in Lemma 6 of [31] under the bounded increment of the virtual queue process. Here, we generalize the idea to the case where the increment of the virtual queues contains exponential type random variables and . Note that the boundedness of is crucial for the argument to hold, which justifies the truncation of the pseudo average in the algorithm. Lemma 2 is proved in the Appendix.
Combining the above two lemmas, we immediately have the following corollary:
Corollary 1 (Exponential decay)
Given , the following holds for any under the proposed algorithm,
(9) 
where
and are as defined in Lemma 2.
With Corollary 1 in hand, we can prove the following theorem:
Theorem 1 (Feasibility)
By queue updating rule (5), for any and any , one has
Fix as a positive integer. Then, summing over all ,
Since and ,
(10) 
Define the event
By the Markov inequality and Corollary 1, for any , we have
where is defined in Corollary 1. Thus, we have
Thus, by the Borel-Cantelli lemma ([33]),
Since is arbitrary, letting gives
Finally, taking the from both sides of (10) and substituting in the above equation gives the claim.
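The first step of the argument, that the final queue value dominates the accumulated constraint excess, can be checked on simulated data. The sketch below assumes a single virtual queue with a max-with-zero update of the form used in our illustration (resource use added, allowance times frame length subtracted); the distributions and the name run_queue are made up.

```python
import random
from typing import Sequence


def run_queue(z: Sequence[float], T: Sequence[float], c: float) -> float:
    """Run the assumed virtual queue update from zero and return the final value."""
    q = 0.0
    for zn, tn in zip(z, T):
        q = max(q + zn - c * tn, 0.0)
    return q


rng = random.Random(1)
T = [rng.uniform(0.5, 2.0) for _ in range(1000)]  # made-up frame lengths
z = [rng.uniform(0.0, 2.0) for _ in range(1000)]  # made-up resource usage
c = 1.0                                           # made-up constraint level

q_final = run_queue(z, T, c)
excess = sum(z) - c * sum(T)  # accumulated constraint violation
```

Since each update satisfies q_next >= q + z - c * T, summing over frames gives excess <= q_final, so a queue that grows sublinearly in the number of frames forces the time average constraint to hold.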
In this section, we show that the proposed algorithm achieves time average penalty within of the optimal objective . Since the algorithm meets all the constraints, it follows,
Thus, it is enough to prove the following theorem:
Theorem 2 (Near optimality)
For any and , the objective value produced by the proposed algorithm is near optimal with
i.e. the algorithm achieves near optimality.
The proof of this theorem is relatively involved, so we sketch the roadmap of the proof before giving the details.
The key point of the proof is to bound the pseudo average asymptotically from above by , which is achieved in Theorem 3 below. We then prove Theorem 3 through the following three-step construction:

Introduce a termwise truncated version of , denoted as , which has the same limit as (shown in Lemma 5), so that it is enough to show asymptotically.
We start with a preparation lemma illustrating that the original pseudo average behaves almost the same as the trimmed pseudo average . Recall that is defined as:
Lemma 3 (Equivalence relation)
For any ,

if and only if .

if and only if .

if and only if .

if and only if .
This lemma is intuitive and the proof is shown in the Appendix. We will see that this is sometimes easier to work with than , and we will prove results on which extend naturally to by Lemma 3.
The following lemma states that the optimality of (1)-(3) is achievable within the closure of the set of all one-shot averages specified in Assumption 4:
Lemma 4 (Stationary optimality)
The proof of this lemma is similar to the proof of Theorem 4.5 as well as Lemma 7.1 of [32]. We omit the details for brevity.
We start the truncation by choosing an small enough so that . We aim to show . By Lemma 3, it is enough to show . The following lemma tells us it is enough to prove it on a further termwise truncated version of .
Lemma 5 (Truncation lemma)
Consider any frame such that there is a discrepancy between the summand of and , i.e.
(13) 
By the Cauchy-Schwarz inequality, this implies
Thus, at least one of the following three events must occur:

.

.

.
where is defined in (4). Indeed, the occurrence of one of the three events is necessary for (13) to happen. We then argue that these three events jointly occur only finitely many times. Thus, as , the discrepancies are negligible.
Assume the event occurs. Then, since , it follows that . Hence, we have
where the second to last inequality follows from the Markov inequality and the last inequality follows from Assumption 1.
Assume the event occurs. Then we have
Thus,
where the second to last inequality follows from the Markov inequality and the last inequality follows from Corollary 1.
Assume the event occurs. Again, by Assumption 1 and the Markov inequality,
where the last inequality follows from Assumption 1 again. Now, by a union bound,
and thus,
By the Borel-Cantelli lemma, the joint event occurs only finitely many times with probability 1, which completes the proof.
Lemma 5 is crucial for the rest of the proof. Specifically, it creates an alternative sequence which has the following two properties:

We know exactly what the upper bound of each of the summands is, whereas in , there is no exact bound for the summand due to and other exponential type random variables.

For any , we have . Thus, if for some , then, .
The following preliminary lemma demonstrates a negative drift property for each of the summands in .
Lemma 6 (Key feature inequality)
For any , if , then, we have
Since the proposed algorithm minimizes (6) over all possible decisions in , it must achieve a value less than or equal to that of any randomized stationary algorithm . This in turn implies,
Taking expectations of both sides with respect to and using the fact that randomized stationary algorithms are i.i.d. over frames and independent of , we have
for any . Since specified in Lemma 4 is in the closure of , we can replace by the tuple and the inequality still holds. This gives