Jointly Optimal Sensing and Resource Allocation for Multiuser Overlay Cognitive Radios
Abstract
Successful deployment of cognitive radios requires efficient sensing of the spectrum and dynamic adaptation of the available resources according to the sensed (imperfect) information. While most works design these two tasks separately, in this paper we address them jointly. In particular, we investigate an overlay cognitive radio with multiple secondary users that access orthogonally a set of frequency bands originally devoted to primary users. The schemes are designed to minimize the cost of sensing, maximize the performance of the secondary users (weighted sum rate), and limit the probability of interfering with the primary users. The joint design is addressed using dynamic programming and nonlinear optimization techniques. We implement a two-step strategy that first finds the optimal resource allocation for any sensing scheme and then uses that solution as input to solve for the optimal sensing policy. The two-step strategy is optimal, gives rise to intuitive optimal policies, and entails a computational complexity much lower than that required to solve the original formulation.
Abbreviations: CR, cognitive radio; SU, secondary user; PU, primary user; RA, resource allocation; CSI, channel state information; DP, dynamic programming; POMDP, partially observable Markov decision process; NC, network controller; SIPN, state information of the primary network; SISN, state information of the secondary network; ROC, receiver operating characteristic; SNR, signal-to-noise ratio; QoS, quality of service; w.r.t., with respect to; IRI, instantaneous reward indicator; MDP, Markov decision process.
Cognitive radios, sequential decision making, dual decomposition, partially observable Markov decision processes
I Introduction
Cognitive radios (CRs) are viewed as the next-generation solution to alleviate the perceived spectrum scarcity. When CRs are deployed, the secondary users (SUs) have to sense their radio environment to optimize their communication performance while avoiding (limiting) the interference to the primary users (PUs). As a result, effective operation of CRs requires the implementation of two critical tasks: i) sensing the spectrum and ii) dynamic adaptation of the available resources according to the sensed information [10]. Two important challenges in carrying out the sensing task are: C1) the presence of errors in the measurements, which lead to errors in the channel occupancy detection and thus render harmless SU transmissions impossible; and C2) the inability to sense the totality of the time-frequency lattice due to scarcity of resources (time, energy, or sensing devices). Two additional challenges that arise in carrying out the resource allocation (RA) task are: C3) the need for the RA algorithms to deal with channel imperfections; and C4) the selection of metrics that properly quantify the reward for the SUs and the damage for the PUs.
Many alternatives have been proposed in the CR literature to deal with these challenges. Different forms of imperfect channel state information (CSI), such as quantized or noisy CSI, have been used to deal with C1 [20]. However, in the context of CR, fewer works have considered the fact that the CSI may be not only noisy but also outdated, or have incorporated those imperfections into the design of RA algorithms [6]. The inherent tradeoff between sensing cost and throughput gains in C2 has been investigated [14], and designs that account for it based on convex optimization [24] and dynamic programming (DP) [6] for specific system setups have been proposed. Regarding C3, many works consider that the CSI is imperfect, but only a few exploit the statistical model of these imperfections (especially for the time correlation) to mitigate them; see, e.g., [6, 17]. Finally, different alternatives have been considered to deal with C4 and limit the harm that the SUs cause to the PUs [9]. The most widely used is to set limits on the peak (instantaneous) and average interfering power. Some works have also imposed limits on the rate loss that PUs experience [18, 15], while others limit the instantaneous or average probability of interfering with the PU (bounds on the short-term or long-term outage probability) [22, 17].
Regardless of the challenges addressed and the formulation chosen, the sensing and RA policies have traditionally been designed separately. Each of the tasks has been investigated thoroughly and relevant results are available in the literature. However, a globally optimum design requires designing those tasks jointly, so that the interactions among them can be properly exploited. Clearly, more accurate sensing enables more efficient RA, but at the expense of higher time and/or energy consumption. Early works dealing with joint design of sensing and RA are [28] and [6]. In such works, imperfections in the sensors, and also time correlation of the state of the primary channel, are considered. As a result, the sensing design is modeled as a partially observable Markov decision process (POMDP) [4], which can be viewed as a specific class of DP. The design of the RA in these works amounts to selecting the user transmitting on each channel (also known as user scheduling). Under mild conditions, the authors establish that a separation principle holds in the design of the optimal access and sensing policies. Additional works addressing the joint design of sensing and RA, and considering more complex operating conditions, have been published recently [24, 12]. For a single SU operating over multiple fading channels, [24] relies on convex optimization to optimally design both the RA and the indexes of the channels to be sensed at every time instant. Assuming that the number of channels that can be sensed at every instant is fixed and that the primary activity is independent across time, the author establishes that the channels to sense are the ones that can potentially yield a higher reward for the secondary user. Joint optimal design is also pursued in [12], although for a very different setup. Specifically, [12] postulates that at each slot, the CR must calculate the fraction of time devoted to sensing the channel and the fraction devoted to transmitting in the bands which are found to be unoccupied. Clearly, a tradeoff between sensing accuracy and transmission rate emerges. The design is formulated as an optimal stopping problem, and solved by means of Lagrange relaxation of DP [5]. However, neither of these two works takes into account the temporal correlation of the state information of the primary network (SIPN).
The objective of this work is to design the sensing and the RA policies jointly while accounting for the challenges C1-C4. The specific operating conditions considered in the paper are described next. We analyze an overlay CR with multiple SUs and PUs (some authors refer to overlay networks as interweave networks; see, e.g., [8]). SUs are able to adapt their power and rate loadings and access orthogonally a set of frequency bands. Those bands are originally devoted to PU transmissions. Orthogonally here means that if an SU is transmitting, no other SU can be active in the same band. The schemes are designed to maximize the sum-average rate of the SUs while adhering to constraints that limit the maximum “average power” that SUs transmit and the average “probability of interfering” with the PUs. It is assumed that the CSI of the SU links is instantaneous and free of errors, while the CSI of the PU activity is outdated and noisy. A simple first-order hidden Markov model is used to characterize such imperfections. Sensing a channel band entails a given cost, and at each instant the system has to decide which channels (if any) are sensed.
The jointly optimal sensing and RA schemes will be designed using DP and nonlinear optimization techniques. DP techniques are required because the activity of PUs is assumed to be correlated across time, so that sensing a channel has an impact not only for the current instant, but also for future time instants [28]. To solve the joint design, a two-step strategy is implemented. In the first step, the sensing is considered given and the optimal RA is found for any fixed sensing scheme. This problem was recently solved in [19, 17]. In the second step, the results of the first step are used as input to obtain the optimal sensing policy. The motivation for using this two-step strategy is twofold. First, while the joint design is nonconvex and has to be solved using DP techniques, the problem in the first step (optimal RA for a fixed sensing scheme) can be recast as a convex one. Second, when the optimal RA is substituted back into the original joint design, the resulting problem (which does need to be solved using DP techniques) has a more favorable structure. More specifically, while the original design problem was a constrained DP, the updated one is an unconstrained DP problem which can be solved separately for each of the channels.
The rest of the paper is organized as follows. Sec. II describes the system setup and introduces notation. The optimization problem that gives rise to the optimal sensing and RA schemes is formulated in Sec. III. The solution for the optimal RA given the sensing scheme is presented in Sec. IV. The optimization of the sensing scheme is addressed in Sec. V. The section begins with a brief review of DP and POMDPs. Then, the problem is formulated in the context of DP and its solution is developed. Numerical simulations validating the theoretical claims and providing insights on our optimal schemes are presented in Sec. VI. Sec. VII analyzes the main properties of our jointly optimal RA and sensing policies, provides insights on the operation of such policies, and points out future lines of work.
Notation: denotes the optimal value of variable ; expectation; the boolean “and” operator; the indicator function ( if is true and zero otherwise); and the projection of onto the nonnegative orthant, i.e., .
II System setup and state information
This section describes the basic setup of the system. We begin by briefly describing the system setup and the operation of the system (tasks that the system runs at every time slot). Then, we explain in detail the model for the CSI, which will play a critical role in the problem formulation. The resources that SUs will adapt as a function of the CSI are described in the last part of the section.
We consider a CR scenario with several PUs and SUs. The frequency band of interest (the portion of spectrum that is licensed to PUs, or the subset of this shared with the SUs, if not all) is divided into frequency-flat orthogonal subchannels (indexed by ). Each of the secondary users (indexed by ) opportunistically accesses any number of these channels during a time slot (indexed by ). Opportunistic here means that the user accessing each channel will vary with time, with the objective of optimally utilizing the available channel resources. For simplicity, we assume that there is a network controller (NC) which acts as a central scheduler and will also perform the task of sensing the medium for primary presence. The scheduling information will be forwarded to the mobile stations through a parallel feedback channel. The results hold for one-hop (either cellular or any-to-any) setups.
Next, we briefly describe the operation of the system. A more detailed description will be given in Sec. III, which will rely on the notation and problem formulation introduced in the following sections. Before starting, it is important to clarify that we focus on systems where the SIPN is more difficult to acquire than the state information of the secondary network (SISN). As a result, we will assume that SISN is error-free and acquired at every slot , while SIPN is not. With these considerations in mind, the CR operates as follows. At every slot the following tasks are run sequentially: T1) the NC acquires the SISN; T2) the NC relies on the output of T1 (and on previous measurements) to decide which channels to sense (if any), then the output of the sensing is used to update the SIPN; and T3) the NC uses the outputs of T1 and T2 to find the optimal RA for instant . Overheads associated with acquisition of the SISN and notification of the optimal RA to the SUs are considered negligible. Such an assumption facilitates the analysis, and it is reasonable for scenarios where the SUs are deployed in a relatively small area which allows for low-cost signaling transmissions.
II-A State information and sensing scheme
We begin by introducing the model for the SISN. The noise-normalized square magnitude of the fading coefficient (power gain) of the channel between the th secondary user and its intended receiver on frequency during slot is denoted as . Channels are random, so that is a stochastic process, which is assumed to be independent across time. The values of for all and form the SISN at slot . We assume that the SISN is perfect, so that the values of at every time slot are known perfectly (error-free). While the SISN comprises the power gains of the secondary links, the SIPN accounts for the channel occupancy. We will assume that the primary system contains one user per channel. This assumption is made to simplify the analysis and it is reasonable for certain primary systems, e.g., mobile telephony, where a single narrowband channel is assigned to a single user during the course of a call. Since we consider an overlay scenario, it suffices to know whether a given channel is occupied or not [8]. This way, when a PU is not active, opportunities for SUs to transmit in the corresponding channel arise. The primary system is not assumed to collaborate with the secondary system. Hence, from the point of view of the SUs, the behavior of PUs is a stochastic process independent of . With these considerations in mind, the presence of the primary user in channel at time is represented by the binary state variable (0/idle, 1/busy). Each primary user’s behavior will be modeled as a simple Gilbert-Elliott channel model, so that is assumed to remain constant during the whole time slot, and then change according to a two-state, time-invariant Markov chain. The Markovian property will be useful to keep the DP modeling simple and will also be exploited to recursively keep track of the SIPN. Nonetheless, more advanced models can be considered without paying a big computational price [17, 27]. With , the dynamics for the Gilbert-Elliott model are fully described by the Markov transition matrix . Sec. VII discusses the implications of relaxing some of these assumptions.
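The two-state occupancy model described above can be illustrated with a short simulation. This is only a minimal sketch: the function names, the parameterization through the two "stay" probabilities, and the 0 = idle / 1 = busy encoding of the returned list are assumptions made for illustration, not the paper's notation.

```python
import random


def simulate_occupancy(p_stay_idle, p_stay_busy, n_slots, seed=0, start_state=0):
    """Simulate a two-state (0 = idle, 1 = busy), time-invariant Markov chain.

    The state is constant within a slot and may flip between slots.
    The parameter names are hypothetical; they are not taken from the paper.
    """
    rng = random.Random(seed)
    state = start_state
    trajectory = []
    for _ in range(n_slots):
        trajectory.append(state)
        stay = p_stay_idle if state == 0 else p_stay_busy
        if rng.random() >= stay:  # leave the current state with probability 1 - stay
            state = 1 - state
    return trajectory


def stationary_busy_probability(p_stay_idle, p_stay_busy):
    """Long-run probability of the busy state (assumes the chain can leave at least one state)."""
    leave_idle = 1.0 - p_stay_idle
    leave_busy = 1.0 - p_stay_busy
    return leave_idle / (leave_idle + leave_busy)
```

For instance, with stay probabilities 0.9 (idle) and 0.8 (busy), the long-run busy probability is 0.1 / (0.1 + 0.2), i.e., one third.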
While knowledge of at instant was assumed to be perfect (deterministic), knowledge of at instant is assumed to be imperfect (probabilistic). Two important sources of imperfections are: i) errors in the sensing process and ii) outdated information (because the channels are not always sensed). To model them, let denote a binary design variable which is 1 if the th channel is sensed at time , and 0 otherwise. Moreover, let denote the output of the sensor if indeed ; i.e., if the th channel has been sensed. We will assume that the output of the sensor is binary and may contain errors. To account for asymmetric errors, the probabilities of false alarm and miss detection are considered. Clearly, the specific values of and will depend on the detection technique the sensors implement (matched filter, energy detector, cyclostationary detector, etc.) and the working point of the receiver operating characteristic (ROC) curve, which is usually controlled by selecting a threshold [25]. In our model, this operating point is chosen beforehand and is fixed during the system operation, so that the values of and are assumed known. As already mentioned, the sensing imperfections render the knowledge of at instant probabilistic. In other words, is a partially observable state variable. The knowledge about the value of at instant will be referred to as (instantaneous) belief, also known as the information process. For a given instant , two different beliefs are considered: the pre-decision belief and the post-decision belief . Intuitively, contains the information about before the sensing decision has been made (i.e., at the beginning of task T2), while contains the information about once and (if ) are known (i.e., at the end of task T2). Mathematically, if represents the history of all sensing decisions and measurements, i.e., ; then and . For notational convenience, the beliefs will also be expressed as vectors, with and .
Using basic results from Markov chain theory and provided that (time-correlation model) is known, the expression to get the pre-decision belief at time slot is
(1) 
In contrast, the expression to get depends on the sensing decision . If , no additional information is available, so that
(2) 
If , the belief is corrected as , with
(3) 
where with is a diagonal matrix with entries and . Note that the denominator is the probability of an outcome conditioned on a specified belief: , so that (3) corresponds to the correction step of a Bayesian recursive estimator. If no information about the initial state of the PU is available, the best choice is to initialize to the stationary distribution of the Markov chain associated with channel (i.e., the principal eigenvector of ).
In a nutshell, the actual state of the primary and secondary networks is given by the random processes and , which are assumed to be independent. The operating conditions of our CR are such that at instant , the value of is perfectly known, while the value of is not. As a result, the SIPN is not formed by , but by and , which are a probabilistic description of . The system will perform the sensing and RA tasks based on the available SISN and SIPN. In particular, the sensing decision will be made based on and , while the RA will be implemented based on and .
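The prediction and correction steps described around (1)-(3) can be sketched as a two-state Bayesian filter. The sketch below is an illustration under stated assumptions: beliefs are stored as `[P(idle), P(busy)]`, the transition matrix is row-stochastic (`transition[i][j]` is the probability of moving from state `i` to state `j`), and the sensor reports 1 for "busy"; the paper may use a different matrix convention, and all names are hypothetical.

```python
def predict_belief(post_belief, transition):
    """Prediction step: push the previous post-decision belief through the Markov chain.

    Assumed convention: transition[i][j] = P(next state = j | current state = i).
    """
    idle = post_belief[0] * transition[0][0] + post_belief[1] * transition[1][0]
    busy = post_belief[0] * transition[0][1] + post_belief[1] * transition[1][1]
    return [idle, busy]


def correct_belief(pre_belief, sensed, observation, p_false_alarm, p_miss):
    """Correction step: Bayes' rule with asymmetric sensing errors.

    If the channel was not sensed, the belief is left unchanged.
    """
    if not sensed:
        return list(pre_belief)
    if observation == 1:  # sensor reports "busy"
        likelihood = [p_false_alarm, 1.0 - p_miss]
    else:                 # sensor reports "idle"
        likelihood = [1.0 - p_false_alarm, p_miss]
    unnormalized = [likelihood[0] * pre_belief[0], likelihood[1] * pre_belief[1]]
    evidence = sum(unnormalized)  # probability of the observation under the belief
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the given belief")
    return [u / evidence for u in unnormalized]
```

As a quick check of the arithmetic, a uniform belief `[0.5, 0.5]` followed by a "busy" report with false-alarm probability 0.1 and miss probability 0.2 gives unnormalized weights 0.05 and 0.4, hence a busy probability of 8/9.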
II-B Resources at the secondary network
We consider a secondary network where users are able to implement adaptive modulation and power control, and share orthogonally the available channels. To describe the channel access scheme (scheduling) rigorously, let be a boolean variable so that if SU accesses channel and zero otherwise. Moreover, let be a nonnegative variable denoting the nominal power SU transmits in channel , and let be its corresponding rate. We say that is a nominal power in the sense that power is consumed only if the user is actually accessing the channel. Otherwise the power is zero, so that the actual (effective) power user loads in channel can be written as .
The transmission bit rate is obtained through Shannon’s capacity formula [13]: where is a signal-to-noise ratio (SNR) gap that accounts for the difference between the theoretical capacity and the actual rate achieved by the modulation and coding scheme the SU implements. This is a bijective, nondecreasing, concave function with and it establishes a relationship between power and rate in the sense that controlling implies also controlling .
Orthogonal access implies that, at any time instant, at most one SU can access the channel. Mathematically,
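As an illustration of the power-rate relationship, the following sketch evaluates a capacity-type rate with an SNR gap. The exact expression is not recoverable from the text, so the common form log2(1 + gain * power / gap) is assumed here, with rates in bits per channel use; the function and parameter names are hypothetical.

```python
import math


def rate(gain, power, snr_gap=1.0):
    """Rate in bits per channel use for a noise-normalized power gain.

    Assumed form: log2(1 + gain * power / snr_gap). An snr_gap of 1 corresponds
    to ideal (capacity-achieving) coding; larger gaps model practical schemes.
    """
    return math.log2(1.0 + gain * power / snr_gap)
```

Under this assumed form the rate is zero at zero power, increases with power, and is concave, so doubling the power less than doubles the rate.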
(4) 
Note that (4) allows for the event of all being zero for a given channel . That would happen if, for example, the system believes that it is very likely that channel is occupied by a PU.
III Problem statement
The approach in this paper is to design the sensing and RA schemes as the solution of a judiciously formulated optimization problem. Consequently, it is critical to identify: i) the design (optimization) variables, ii) the state variables, iii) the constraints that design and state variables must obey, and iv) the objective of the optimization problem.
The first two steps were accomplished in the previous section, stating that the design variables are , and (recall that there is no need to optimize over ); and that the state variables are (SISN), and and (SIPN).
Moving to step iii), the constraints that the variables need to satisfy can be grouped into two classes. The first class is formed by constraints that account for the system setup. This class includes constraint (4) as well as the following constraints that were implicitly introduced in the previous section: , and . The second class is formed by constraints that account for quality of service (QoS). In particular, we consider the following two constraints. The first one is a limit on the maximum average (long-term) power an SU can transmit. By enforcing an average consumption constraint, opportunistic strategies are favored because energy can be saved during deep fades (or when the channel is known to be occupied) and used during transmission opportunities. Transmission opportunities are time slots where the channel is certainly known to be idle and the fading conditions are favorable. Mathematically, with denoting such maximum value, the average power constraint is written as:
(5) 
where is a discount factor such that more emphasis is placed on near-future instants. The factor ensures that the averaging operator is normalized; i.e., that . As explained in more detail in Sec. V, using an exponentially decaying average is also useful from a mathematical perspective (convergence and existence of stationary policies are guaranteed).
While the previous constraint guarantees QoS for the SUs, we also need to guarantee a level of QoS for the PUs. As explained in the introduction, there are different strategies to limit the interference that SUs cause to PUs; e.g., by imposing limits on the interfering power at the PUs, or on the rate loss that such interference generates [17]. In this paper, we will guarantee that the long-term probability of a PU being interfered with by SUs is below a certain prespecified threshold . Mathematically, we require for each band . Using Bayes’ theorem, and capitalizing on the fact that both and are boolean variables, the constraint can be rewritten as:
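The role of the normalizing factor can be checked numerically. The sketch below assumes the usual exponentially discounted average, in which slot t is weighted by gamma**t, the sum starts at t = 0, and the normalizing factor is (1 - gamma); these specifics are assumptions, since the expression in (5) is not reproduced here, and the function name is hypothetical.

```python
def discounted_average(values, gamma):
    """Exponentially discounted average of a finite sequence.

    Assumed form: (1 - gamma) * sum_t gamma**t * values[t], with t starting at 0
    and 0 < gamma < 1. Truncating to a finite horizon leaves a residual of
    gamma**len(values) for a constant sequence.
    """
    return (1.0 - gamma) * sum((gamma ** t) * v for t, v in enumerate(values))
```

With this form, a long constant sequence of ones averages to (approximately) one, and a unit of power spent in the first slot weighs more than the same unit spent one slot later.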
(6) 
where , which is assumed known, denotes the stationary probability of the th band being occupied by the corresponding primary user. Writing the constraint in this form reveals its underlying convexity. Before moving to the next step, two clarifications are in order. The first one is on the practicality of (6). Constraints that allow for a certain level of interference are reasonable because error-free sensing is unrealistic. Indeed, our model assumes that even if channel is sensed as idle, there is a probability of being occupied. Moreover, when the interference limit is formulated as a long-term constraint (as it is in our case), there is an additional motivation for the constraint: the system is able to exploit the so-called interference diversity [26]. Such diversity allows SUs to take advantage of very good channel realizations even if they are likely to interfere with PUs. To balance the outcome, SUs will be conservative when channel realizations are not that good and may remain silent even if it is likely that the PU is not present. The second clarification is that we implicitly assumed that SU transmissions are possible even if the PU is present. The reason is twofold. First, the fact that an SU transmitter is interfering with a PU receiver does not necessarily imply that the converse is true. Second, since the NC does not have any control over the power that primary transmitters use, the interfering power at the secondary receiver is a state variable. As such, it could be incorporated into as an additional source of noise.
The fourth (and last) step to formulate the optimization problem is to design the metric (objective) to be maximized. Different utility (reward) and cost functions can be used for such a purpose. As mentioned in the introduction, in this work we are interested in schemes that maximize the weighted sum rate of the SUs and minimize the cost associated with sensing. Specifically, we consider that every time that channel is sensed, the system has to pay a price . We assume that such a price is fixed and known beforehand, but time-varying prices can be accommodated into our formulation too (see Sec. VII-B for additional details). This way, the sensing cost at time is . Similarly, we define the utility for the SUs at time as , where is a user-priority coefficient. Based on these definitions, the utility for our CR at time is . Finally, we aim to maximize the long-term utility of the system denoted by and defined as
(7) 
With these notational conventions, the optimal , and will be obtained as the solution of the following constrained optimization problem.
(8a)  
(8b)  
(8c)  
(8d) 
Note that constraints in (8b) and (8c) affect the design variables involved in the RA task ( and ), while (8d) affects the design variables involved in the sensing task (). Moreover, the reason for writing (8b) and (8c) separately is that (8b) refers to constraints that need to hold for each and every time instant , while (8c) refers to constraints that need to hold in the long term.
The main difficulty in solving (8) is that the solution for all time instants has to be found jointly. The reason is that sensing decisions at instant have an impact not only at that instant, but at future instants too. As a result, a separate per-slot optimization approach is not optimal, and DP techniques have to be used instead. Since DP problems generally have exponential complexity, we will use a two-step strategy to solve (8) which will considerably reduce the computational burden without sacrificing optimality. To explain such a strategy, it is convenient to further clarify the operation of the system. In Sec. II we explained that at each slot , our CR had to implement three main tasks: T1) acquisition of the SISN, T2) sensing and update of the SIPN, and T3) allocation of resources. In what follows, task T2 is split into three subtasks, so that the CR runs five sequential steps:

T1) At the beginning of the slot, the system acquires the exact value of the channel gains ;

T2.1) the Markov transition matrix and the post-decision beliefs of the previous instant are used to obtain the pre-decision beliefs via (1);

T2.2) and are used to find ;

T2.3) the sensing decision and, if the channel is sensed, the sensor output are used to obtain the post-decision beliefs via (2) and (3);

T3) and are used to find the optimal value of and , and the SUs transmit accordingly.
The two-step strategy to solve (8) will proceed as follows. In the first step, we will find the optimal and for any sensing scheme. Such a problem is simpler than the original one in (8) not only because the dimensionality of the optimization space is smaller, but also because we can ignore (drop) all the terms in (8) that depend only on . This will be critical, because if the sensing is not optimized, a per-slot optimization w.r.t. the remaining design variables is feasible. In the second step, we will substitute the output of the first step into (8) and solve for the optimal . Clearly, the output of the first step will be used in T3 while the output of the second step will be used in T2.2. The optimization in the first step (RA) is addressed next, while the optimization in the second step (sensing) is addressed in Sec. V.
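The per-slot sequence of steps can be summarized as a small orchestration routine. This is a hedged sketch of the control flow only: every callable passed in (`acquire_gains`, `predict`, `decide_sensing`, `sense`, `correct`, `allocate`) is a hypothetical placeholder for the corresponding task, and the belief correction after sensing is taken to be the third subtask of T2, which is our reading of the description of the post-decision belief in Sec. II.

```python
def run_slot(prev_post_beliefs, acquire_gains, predict, decide_sensing,
             sense, correct, allocate):
    """Run one time slot of the sensing / resource-allocation cycle.

    prev_post_beliefs: list with one post-decision belief per channel.
    All callables are placeholders supplied by the caller.
    """
    gains = acquire_gains()                                  # T1: error-free SISN
    pre_beliefs = [predict(b) for b in prev_post_beliefs]    # T2.1: prediction via (1)
    decisions = decide_sensing(gains, pre_beliefs)           # T2.2: which channels to sense
    post_beliefs = []
    for channel, (belief, sensed) in enumerate(zip(pre_beliefs, decisions)):
        if sensed:
            # third subtask of T2: Bayesian correction via (3)
            post_beliefs.append(correct(belief, sense(channel)))
        else:
            # not sensed: the belief is carried over unchanged, as in (2)
            post_beliefs.append(belief)
    allocation = allocate(gains, post_beliefs)               # T3: scheduling and powers
    return post_beliefs, allocation
```

The returned post-decision beliefs are the input to the prediction step of the next slot, which is what couples sensing decisions across time.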
IV Optimal RA for the secondary network
As just explained, the objective of this section is to design the optimal RA (scheduling and powers) for a fixed sensing policy. It is worth stressing that solving this problem is convenient because: i) it corresponds to one of the tasks our CR has to implement; ii) it is a much simpler problem than the original problem in (8): it has a smaller dimensionality and, more importantly, can be recast as a convex optimization problem; and iii) it will serve as an input for the design of the optimal sensing, simplifying the task of finding the global solution of (8).
Because in this section the sensing policy is considered given (fixed), is not a design variable, and all the terms that depend only on can be ignored. Specifically, the sensing cost in (8a) and the constraint in (8d) can be dropped. The former implies that the new objective to optimize is . With these considerations in mind, we aim to solve the following problem [cf. (8)]
(9a)  
(9b) 
A slightly modified version of this problem was recently posed and solved in [17]. For this reason we organize the remainder of this section into two parts. The first one summarizes (and adapts) the results in [17], presenting the optimal RA. The second part introduces new variables that will serve as input for the design of the optimal sensing in Sec. V.
IV-A Solving for the RA
It can be shown that after introducing some auxiliary (dummy) variables and relaxing the constraint to , the resultant problem in (9) is convex. Moreover, with probability one the solution to the relaxed problem is the same as that of the original problem; see [17] as well as [16] for details on how to obtain the solution for this problem. The approach to solve (9) is to dualize the long-term constraints in (8c). For such a purpose, let and be the Lagrange multipliers associated with constraints (5) and (6), respectively. It can then be shown that the optimal solution to (9) is
(10)  
(11)  
(12)  
(13) 
Two auxiliary variables and have been defined. Such variables are useful to express the optimal RA but also to gain insights on how the optimal RA operates. Both variables can be viewed as instantaneous reward indicators (IRIs), which represent the reward that can be obtained if is set to one. The indicator considers information only of the secondary network and represents the best achievable tradeoff between the rate and power transmitted by the SU. The risk of interfering with the PU is considered in , which is obtained by adding an interference-related term to . Clearly, the (positive) multipliers and can be viewed as power and interference prices, respectively. Note that (11) dictates that only the user with the highest IRI can access the channel. Moreover, it also establishes that if all users obtain a negative IRI, then none of them should access the channel (in other words, an idle SU with zero IRI would be the winner during that time slot). This is likely to happen if, for example, the probability of the th PU being present is close to one, so that the value of in (12) is high, rendering negative for all .
The expressions in (10)-(13) also reveal the favorable structure of the optimal RA. The only parameters linking users, channels and instants are the multipliers and . Once they are known, the optimal RA can be found separately. Specifically: i) the power for a given user-channel pair, which is the one that maximizes the corresponding IRI (setting the derivative of (13) to zero yields (10)), is found separately from the power for other users and channels; and ii) the optimal scheduling for a given channel, which is the one that maximizes the IRI within the corresponding channel, is found separately from that in other channels. Since the IRIs depend only on information at time once the multipliers are known, the two previous properties imply that the optimal RA can be found separately for each time instant . Additional insights on the optimal RA schemes will be given in Sec. VII-A.
Several methods to set the value of the dual variables and are available. Since, after relaxation, the problem has zero duality gap, there exists a constant (stationary) optimal value for each multiplier, denoted as and , such that substituting and into (10) and (12) yields the optimal solution to the RA problem. Optimal Lagrange multipliers are rarely available in closed form and they have to be found through numerical search, for example by using a dual subgradient method aimed at maximizing the dual function associated with (9) [2]. A different approach is to rely on stochastic approximation tools. Under this approach, the dual variables are rendered time-variant, i.e., and . The objective now is not necessarily to find the exact value of and , but online estimates of them that remain inside a neighborhood of the optimal value. See Sec. VII-C and [24, 17] for further discussion on this issue.
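The per-channel structure described above (a power that maximizes each user's IRI, followed by a winner-takes-all scheduling that may leave the channel empty) can be sketched as follows. Because (10)-(13) are not reproduced here, the concrete forms are assumptions: the rate is taken as log2(1 + gain * power / gap), the user IRI as priority * rate - power_price * power, and the interference-related term as interference_price times the belief that the PU is busy. All names are hypothetical.

```python
import math


def optimal_power(gain, priority, power_price, snr_gap=1.0):
    """Water-filling-type power that maximizes priority*log2(1 + gain*p/gap) - price*p, p >= 0."""
    water_level = priority / (power_price * math.log(2.0))
    return max(water_level - snr_gap / gain, 0.0)


def user_iri(gain, priority, power_price, power, snr_gap=1.0):
    """Assumed user IRI: priority-weighted rate minus the priced power."""
    return priority * math.log2(1.0 + gain * power / snr_gap) - power_price * power


def allocate_channel(gains, priorities, power_prices, interference_price,
                     belief_busy, snr_gap=1.0):
    """Pick the user with the highest strictly positive total IRI on one channel.

    Returns (user_index, power, total_iri); user_index is None if nobody transmits.
    """
    best_user, best_power, best_iri = None, 0.0, 0.0
    for user, (gain, priority, price) in enumerate(zip(gains, priorities, power_prices)):
        power = optimal_power(gain, priority, price, snr_gap)
        total = user_iri(gain, priority, price, power, snr_gap) \
            - interference_price * belief_busy  # assumed interference penalty
        if total > best_iri:
            best_user, best_power, best_iri = user, power, total
    return best_user, best_power, best_iri
```

In this sketch a large interference price combined with a high busy belief drives every total IRI below zero, so the function returns `None` for the user and the channel stays unused, which mirrors the behaviour described for (11) and (12).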
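The stochastic-approximation idea of tracking the multipliers online can be illustrated with a projected subgradient-type update. This is a generic sketch and not the specific iteration of the paper or of [24, 17]: the constant step size, the function names, and the use of a precomputed consumption sequence are assumptions. In an actual scheme, the consumption of each slot would itself depend on the current price through the RA.

```python
def update_price(price, consumed, budget, step_size):
    """One projected update of a dual price (power or interference price).

    The price rises when the slot consumption exceeds the long-term budget,
    falls otherwise, and is projected back onto the nonnegative numbers.
    """
    return max(price + step_size * (consumed - budget), 0.0)


def track_prices(consumptions, budget, step_size, initial_price=0.0):
    """Apply the update along a sequence of per-slot consumptions (open loop, for illustration)."""
    prices = [initial_price]
    for consumed in consumptions:
        prices.append(update_price(prices[-1], consumed, budget, step_size))
    return prices
```

For example, `track_prices([2, 2, 0, 0, 0], budget=1, step_size=0.5)` raises the price while the budget is exceeded and lowers it afterwards, never letting it become negative.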
IV-B RA as input for the design of the optimal sensing
The optimal solution in (10)-(13) will serve as input for the algorithms that design the optimal sensing scheme. For this reason, we introduce some auxiliary notation that will simplify the mathematical derivations in the next section. On top of being useful for the design of the optimal sensing, the results in this section will help us to gain insights and intuition on the properties of the optimal RA. Specifically, let be an auxiliary variable referred to as global IRI, which is defined as
(14) 
Due to the structure of the optimal RA, the IRI for channel can be rewritten as [cf. (11), (12)]:
(15) 
Mathematically, represents the contribution to the Lagrangian of (9) at instant when and for all and . Intuitively, one can view as the instantaneous functional that the optimal RA maximizes at instant .
Key for the design of the optimal sensing is to understand the effect of the belief on the performance of the secondary network and, thus, on . For such a purpose, we first define the IRI for the SUs in channel as . Then, we use to define the nominal IRI vector as
(16) 
Such a vector can be used to write as a function of the belief . Specifically,
(17) 
This suggests that the optimization of the sensing (which affects the value of ) can be performed separately for each of the channels. Moreover, (17) also reveals that can be viewed as the expected IRI: the second entry of is the IRI if the PU is present, the first entry of is the IRI if it is not, and the entries of account for the corresponding probabilities, so that the expectation is taken over the SIPN uncertainties. Equally important, while the value of is only available after making the sensing decision, the value of is available before making such a decision. In other words, sensing decisions do not have an impact on , but only on . These properties will be exploited in the next section.
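The interpretation of (17) as an expectation can be checked numerically with a short Python sketch. The variable names and the numbers are hypothetical; the only structure taken from the text is the ordering of the entries (first entry: PU absent, second entry: PU present).

```python
import numpy as np

def expected_iri(belief, nominal_iri):
    """Expected IRI of a channel: dot product between the belief vector
    [P(PU absent), P(PU present)] and the nominal IRI vector
    [IRI if PU absent, IRI if PU present]."""
    belief = np.asarray(belief, dtype=float)
    nominal_iri = np.asarray(nominal_iri, dtype=float)
    return float(belief @ nominal_iri)

# Hypothetical numbers: the channel is idle with probability 0.8; the reward
# is 2.0 when idle and -3.0 (interference penalty dominates) when occupied.
value = expected_iri([0.8, 0.2], [2.0, -3.0])
```

Only the belief changes with the sensing decision; the nominal IRI vector is the same input in both cases.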
V Optimal sensing
The aim of this section is to leverage the results of Secs. III and IV to design the optimal sensing scheme. Recall that current sensing decisions have an impact not only on the current reward (cost) of the system, but also on future rewards. This in turn implies that future sensing decisions are affected by the current decision, so that the sensing decisions across time form a string of events that has to be optimized jointly. Consequently, the optimization problem has to be posed as a DP. The section is organized as follows. First, we present a brief summary of the relevant concepts related to DP and POMDPs, which will be important to address the design of the optimal sensing for the system setup considered in this paper (Sec. V-A). Readers familiar with DP and POMDPs can skip that section. Then, we substitute the optimal RA policy obtained in Sec. IV into the original optimization problem presented in Sec. III and show that the design of the optimal sensing amounts to solving a set of separate unconstrained DP problems (Sec. V-B). Lastly, we obtain the solution to each of the DP problems formulated (Sec. V-C). It turns out that the optimal sensing leverages: , the sensing cost at time ; the expected channel IRI at time , which basically depends on (SISN) and the pre-decision belief (SIPN); and the future reward for time slots . The future reward is quantified by the value function associated with each channel's DP, which plays a fundamental role in the design of our sensing policies. Intuitively, a channel is sensed if there is uncertainty on the actual channel occupancy (SIPN) and the potential reward for the secondary network is high enough (SISN). The expression for the optimal sensing provided at the end of this section will corroborate this intuition.
V-A Basic concepts about DP
DP is a set of techniques and strategies used to optimize the operation of discrete-time complex systems, where decisions have to be made sequentially and there is a dependency among decisions in different time instants. These systems are modeled as state-space models composed of: a set of state variables ; a set of actions which are available to the controller and which can depend on the state ; a transition function that describes the dynamics of the system as a function of the current state and the action taken , where is a random (innovation) variable; and a function that defines the reward associated with a state transition or a state-action pair . In general, finding the optimal solution of a DP is computationally demanding. Unless the structure of the specific problem can be exploited, complexity grows exponentially with the size of the state space, the size of the action space, and the length of the temporal horizon. This is commonly referred to as the triple curse of dimensionality [21]. Two classical strategies to mitigate such a problem are: i) framing the problem into a specific, previously studied model; and ii) finding approximate solutions that reduce the computational cost in exchange for a small loss of optimality.
DP problems can be classified into finite-horizon and infinite-horizon problems. For the latter class, which is the one corresponding to the problem in this paper, it is assumed that the system is going to be operated during a very large time lapse, so that actions at any time instant are chosen to maximize the expected long-term reward, i.e.,
(18) 
The role of the discount factor is twofold: i) it encourages solutions which are focused on early rewards; and ii) it contributes to stabilizing the numerical calculation of the optimal policies. In particular, the presence of guarantees the existence of a stationary policy, i.e., a policy where the action at a given instant is a function of the system state and not of the time instant. Note that multiplying (18) by factor , so that the objective resembles the one used throughout the paper, does not change the optimal policy.
Key to solving a DP problem is defining the so-called value function that associates a real number with a state and a time instant. This number represents the expected sum reward that can be obtained, provided that we operate the system optimally from the current time instant until the operating horizon. If a minimization formulation is chosen, the value function is also known as the cost-to-go function [3]. The relationship between the optimal action at time and the value function at time , denoted as , is given by Bellman's equations [3, 21]:
(19a)  
(19b) 
where is the information that arrives at time and thus we have to take the expectation over . The value function for different time instants can be recursively computed by using backward induction. Moreover, for infinite-horizon formulations with , it holds that the value function is stationary. As a result, the dependence of on can be dropped and (19) can be rewritten using the stationary value function . In this scenario, alternative techniques that exploit the stationarity of the value function (such as “value iteration” and “policy iteration” [21, Ch. 2]) can be used to compute .
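For readers less familiar with these tools, the following Python sketch implements textbook value iteration for a small, fully observable MDP with finite state and action spaces. It is a generic illustration of how a stationary value function can be computed, not the algorithm used for the sensing problem; the array layout (`P[a, s, s2]`, `R[a, s]`) and all names are assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Value iteration for a finite MDP.

    P[a, s, s2]: probability of moving from state s to s2 under action a.
    R[a, s]:     expected reward of taking action a in state s.
    Returns the (approximately) stationary value function and a greedy policy.
    """
    P = np.asarray(P, dtype=float)
    R = np.asarray(R, dtype=float)
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        Q = R + gamma * (P @ V)          # Q[a, s]: reward now plus discounted future
        V_new = Q.max(axis=0)            # Bellman backup: best action per state
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    policy = (R + gamma * (P @ V)).argmax(axis=0)
    return V, policy

# Toy example: action 0 keeps the state, action 1 swaps it; state 1 pays 1.
P = np.array([[[1, 0], [0, 1]],
              [[0, 1], [1, 0]]])
R = np.array([[0, 1],
              [0, 1]])
V, policy = value_iteration(P, R, gamma=0.9)
```

In the toy example the greedy policy moves to the rewarding state and stays there, so the value of state 1 approaches 1/(1-0.9) = 10.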
V-A1 Partially Observable Markov Decision Processes
MDPs are an important class within DP problems. For such problems, the state transition probabilities depend only on the current state-action pair, the average reward in each step only depends on the state-action pair, and the system state is fully observable. MDPs can have finite or infinite state-action spaces. MDPs with finite state-action spaces can be solved exactly for finite-horizon problems. For infinite-horizon problems, the solution can be approximated with arbitrary precision. A partially observable MDP (i.e., a POMDP) can be viewed as a generalization of an MDP for which the state is not always known perfectly. Only an observation of the state (which may be affected by errors, missing data or ambiguity) is available. To deal with these problems, it is assumed that an observation function, which assigns a probability to each observation depending on the current state and action, is known. When dealing with POMDPs, there is no distinction between actions taken to change the state of the system under operation and actions taken to gain information. This is important because, in general, every action has both types of effect.
The POMDP framework provides a systematic method of using the history of the system (actions and observations) to aid in the disambiguation of the current observation. The key point is the definition of an internal belief state accounting for previous actions and observations. The belief state is useful to infer the most probable state of the system. Formally, the belief state is a probability distribution over the states of the system. Furthermore, for POMDPs this probability distribution constitutes a sufficient statistic for the past history of the system. In other words, the process over belief states is Markov, and no additional data about the past would help to increase the agent's expected reward [1]. The optimal policy of a POMDP agent must map the current belief state into an action. This implies that a discrete state-space POMDP can be reformulated (and viewed) as a continuous-space MDP. This equivalent MDP is defined such that the state space is the set of possible belief states of the POMDP, i.e., the probability simplex of the original state space. The set of actions remains the same, and the state-transition function and the reward functions are redefined over the belief states. More details about how these functions are redefined in general cases can be found in [11]. Clearly, our problem falls into this class. The actual SIPN is Markovian, while the errors in the CSI render the SIPN partially observable. The specific functions corresponding to our problem are presented in the following sections.
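To make the notion of a belief state concrete, the sketch below implements the standard prediction and Bayesian correction steps for a two-state (idle/occupied) Markov channel observed through a binary detector. It is a generic filter written under stated assumptions, not a transcription of the update expressions of Sec. II-A: the state ordering (0 = idle, 1 = occupied), the row convention of the transition matrix, and the parameter names `p_fa` (false alarm) and `p_md` (missed detection) are all hypothetical.

```python
import numpy as np

def predict(belief, P):
    """Prediction step: propagate the belief through the Markov chain.
    P[i][j] is assumed to be the probability of moving from state i to j."""
    return np.asarray(belief, dtype=float) @ np.asarray(P, dtype=float)

def correct(belief, z, p_fa, p_md):
    """Correction step (Bayes rule) after observing the detector output z.
    z = 1 means 'occupied' was reported, z = 0 means 'idle' was reported."""
    belief = np.asarray(belief, dtype=float)
    if z == 1:
        likelihood = np.array([p_fa, 1.0 - p_md])   # P(z=1 | idle), P(z=1 | occupied)
    else:
        likelihood = np.array([1.0 - p_fa, p_md])   # P(z=0 | idle), P(z=0 | occupied)
    posterior = belief * likelihood
    return posterior / posterior.sum()

# Hypothetical usage: uninformative belief, detector reports 'occupied',
# then the belief is propagated to the next time slot.
post = correct([0.5, 0.5], z=1, p_fa=0.1, p_md=0.2)
nxt = predict(post, [[0.95, 0.05], [0.02, 0.98]])
```

When the channel is not sensed, only the prediction step applies, so the belief drifts toward the stationary distribution of the chain.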
V-B Formulating the optimal sensing problem
The aim of this section is to formulate the optimal decision problem as a standard (unconstrained) DP. The main task is to substitute the optimal RA into the original optimization problem in (8). Recall that the optimization in (8) involved variables , and , and the sets of constraints in (8b), (8c) and (8d), the latter requiring . When the optimal solution for , presented in Sec. IV, is substituted into (8), the resulting optimization problem is
(20a)  
(20b) 
where stands for the total utility given the optimal RA and is defined as
(21) 
which, using the definitions introduced in Sec. IV-B, can be rewritten as [cf. (15) and (17)]
(22)  
(23) 
The three main differences between (20) and the original formulation in (8) are that now: i) the only optimization variables are ; ii) because the optimal RA fulfills the constraints in (8b) and (8c), the only constraints that need to be enforced are (8d), which simply require [cf. (20b)]; and iii) as a result of the Lagrangian relaxation of the DP, the objective has been augmented with the terms accounting for the dualized constraints.
Key to finding the solution of (20) will be the facts that: i) is independent of , and ii) is independent of for . The former implies that the state transition functions for do not depend on , while the latter allows us to solve for each of the channels separately. Therefore, we will be able to obtain the optimal sensing policy by solving separate DPs (POMDPs), which will rely only on state information of the corresponding channel. Specifically, the optimal sensing can be found as the solution of the following DP:
(24) 
which can be separated channel-wise. Clearly, the reward function for the th DP is
(25) 
The structure of (25) clearly shows that this is a joint design because affects the two terms in (25). The first term (which accounts for the cost of the sensing scheme) is just the product of the constant and the sensing variable . The second term (which accounts for the reward of the RA) is the dot product of vectors (which does not depend on ) and (which does depend on ). The expression in (25) also reveals that encapsulates all the information pertaining to the SUs which is relevant to find . In other words, in lieu of knowing , and , it suffices to know .
V-C Bellman's equations and optimal solution
To find , we will derive the Bellman's equations associated with (26). For such a purpose, we split the objective in (26) into the present and future rewards and drop the constant factor . Then, (26) can be rewritten as
(27) 
It is clear that the expected reward at time slot depends on (recall that both terms in (25) depend on ). Moreover, the expected reward at time slots also depends on the current . The reason is that for depend on the [cf. (3)]. This is testament to the fact that our problem is indeed a POMDP: current actions that improve the information about the current state also have an impact on the information about the state in future instants.
To account for that effect in the formulation, we need to introduce the value function that quantifies the expected sum reward on channel for all future instants. Recall that, because (26) is an infinite-horizon problem with , the value function is stationary and its existence is guaranteed [cf. Sec. V-A]. Stationarity implies that the expression for does not depend on the specific time instant, but only on the state of the system. Since in our problem the state information is formed by the SISN and the SIPN, should be written as . However, since is i.i.d. across time and independent of , the alternative value function , where denotes that the expectation is taken over all possible values of , can also be considered. The motivation for using instead of is twofold: it emphasizes the fact that the impact of the sensing decisions on the future reward is encapsulated into , and is a one-dimensional function, so that the numerical methods to compute it require a lower computational burden.
Based on the previous notation, the standard Bellman's equations that drive the optimal sensing decision and the value function are [cf. (27)]
(28)  
(29) 
where denotes taking the expectation over the sensor outcomes. Equation (28) exploits the stationarity of the value function, shows the dynamic nature of our problem, and provides further intuition about how sensing decisions have to be designed. The first term in (28), , is the expected short-term reward conditioned on , while the second term, , is the expected long-term sum reward to be obtained in all future time instants, conditioned on and on every future decision being optimal. Equation (29) expresses the condition that the value function must satisfy in order to be optimal (and stationary) and provides a way to compute it iteratively.
Since obtaining the optimal sensing decision at time slot (and also evaluating the stationarity condition for the value function) boils down to evaluating the objective in (28) for and , in the following we obtain the expressions for each of the two terms in (28) for both and . Key for this purpose will be the expressions to update the belief presented in Sec. II-A. Specifically, the expressions in (1)-(3) describe how the future beliefs depend on the current belief, on the set of possible actions (sensing decision), and on the random variables associated with those actions (outcome of the sensing process if the channel is indeed sensed).
The expressions for the expected short-term reward [cf. first summand in (28)] are the following. If , the channel is not sensed, there is no correction step, and the post-decision belief coincides with the pre-decision belief [cf. (2)]. The expected short-term reward in this case is:
(30) 
On the other hand, if , the expected short-term reward is found by averaging over the probability mass of the sensor outcome and subtracting the cost of sensing
(31) 
which, by substituting (3) into (31), yields
(32) 
Once the expressions for the expected short-term reward are known, we find the expressions for the expected long-term sum reward [cf. second summand in (28)] for both and . If , then there is no correction step [cf. (2)], and using (1)
(33) 
On the other hand, if , the belief for instant is corrected according to (3), and updated for instant using the prediction step in (1) as:
(34) 
Clearly, the expressions for the expected long-term reward in (33) and (34) account for the expected value of at time . Substituting (30), (32), (33) and (34) into (29) yields
(35)  
where for the last term we have used the expression for in (3). Equation (35) is useful not only because it reveals the structure of but also because it provides a means to compute the value function numerically (e.g., by using the value iteration algorithm [21, Ch. 3]).
Similarly, we can substitute the expressions (30)-(34) into (28) and get the optimal solution for our sensing problem. Specifically, the sensing decision at time is
(36) 
The most relevant properties of the optimal sensing policy (several of them have already been pointed out) are summarized next: i) it can be found separately for each of the channels; ii) since it amounts to a decision problem, we only have to evaluate the long-term aggregate reward if (the channel is sensed at time ) and that if (the channel is not sensed at time ), and make the decision which gives rise to a higher reward; iii) the reward takes into account not only the sensing cost but also the utility and QoS for the SUs (joint design); iv) the sensing at instant is found as a function of both the instantaneous and the future reward (the problem is a DP); v) the instantaneous reward depends on both the current SISN and the current SIPN, while the future reward depends on the current SIPN and not on the current SISN; and vi) to quantify the future reward, we need to rely on the value function . The input of this function is the SIPN. Additional insights on the optimal sensing policy will be given in Sec. VII-A.
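The structure summarized above can be illustrated with a small numerical sketch that runs value iteration over a discretized belief for a single channel and then compares the "sense" and "do not sense" branches. It is written under several assumptions that are not taken from the paper: the belief is the scalar probability that the PU is present; the instantaneous reward is the positive part of the expected IRI (no user transmits when the expected IRI is negative); the nominal IRI vector takes a finite set of values with known probabilities; the value function is interpolated linearly on a grid; and all function and parameter names are hypothetical. It is not the authors' Matlab code.

```python
import numpy as np

def solve_sensing_dp(P, p_fa, p_md, cost, iri_vectors, iri_probs,
                     gamma=0.9, n_grid=201, tol=1e-6, max_iter=5000):
    """Belief-grid value iteration for one channel.

    P[i][j]:     PU transition probability, state 0 = idle, 1 = present (assumed).
    iri_vectors: rows [IRI if PU absent, IRI if PU present].
    iri_probs:   probability of each row (i.i.d. across time).
    Returns the belief grid, the averaged value function, and a boolean map
    sense[i, j] (True if sensing is preferred at belief i and IRI row j).
    """
    grid = np.linspace(0.0, 1.0, n_grid)       # P(PU present) before sensing
    iri = np.asarray(iri_vectors, dtype=float)
    q = np.asarray(iri_probs, dtype=float)

    def reward(a):
        # Positive part of the expected IRI, for every belief and IRI row.
        exp_iri = np.outer(1.0 - a, iri[:, 0]) + np.outer(a, iri[:, 1])
        return np.maximum(exp_iri, 0.0)

    def predict(a):
        # Prediction step of the two-state Markov chain.
        return (1.0 - a) * P[0][1] + a * P[1][1]

    # Probability of each detector output and corrected (post-sensing) beliefs.
    pz1 = (1.0 - grid) * p_fa + grid * (1.0 - p_md)
    pz0 = 1.0 - pz1
    a_z1 = grid * (1.0 - p_md) / np.maximum(pz1, 1e-12)
    a_z0 = grid * p_md / np.maximum(pz0, 1e-12)

    r0 = reward(grid)                                              # no sensing
    r1 = pz1[:, None] * reward(a_z1) + pz0[:, None] * reward(a_z0) - cost

    V = np.zeros(n_grid)
    for _ in range(max_iter):
        f0 = np.interp(predict(grid), grid, V)                     # future, no sensing
        f1 = (pz1 * np.interp(predict(a_z1), grid, V)
              + pz0 * np.interp(predict(a_z0), grid, V))           # future, sensing
        q0 = r0 + gamma * f0[:, None]
        q1 = r1 + gamma * f1[:, None]
        V_new = np.maximum(q0, q1) @ q                             # average over IRI rows
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return grid, V, q1 > q0

# Hypothetical parameters: a single nominal IRI vector and a cheap sensor.
grid, V, sense = solve_sensing_dp(P=[[0.95, 0.05], [0.02, 0.98]],
                                  p_fa=0.1, p_md=0.1, cost=0.01,
                                  iri_vectors=[[2.0, -5.0]], iri_probs=[1.0])
```

Plotting `sense` against the belief grid for several IRI rows gives a crude counterpart of a decision map; the result depends strongly on the assumed reward model and parameters.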
VI Numerical results
Numerical experiments to corroborate the theoretical findings and gain insights on the optimal policies are implemented in this section. Since an RA scheme similar to the one presented in this paper was analyzed in [17], the focus is on analyzing the properties of the optimal sensing scheme. Interested readers can find additional simulations as well as the Matlab codes used to run them at http://www.tsc.urjc.es/~amarques/simulations/NumSimulations_lramjr12.html.
The experiments are grouped into two test cases. In the first one, we compare the performance of our algorithms with that of other existing (suboptimal) alternatives. Moreover, we analyze the behavior of the sensing schemes and assess the impact of varying different parameters (correlation of the PUs' activity, sensing cost, sensor quality, and average SNR). In the second test case, we provide a graphical representation of the sensing functions in the form of two-dimensional decision maps. Such a representation will help us to understand the behavior of the optimal schemes.
The parameters for the default test case are listed in Table I. Four channels are considered, each of them with different values for the sensor quality, the sensing cost and the QoS requirements. In most cases, the value of has been chosen to be larger than the value of (so that the cognitive diversity can be effectively exploited), while the values of the remaining parameters have been chosen so that the test case yields illustrative results. The secondary links follow a Rayleigh model and the frequency selectivity is such that the gains are uncorrelated across channels. The parameters not listed in the table are set to one.
SNR

1  5 dB  0.09  0.08  [0.95, 0.05; 0.02, 0.98]  1.00  0.30  1  20.0 
2  5 dB  0.09  0.08  [0.95, 0.05; 0.02, 0.98]  1.80  0.05  2  16.0 
3  5 dB  0.05  0.03  [0.95, 0.05; 0.02, 0.98]  1.00  0.10  3  18.0 
4  5 dB  0.05  0.03  [0.95, 0.05; 0.02, 0.98]  1.80  0.10  4  10.0 
Test case 1: Optimality and performance analysis. The objective here is twofold. First, we want to numerically demonstrate that our schemes are indeed optimal. Second, we are also interested in assessing the loss of optimality incurred by suboptimal schemes with low computational burden. Specifically, the optimal sensing scheme is compared with the three suboptimal alternatives described next. A) A myopic policy, which is implemented by setting . This is equivalent to the greedy sensing and RA technique proposed in [24], since it only accounts for the reward of sensing in the current time slot and not in the subsequent time slots. B) A policy which replaces the infinite-horizon value function with a horizon-1 value function. In other words, a sensing policy that makes the sensing decision at time considering the (expected) reward for instants and . C) A rule-of-thumb sensing scheme implementing the simple (separable) decision function: . In words, the channel is sensed if and only if the following two conditions are satisfied: a) the channel's IRI is greater than the sensing cost and less than the interfering cost minus the sensing cost; and b) the uncertainty on the primary occupancy is higher than that obtained from a unique, isolated measurement.
Results are plotted in Fig. 1. The slight lack of monotonicity observed in the curves is due to the fact that simulations have been run using a Monte Carlo approach. As expected, the optimal sensing scheme achieves the best performance for all test cases. Moreover, Figs. 1 and 1 reveal that the horizon-1 value function approximation constitutes a good approximation to the optimal value function in two cases: i) when the expected transition time is short (low time correlation) and ii) when the sensing cost is relatively small. The performance of the myopic policy is shown to be far from optimal. This finding is in disagreement with the results obtained for simpler models in the opportunistic spectrum access literature [28], where it was suggested that the myopic policy could be a good approximation to solve the associated POMDP efficiently. The reason may be that the CR models considered were substantially different (the RA schemes in this paper are more complex and the interference constraints are formulated differently). In fact, the only cases where the myopic policy seems to approximate the optimal performance are: i) if , which is expected because then the optimal policy is to sense at every time instant; and ii) if the PUs' activity is not correlated across time (which was the assumption in [24]).
Fig. 1 suggests that the benefits of implementing the optimal sensing policies are stronger when sensors are inaccurate. In other words, the proposed schemes can help to soften the negative impact of deploying low-quality (cheap) sensing devices. Finally, results in Fig. 1 also suggest that changes in the average SNR between SU and NC have similar effects on the performance of all analyzed schemes.
Test case 2: Sensing decision maps. To gain insights on the behavior of the optimal sensing schemes, Fig. 2 plots the sensing decisions as a function of and . Simulations are run using the parameters for the default test case (see Table I) and each subplot corresponds to a different channel . Since the domain of the sensing decision function is two dimensional, the function itself can be efficiently represented as an image (map). Two primary regions are identified, one corresponding to the pairs which give rise to , and one corresponding to the pairs giving rise to . Moreover, the region where is split into two subregions, the first one corresponding to (i.e., when there is a user accessing the channel) and the second one when (i.e., when the system decides that no user will access the channel). Note that for the region where , the access decision basically depends on the outcome of the sensing process (in fact, it can be rigorously shown that