# Solving Factored MDPs with Hybrid State and Action Variables

Branislav Kveton bkveton@cs.pitt.edu
Intelligent Systems Program
5406 Sennott Square
University of Pittsburgh
Pittsburgh, PA 15260

Milos Hauskrecht milos@cs.pitt.edu
Department of Computer Science
5329 Sennott Square
University of Pittsburgh
Pittsburgh, PA 15260

Carlos Guestrin guestrin@cs.cmu.edu
Machine Learning Department and
Computer Science Department
5313 Wean Hall
Carnegie Mellon University
Pittsburgh, PA 15213

###### Abstract

Efficient representations and solutions for large decision problems with continuous and discrete variables are among the most important challenges faced by the designers of automated decision support systems. In this paper, we describe a novel hybrid factored Markov decision process (MDP) model that allows for a compact representation of these problems, and a new hybrid approximate linear programming (HALP) framework that permits their efficient solutions. The central idea of HALP is to approximate the optimal value function by a linear combination of basis functions and optimize its weights by linear programming. We analyze both theoretical and computational aspects of this approach, and demonstrate its scale-up potential on several hybrid optimization problems.

## 1 Introduction

A dynamic decision problem with components of uncertainty can very often be formulated as a Markov decision process (MDP). An MDP represents a controlled stochastic process whose dynamics are described by state transitions. Objectives of the control are modeled by rewards (or costs), which are assigned to state-action configurations. In the simplest form, the states and actions of an MDP are discrete and unstructured. These models can be solved efficiently by standard dynamic programming methods (?, ?, ?).

Unfortunately, textbook models rarely meet the needs of practice. First, real-world decision problems are naturally described in a factored form and may involve a combination of discrete and continuous variables. Second, there are no guarantees that compact forms of the optimal value function or policy for these problems exist. Therefore, hybrid optimization problems are usually discretized and solved approximately by the methods for discrete-state MDPs. The contribution of this work is a principled, sound, and efficient approach to solving large-scale factored MDPs that avoids this discretization step.

Our framework is based on approximate linear programming (ALP) (?), which has already been applied to solve decision problems with discrete state and action variables efficiently (?, ?, ?). These applications include context-specific planning (?), multiagent planning (?), relational MDPs (?), and first-order MDPs (?). In this work, we show how to adapt ALP to solving large-scale factored MDPs in hybrid state and action spaces.

The presented approach combines factored MDP representations (Sections 3 and 4) and optimization techniques for solving large-scale structured linear programs (Section 6). This leads to various benefits. First, the quality and complexity of value function approximations are controlled by using basis functions (Section 3.2). Therefore, we can prevent an exponential blowup in the complexity of computations when other techniques cannot. Second, we always guarantee that HALP returns a solution. Its quality naturally depends on the choice of basis functions. As analyzed in Section 5.1, if these are selected appropriately, we achieve a close approximation to the optimal value function $V^*$. Third, a well-chosen class of basis functions yields closed-form solutions to the backprojections of our value functions (Section 5.2). This step is important for solving hybrid optimization problems more efficiently. Finally, solving hybrid factored MDPs reduces to building and satisfying relaxed formulations of the original problem (Section 6). The formulations can be solved efficiently by the cutting plane method, which has been studied extensively in applied mathematics and operations research.

For better readability of the paper, our proofs are deferred to Appendix A. The following notation is adopted throughout the work. Sets and their members are represented by capital and small italic letters as $S$ and $s$, respectively. Sets of variables, their subsets, and members of these sets are denoted by capital letters as $\mathbf{X}$, $\mathbf{X}_i$, and $X_i$. In general, corresponding small letters represent value assignments to these objects. The subscripted indices $D$ and $C$ denote the discrete and continuous variables in a variable set and its value assignment. The function $\mathrm{Dom}(\cdot)$ computes the domain of a variable or the domain of a function. The function $\mathrm{Par}(\cdot)$ returns the parent set of a variable in a graphical model (?, ?).

## 2 Markov Decision Processes

Markov decision processes (?) provide an elegant mathematical framework for modeling and solving sequential decision problems in the presence of uncertainty. Formally, a finite-state Markov decision process (MDP) is given by a 4-tuple $\mathcal{M} = (S, A, P, R)$, where $S$ is a set of states, $A$ is a set of actions, $P$ is a stochastic transition function of state dynamics conditioned on the preceding state and action, and $R$ is a reward function assigning immediate payoffs to state-action configurations. Without loss of generality, the reward function is assumed to be nonnegative and bounded from above by a constant $R_{\max}$ (?). Moreover, we assume that the transition and reward models are stationary and known a priori.

Once a decision problem is formulated as an MDP, the goal is to find a policy that maximizes some objective function. In this paper, the quality of a policy is measured by the infinite horizon discounted reward:

$$E_\pi\left[\sum_{t=0}^{\infty} \gamma^t R\big(s^{(t)}, \pi(s^{(t)})\big) \,\middle|\, s^{(0)} \sim \varphi\right], \tag{1}$$

where $\gamma \in [0, 1)$ is a discount factor, $s^{(t)}$ is the state at the time step $t$, and the expectation is taken with respect to all state-action trajectories that start in the states $s^{(0)}$ and follow the policy $\pi$ thereafter. The states $s^{(0)}$ are chosen according to a distribution $\varphi$. This optimality criterion assures that there exists an optimal policy $\pi^*$ which is stationary and deterministic (?). The policy is greedy with respect to the optimal value function $V^*$, which is a fixed point of the Bellman equation (?):

$$V^*(s) = \max_a \left[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s')\right]. \tag{2}$$
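As an illustration of Equation 2, the following minimal sketch repeatedly applies the Bellman backup to a two-state, two-action MDP until the values stop changing. The transition probabilities, rewards, and the names `backup` and `value_iteration` are assumptions made up for this example; they do not come from the paper.

```python
# Toy MDP; every number below is an illustrative assumption.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# P[s][a][s2]: probability of moving from s to s2 under action a.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5], 1: [0.1, 0.9]},
}
# R[s][a]: immediate reward (state 1 is the rewarding state).
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 1.0}}


def backup(V, s, a):
    """Right-hand side of Equation 2 for a fixed action a."""
    return R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in STATES)


def value_iteration(tol=1e-10):
    """Iterate V <- max_a backup(V, ., a) until the update is below tol."""
    V = [0.0 for _ in STATES]
    while True:
        V_new = [max(backup(V, s, a) for a in ACTIONS) for s in STATES]
        if max(abs(u - v) for u, v in zip(V, V_new)) < tol:
            return V_new
        V = V_new


V_star = value_iteration()
# Greedy policy with respect to the computed value function.
greedy = [max(ACTIONS, key=lambda a: backup(V_star, s, a)) for s in STATES]
```

In this toy model the greedy policy picks the action that moves toward the rewarding state in both states.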

The Bellman equation plays a fundamental role in all dynamic programming (DP) methods for solving MDPs (?, ?), including value iteration, policy iteration, and linear programming. The focus of this paper is on linear programming methods and their refinements. Briefly, it is well known that the optimal value function is a solution to the linear programming (LP) formulation (?):

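$$\begin{aligned}
\text{minimize} \quad & \sum_s \psi(s) V(s) \\
\text{subject to:} \quad & V(s) \ge R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \quad \forall\, s \in S,\ a \in A;
\end{aligned} \tag{3}$$

where $V(s)$ represents the variables in the LP, one for each state $s$, and $\psi(s) > 0$ is a strictly positive weighting on the state space $S$. The number of constraints equals the cardinality of the cross product of the state and action spaces $|S \times A|$.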

Linear programming and its efficient solutions have been studied extensively in applied mathematics and operations research (?). The simplex algorithm is a common way of solving LPs. Its worst-case time complexity is exponential in the number of variables. The ellipsoid method (?) offers polynomial time guarantees but it is impractical for solving LPs of even moderate size.

The LP formulation (3) can be solved compactly by the cutting plane method (?) if its objective function and constraint space are structured. Briefly, this method searches for violated constraints in relaxed formulations of the original LP. In every step, we start with a relaxed solution $V^{(t)}$, find a violated constraint given $V^{(t)}$, add it to the LP, and re-solve for a new vector $V^{(t+1)}$. The method is iterated until no violated constraint is found, so that $V^{(t)}$ is an optimal solution to the LP. The approach has the potential to solve large structured linear programs if we can identify violated constraints efficiently (?). The violated constraint and the method that found it are often referred to as a separating hyperplane and a separation oracle, respectively.
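The role of the separation oracle can be sketched on a toy problem. The code below only searches the constraints of formulation (3) by brute force and reports the most violated one for a candidate vector; it does not solve the relaxed LP itself, which would need an LP solver. The MDP numbers and the names `violation` and `separation_oracle` are hypothetical.

```python
# Toy MDP keyed by (state, action); all numbers are illustrative assumptions.
GAMMA = 0.9
P = {
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9],
}
R = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 2.0}


def violation(V, s, a):
    """How much V(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) V(s') is violated."""
    rhs = R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in enumerate(P[(s, a)]))
    return rhs - V[s]


def separation_oracle(V, eps=1e-9):
    """Return the most violated (state, action) constraint, or None."""
    worst = max(P, key=lambda sa: violation(V, *sa))
    return worst if violation(V, *worst) > eps else None


# An all-zero candidate violates the constraints with positive reward.
print(separation_oracle([0.0, 0.0]))
# A sufficiently large constant vector satisfies every constraint.
print(separation_oracle([25.0, 25.0]))
```

In a full cutting plane loop, the returned constraint would be added to the relaxed LP, the LP would be re-solved, and the oracle would be called again until it returns `None`.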

Delayed column generation is based on an idea similar to the cutting plane method, but it is applied to the column space of variables instead of the row space of constraints. Benders and Dantzig-Wolfe decompositions reflect the structure in the constraint space and are often used for solving large structured linear programs.

## 3 Discrete-State Factored MDPs

Many real-world decision problems are naturally described in a factored form. Discrete-state factored MDPs (?) allow for a compact representation of this structure.

### 3.1 Factored Transition and Reward Models

A discrete-state factored MDP (?) is a 4-tuple $\mathcal{M} = (\mathbf{X}, A, P, R)$, where $\mathbf{X} = \{X_1, \dots, X_n\}$ is a state space described by a set of state variables, $A$ is a set of actions,¹ $P$ is a stochastic transition model of state dynamics conditioned on the preceding state and action, and $R$ is a reward function assigning immediate payoffs to state-action configurations. The state of the system is completely observed and represented by a vector of value assignments $\mathbf{x} = (x_1, \dots, x_n)$. We assume that the values of every state variable $X_i$ are restricted to a finite domain $\mathrm{Dom}(X_i)$.

¹ For simplicity of exposition, we discuss a simpler model, which assumes a single action variable $A$ instead of the factored action space $\mathbf{A} = \{A_1, \dots, A_m\}$. Our conclusions in Sections 3.1 and 3.3 extend to MDPs with factored action spaces (?).

Transition model: The transition model is given by the conditional probability distribution $P(\mathbf{X}' \mid \mathbf{X}, A)$, where $\mathbf{X}$ and $\mathbf{X}'$ denote the state variables at two successive time steps. Since the complete tabular representation of $P(\mathbf{X}' \mid \mathbf{X}, A)$ is infeasible, we assume that the transition model factors along $\mathbf{X}'$ as:

$$P(\mathbf{X}' \mid \mathbf{X}, a) = \prod_{i=1}^{n} P(X_i' \mid \mathrm{Par}(X_i'), a) \tag{4}$$

and can be described compactly by a dynamic Bayesian network (DBN) (?). This DBN representation captures independencies among the state variables $\mathbf{X}$ and $\mathbf{X}'$ given an action $a$. One-step dynamics of every state variable is modeled by its conditional probability distribution $P(X_i' \mid \mathrm{Par}(X_i'), a)$, where $\mathrm{Par}(X_i') \subseteq \mathbf{X}$ denotes the parent set of $X_i'$. Typically, the parent set is a small subset of state variables, which simplifies the parameterization of the model. In principle, the parent set can be extended to the state variables $\mathbf{X}'$. Such an extension poses only a few new challenges when solving the new problems efficiently (?). Therefore, we omit the discussion on the modeling of intra-layer dependencies in this paper.

Reward model: The reward model factors similarly to the transition model. In particular, the reward function $R(\mathbf{x}, a) = \sum_j R_j(\mathbf{x}_j, a)$ is an additive function of local reward functions defined on the subsets $\mathbf{X}_j$ and $A$. In graphical models, the local functions can be described compactly by reward nodes $R_j$, which are conditioned on their parent sets $\mathrm{Par}(R_j)$. To allow this representation, we formally extend our DBN to an influence diagram (?).

###### Example 1 (?)

To illustrate the concept of a factored MDP, we consider a network administration problem, in which the computers are unreliable and fail. The failures of these computers propagate through network connections to the whole network. For instance, if the server $X_1$ (Figure 1a) is down, the chance that the neighboring computer $X_2$ crashes increases. The administrator can prevent the propagation of the failures by rebooting computers that have already crashed.

This network administration problem can be formulated as a factored MDP. The state of the network is completely observable and represented by $n$ binary variables $\mathbf{X} = \{X_1, \dots, X_n\}$, where the variable $X_i$ denotes the state of the $i$-th computer: 0 (being down) or 1 (running). At each time step, the administrator selects an action from the set $A = \{a_1, \dots, a_{n+1}\}$. The action $a_i$ ($i \le n$) corresponds to rebooting the $i$-th computer. The last action $a_{n+1}$ is a dummy action. The transition function reflects the propagation of failures in the network and can be encoded locally by conditioning on the parent set of every computer. A natural metric for evaluating the performance of an administrator is the total number of running computers. This metric factors along the computer states $x_i$ and can be represented compactly by an additive reward function:

$$R(\mathbf{x}, a) = 2x_1 + \sum_{j=2}^{n} x_j.$$

The weighting of states establishes our preferences for maintaining the server $X_1$ and workstations $X_2, \dots, X_n$. An example of transition and reward models after taking an action $a_1$ in the 4-ring topology (Figure 1a) is given in Figure 1b.
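The factorization in Equation 4 and the additive reward of Example 1 can be sketched for the 4-ring network as follows. Computers are indexed from 0 here, with computer 0 as the server, and each computer is assumed to depend on itself and its predecessor in the ring. The local probabilities are illustrative placeholders (they reuse the values 0.95, 0.90, 0.67, and 0.10 quoted for the ring later in the text), and the function names are hypothetical.

```python
from itertools import product

N = 4  # computers in the ring; index 0 plays the role of the server


def p_running(i, x, a):
    """Local model P(X'_i = 1 | Par(X'_i), a) with placeholder numbers."""
    if a == i:          # action a reboots computer i
        return 0.95
    own, neighbor = x[i], x[(i - 1) % N]
    if own == 0:        # a crashed computer rarely recovers by itself
        return 0.10
    return 0.90 if neighbor == 1 else 0.67


def transition_prob(x_next, x, a):
    """Equation 4: product of the local conditional probabilities."""
    prob = 1.0
    for i in range(N):
        p1 = p_running(i, x, a)
        prob *= p1 if x_next[i] == 1 else 1.0 - p1
    return prob


def reward(x):
    """Additive reward of Example 1: the server counts twice."""
    return 2 * x[0] + sum(x[1:])


def all_states():
    """Enumerate the 2^N joint states (only feasible for small N)."""
    return list(product((0, 1), repeat=N))
```

The local model needs a handful of numbers per computer, whereas a tabular transition function over the joint state would need one entry per pair of joint states and per action.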

### 3.2 Solving Discrete-State Factored MDPs

Markov decision processes can be solved by exact DP methods in polynomial time in the size of the state space (?). Unfortunately, factored state spaces are exponential in the number of state variables. Therefore, the DP methods are unsuitable for solving large factored MDPs. Since a factored representation of an MDP (Section 3.1) may not guarantee a structure in the optimal value function or policy (?), we resort to value function approximations to alleviate this concern.

Value function approximations have been successfully applied to a variety of real-world domains, including backgammon (?, ?, ?), elevator dispatching (?), and job-shop scheduling (?, ?). These partial successes suggest that approximate dynamic programming is a powerful tool for solving large optimization problems.

In this work, we focus on linear value function approximation (?, ?):

$$V^{\mathbf{w}}(\mathbf{x}) = \sum_i w_i f_i(\mathbf{x}). \tag{5}$$

The approximation restricts the form of the value function $V^{\mathbf{w}}$ to a linear combination of basis functions $f_i(\mathbf{x})$, where $\mathbf{w}$ is a vector of optimized weights. Every basis function can be defined over the complete state space $\mathbf{X}$, but usually is limited to a small subset of state variables $\mathbf{X}_i$ (?, ?). The role of basis functions is similar to that of features in machine learning. They are often provided by domain experts, although there is a growing amount of work on learning basis functions automatically (?, ?, ?, ?, ?).
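A minimal sketch of Equation 5 with basis functions of limited scope is given below. Each basis function is stored together with the indices of the state variables it reads; the particular functions, weights, and the name `value` are assumptions chosen for illustration.

```python
# Each entry is (scope, function): the function only sees the state
# variables whose indices are listed in its scope.
BASIS = [
    ((), lambda: 1.0),                           # constant basis function
    ((0,), lambda x0: float(x0)),                # indicator of variable 0
    ((1, 2), lambda x1, x2: float(x1 and x2)),   # pairwise feature
]


def value(w, x):
    """Linear value function V_w(x) = sum_i w_i * f_i(x_i)."""
    return sum(wi * f(*(x[j] for j in scope))
               for wi, (scope, f) in zip(w, BASIS))


weights = [0.5, 2.0, 1.0]
print(value(weights, (1, 1, 1)))
print(value(weights, (0, 1, 0)))
```

Because every function touches only a few variables, sums and maximizations involving it can be carried out over its scope rather than over the full joint state.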

###### Example 2

To demonstrate the concept of the linear value function model, we consider the network administration problem (Example 1) and assume a low chance of a single computer failing. Then the value function in Figure 1c is sufficient to derive a close-to-optimal policy on the 4-ring topology (Figure 1a) because the indicator functions $f_i(\mathbf{x}) = x_i$ capture changes in the states of individual computers. For instance, if the computer $X_i$ fails, the linear policy:

$$u(\mathbf{x}) = \arg\max_a \left[R(\mathbf{x}, a) + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, a) V^{\mathbf{w}}(\mathbf{x}')\right]$$

immediately leads to rebooting it. If the failure has already propagated to the computer $X_{i+1}$, the policy recovers it in the next step. This procedure is repeated until the spread of the initial failure is stopped.

### 3.3 Approximate Linear Programming

Various methods for fitting of the linear value function approximation have been proposed and analyzed (?). We focus on approximate linear programming (ALP) (?), which recasts this problem as a linear program:

$$\begin{aligned}
\text{minimize}_{\mathbf{w}} \quad & \sum_{\mathbf{x}} \psi(\mathbf{x}) \sum_i w_i f_i(\mathbf{x}) \\
\text{subject to:} \quad & \sum_i w_i f_i(\mathbf{x}) \ge R(\mathbf{x}, a) + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, a) \sum_i w_i f_i(\mathbf{x}') \quad \forall\, \mathbf{x} \in \mathbf{X},\ a \in A;
\end{aligned} \tag{6}$$

where $\mathbf{w}$ represents the variables in the LP, $\psi(\mathbf{x})$ are state relevance weights weighting the quality of the approximation, and $\gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, a) \sum_i w_i f_i(\mathbf{x}')$ is a discounted backprojection of the value function $V^{\mathbf{w}}$ (Equation 5). The ALP formulation can be easily derived from the standard LP formulation (3) by substituting $V^{\mathbf{w}}(\mathbf{x})$ for $V(s)$. The formulation is feasible if the set of basis functions contains a constant function $f_0(\mathbf{x}) \equiv 1$. We assume that such a basis function is always present. Note that the state relevance weights are no longer required to be strictly positive (Section 2). Compared to the standard LP formulation (3), which is solved by the optimal value function $V^*$ for arbitrary weights $\psi(s) > 0$, a solution to the ALP formulation depends on the weights $\psi(\mathbf{x})$. Intuitively, the higher the weight of a state, the higher the quality of the approximation in that state.

Since our basis functions are usually restricted to subsets of state variables (Section 3.2), summation terms in the ALP formulation can be computed efficiently (?, ?). For example, the order of summation in the backprojection term can be rearranged as $\sum_i w_i \sum_{\mathbf{x}_i'} P(\mathbf{x}_i' \mid \mathbf{x}, a) f_i(\mathbf{x}_i')$, which allows its aggregation in the space of $\mathbf{X}_i'$ instead of $\mathbf{X}'$. Similarly, a factored form of $\psi(\mathbf{x})$ yields an efficiently computable objective function (?).

The number of constraints in the ALP formulation is exponential in the number of state variables. Fortunately, the constraints are structured. This results from combining factored transition and reward models (Section 3.1) with the linear approximation (Equation 5). As a consequence, the constraints can be satisfied without enumerating them exhaustively.

###### Example 3

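The notion of a factored constraint space is important for compact satisfaction of exponentially many constraints. To illustrate this concept, let us consider the linear value function (Example 2) on the 4-ring network administration problem (Example 1). Intuitively, by combining the graphical representations of $P(\mathbf{x}' \mid \mathbf{x}, a_1)$, $R(\mathbf{x}, a_1)$ (Figure 1b), and $V^{\mathbf{w}}(\mathbf{x})$ (Figure 1c), we obtain a factored model of constraint violations:

$$\begin{aligned}
\tau^{\mathbf{w}}(\mathbf{x}, a_1) &= V^{\mathbf{w}}(\mathbf{x}) - \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, a_1) V^{\mathbf{w}}(\mathbf{x}') - R(\mathbf{x}, a_1) \\
&= \sum_i w_i f_i(\mathbf{x}) - \gamma \sum_i w_i \sum_{x_i'} P(x_i' \mid \mathbf{x}, a_1) f_i(x_i') - R(\mathbf{x}, a_1) \\
&= w_0 + \sum_{i=1}^{4} w_i x_i - \gamma w_0 - \gamma w_1 P(x_1' = 1 \mid a_1) - \gamma \sum_{i=2}^{4} w_i P(x_i' = 1 \mid x_i, x_{i-1}, a_1) - 2x_1 - \sum_{j=2}^{4} x_j
\end{aligned}$$

for an arbitrary solution $\mathbf{w}$ (Figure 2a). Note that this cost function:

$$\tau^{\mathbf{w}}(\mathbf{x}, a_1) = \phi^{\mathbf{w}} + \sum_{i=1}^{4} \phi^{\mathbf{w}}(x_i) + \sum_{i=2}^{4} \phi^{\mathbf{w}}(x_i, x_{i-1})$$

is a linear combination of a term $\phi^{\mathbf{w}}$ that is constant in $\mathbf{x}$, and univariate and bivariate functions $\phi^{\mathbf{w}}(x_i)$ and $\phi^{\mathbf{w}}(x_i, x_{i-1})$. It can be represented compactly by a cost network (?), which is an undirected graph over a set of variables $\mathbf{X}$. Two nodes in the graph are connected if any of the cost terms depends on both variables. Therefore, the cost network corresponding to the function $\tau^{\mathbf{w}}(\mathbf{x}, a_1)$ must contain edges $X_1 - X_2$, $X_2 - X_3$, and $X_3 - X_4$ (Figure 2b).

Savings achieved by the compact representation of constraints are related to the efficiency of computing $\arg\min_{\mathbf{x}} \tau^{\mathbf{w}}(\mathbf{x}, a_1)$ (?). This computation can be done by variable elimination and its complexity increases exponentially in the width of the tree decomposition of the cost network. The smallest width of all tree decompositions is referred to as treewidth.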

Inspired by the factorization, ? (?) proposed a variable-elimination method (?) that rewrites the constraint space in ALP compactly. ? (?) solved the same problem by the cutting plane method. The method iteratively searches for the most violated constraint:

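$$\arg\min_{\mathbf{x}, a} \left[\sum_i w_i^{(t)} \left[f_i(\mathbf{x}_i) - \gamma \sum_{\mathbf{x}_i'} P(\mathbf{x}_i' \mid \mathbf{x}, a) f_i(\mathbf{x}_i')\right] - R(\mathbf{x}, a)\right] \tag{7}$$

with respect to the solution $\mathbf{w}^{(t)}$ of a relaxed ALP. The constraint is added to the LP, which is re-solved for a new solution $\mathbf{w}^{(t+1)}$. This procedure is iterated until no violated constraint is found, so that $\mathbf{w}^{(t)}$ is an optimal solution to the ALP.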
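The search in Equation 7 can be sketched by brute force on the 4-ring problem, using the constant and indicator basis functions of Example 2. Enumeration stands in for the variable elimination or structured search that makes the step efficient in practice; the local probabilities, the discount factor, and the names `tau` and `most_violated` are hypothetical.

```python
from itertools import product

N, GAMMA = 4, 0.95   # ring size and an assumed discount factor


def p_running(i, x, a):
    """Placeholder local model P(X'_i = 1 | x_i, x_{i-1}, a)."""
    if a == i:
        return 0.95
    if x[i] == 0:
        return 0.10
    return 0.90 if x[(i - 1) % N] == 1 else 0.67


def reward(x):
    return 2 * x[0] + sum(x[1:])


def tau(w, x, a):
    """V_w(x) - gamma * E[V_w(x')] - R(x, a), where
    V_w(x) = w[0] + sum_i w[i + 1] * x[i]."""
    v = w[0] + sum(w[i + 1] * x[i] for i in range(N))
    # The backprojection of the indicator f_i(x') = x'_i is simply
    # the local probability P(x'_i = 1 | x, a).
    back = w[0] + sum(w[i + 1] * p_running(i, x, a) for i in range(N))
    return v - GAMMA * back - reward(x)


def most_violated(w):
    """Equation 7 by enumeration; action N is the dummy action."""
    candidates = ((x, a) for x in product((0, 1), repeat=N)
                  for a in range(N + 1))
    return min(candidates, key=lambda xa: tau(w, *xa))
```

A negative value of `tau` at the returned pair means that the corresponding constraint is violated and should be added to the relaxed ALP.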

The quality of the ALP formulation has been studied by ? (?). Based on their work, we conclude that ALP yields a close approximation to the optimal value function $V^*$ if the weighted max-norm error $\|V^* - V^{\mathbf{w}}\|_{\infty, 1/L}$ can be minimized. We return to this theoretical result in Section 5.1.

###### Theorem 1 (?)

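Let $\tilde{\mathbf{w}}$ be a solution to the ALP formulation (6). Then the expected error of the value function $V^{\tilde{\mathbf{w}}}$ can be bounded as:

$$\left\|V^* - V^{\tilde{\mathbf{w}}}\right\|_{1,\psi} \le \frac{2\psi^{\mathsf{T}} L}{1 - \kappa} \min_{\mathbf{w}} \left\|V^* - V^{\mathbf{w}}\right\|_{\infty, 1/L},$$

where $\|\cdot\|_{1,\psi}$ is an $L_1$-norm weighted by the state relevance weights $\psi$, $L$ is a Lyapunov function such that the inequality $\kappa L(\mathbf{x}) \ge \gamma \max_a \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, a) L(\mathbf{x}')$ holds, $\kappa \in [0, 1)$ denotes its contraction factor, and $\|\cdot\|_{\infty, 1/L}$ is a max-norm reweighted by the reciprocal $1/L$.

Note that the $L_1$-norm distance $\|V^* - V^{\tilde{\mathbf{w}}}\|_{1,\psi}$ equals the expectation of $|V^* - V^{\tilde{\mathbf{w}}}|$ over the state space with respect to the state relevance weights $\psi$. Similarly to Theorem 1, we utilize the $L_1$ and $L_\infty$ norms in the rest of the work to measure the expected and worst-case errors of value functions. These norms are defined as follows.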

###### Definition 1

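The $L_1$ (Manhattan) and $L_\infty$ (infinity) norms are typically defined as $\|f\|_1 = \sum_{\mathbf{x}} |f(\mathbf{x})|$ and $\|f\|_\infty = \max_{\mathbf{x}} |f(\mathbf{x})|$. If the state space is represented by both discrete and continuous variables $\mathbf{X}_D$ and $\mathbf{X}_C$, the definition of the norms changes accordingly:

$$\|f\|_1 = \sum_{\mathbf{x}_D} \int_{\mathbf{x}_C} |f(\mathbf{x})| \, d\mathbf{x}_C \quad \text{and} \quad \|f\|_\infty = \sup_{\mathbf{x}} |f(\mathbf{x})|. \tag{8}$$

The following definitions:

$$\|f\|_{1,\psi} = \sum_{\mathbf{x}_D} \int_{\mathbf{x}_C} \psi(\mathbf{x}) |f(\mathbf{x})| \, d\mathbf{x}_C \quad \text{and} \quad \|f\|_{\infty,\psi} = \sup_{\mathbf{x}} \psi(\mathbf{x}) |f(\mathbf{x})| \tag{9}$$

correspond to the $L_1$ and $L_\infty$ norms reweighted by a function $\psi(\mathbf{x})$.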
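For intuition, the hybrid norms in Equations 8 and 9 can be approximated numerically for a function of one discrete and one continuous variable. The sketch assumes that the continuous variable lives on $[0, 1]$, replaces the integral by the midpoint rule and the supremum by a maximum over a grid, and uses hypothetical names `l1_norm` and `max_norm`.

```python
def l1_norm(f, discrete_values, psi=lambda d, c: 1.0, n=1000):
    """Approximate sum_d int_0^1 psi(d, c) * |f(d, c)| dc (midpoint rule)."""
    h = 1.0 / n
    return sum(psi(d, (k + 0.5) * h) * abs(f(d, (k + 0.5) * h)) * h
               for d in discrete_values for k in range(n))


def max_norm(f, discrete_values, psi=lambda d, c: 1.0, n=1000):
    """Approximate sup_x psi(x) * |f(x)| by a maximum over grid points."""
    return max(psi(d, k / n) * abs(f(d, k / n))
               for d in discrete_values for k in range(n + 1))


# f(d, c) = d - c with a binary discrete variable d and c in [0, 1].
f = lambda d, c: d - c
print(l1_norm(f, (0, 1)), max_norm(f, (0, 1)))
```

For this `f`, each discrete value contributes an integral of 1/2 to the unweighted $L_1$ norm, and the largest absolute value is 1.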

## 4 Hybrid Factored MDPs

Discrete-state factored MDPs (Section 3) permit a compact representation of decision problems with discrete states. However, real-world domains often involve continuous quantities, such as temperature and pressure. A sufficient discretization of these quantities may require hundreds of points in a single dimension, which renders the representation of our transition model (Equation 4) infeasible. In addition, rough and uninformative discretization impacts the quality of policies. Therefore, we want to avoid discretization or defer it until necessary. As a step in this direction, we discuss a formalism for representing hybrid decision problems in the domains of discrete and continuous variables.

### 4.1 Factored Transition and Reward Models

A hybrid factored MDP (HMDP) is a 4-tuple $\mathcal{M} = (\mathbf{X}, \mathbf{A}, P, R)$, where $\mathbf{X}$ is a state space described by state variables, $\mathbf{A}$ is an action space described by action variables, $P(\mathbf{X}' \mid \mathbf{X}, \mathbf{A})$ is a stochastic transition model of state dynamics conditioned on the preceding state and action, and $R$ is a reward function assigning immediate payoffs to state-action configurations.²

² General state and action space MDP is an alternative term for a hybrid MDP. The term hybrid does not refer to the dynamics of the model, which is discrete-time.

State variables: State variables are either discrete or continuous. Every discrete variable $X_i$ takes on values from a finite domain $\mathrm{Dom}(X_i)$. Following ? (?), we assume that every continuous variable is bounded to the $[0, 1]$ subspace. In general, this assumption is very mild and permits modeling of any closed interval on $\mathbb{R}$. The state of the system is completely observed and described by a vector of value assignments $\mathbf{x} = (\mathbf{x}_D, \mathbf{x}_C)$ which partitions along its discrete and continuous components $\mathbf{x}_D$ and $\mathbf{x}_C$.

Action variables: The action space is distributed and represented by action variables $\mathbf{A} = \{A_1, \dots, A_m\}$. The composite action is defined by a vector of individual action choices $\mathbf{a} = (\mathbf{a}_D, \mathbf{a}_C)$ which partitions along its discrete and continuous components $\mathbf{a}_D$ and $\mathbf{a}_C$.

Transition model: The transition model is given by the conditional probability distribution $P(\mathbf{X}' \mid \mathbf{X}, \mathbf{A})$, where $\mathbf{X}$ and $\mathbf{X}'$ denote the state variables at two successive time steps. We assume that this distribution factors along $\mathbf{X}'$ as $P(\mathbf{X}' \mid \mathbf{X}, \mathbf{A}) = \prod_{i=1}^{n} P(X_i' \mid \mathrm{Par}(X_i'))$ and can be described compactly by a DBN (?). Typically, the parent set $\mathrm{Par}(X_i') \subseteq \mathbf{X} \cup \mathbf{A}$ is a small subset of state and action variables, which allows for a local parameterization of the transition model.

Parameterization of our transition model: One-step dynamics of every state variable is described by its conditional probability distribution $P(X_i' \mid \mathrm{Par}(X_i'))$. If $X_i'$ is a continuous variable, its transition function is represented by a mixture of beta distributions (?):

$$\begin{aligned}
P(X_i' = x \mid \mathrm{Par}(X_i')) &= \sum_j \pi_{ij} P_{\text{beta}}(x \mid \alpha_j, \beta_j) \\
P_{\text{beta}}(x \mid \alpha, \beta) &= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1},
\end{aligned} \tag{10}$$

where $\pi_{ij}$ is the weight assigned to the $j$-th component of the mixture, and $\alpha_j$ and $\beta_j$ are arbitrary positive functions of the parent set. The mixture of beta distributions provides a very general class of transition functions and yet allows closed-form solutions³ to the expectation terms in HALP (Section 5). If every $\beta_j = 1$, Equation 10 turns into a polynomial in $X_i'$. Due to the Weierstrass approximation theorem (?), such a polynomial is sufficient to approximate any continuous transition density over $X_i'$ with any precision. If $X_i'$ is a discrete variable, its transition model is parameterized by nonnegative discriminant functions $\theta_j$ (?):

$$P(X_i' = j \mid \mathrm{Par}(X_i')) = \frac{\theta_j}{\sum_{k=1}^{|\mathrm{Dom}(X_i')|} \theta_k}. \tag{11}$$

³ The term closed-form refers to a generally accepted set of closed-form operations and functions extended by the gamma and incomplete beta functions.

Note that the parameters $\alpha_j$, $\beta_j$, and $\theta_j$ (Equations 10 and 11) are functions instantiated by value assignments to the variables $\mathrm{Par}(X_i')$. We keep separate parameters for every state variable $X_i'$ although our indexing does not reflect this explicitly. The only restriction on the functions is that they return valid parameters for all state-action pairs $(\mathbf{x}, \mathbf{a})$. Hence, we assume that $\pi_{ij} \ge 0$, $\alpha_j > 0$, $\beta_j > 0$, and $\theta_j \ge 0$.
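Equations 10 and 11 translate directly into a few lines of code. The sketch below evaluates a beta mixture density with the standard-library gamma function and normalizes a vector of discriminant values; the mixture components in the usage lines and the function names are illustrative assumptions.

```python
import math


def beta_pdf(x, alpha, beta):
    """Beta density from Equation 10 for x in (0, 1)."""
    coef = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return coef * x ** (alpha - 1) * (1 - x) ** (beta - 1)


def beta_mixture_pdf(x, components):
    """components: list of (weight, alpha, beta) with weights summing to 1."""
    return sum(w * beta_pdf(x, a, b) for w, a, b in components)


def discrete_transition(thetas):
    """Equation 11: normalize nonnegative discriminant values."""
    total = sum(thetas)
    return [t / total for t in thetas]


mixture = [(0.3, 2, 5), (0.7, 20, 2)]      # hypothetical parameters
print(beta_mixture_pdf(0.9, mixture))
print(discrete_transition([1, 3]))
```

In a full model the weights and the parameters would themselves be functions of the parent assignment; here they are fixed numbers to keep the example short.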

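Reward model: The reward model factors similarly to the transition model. In particular, the reward function $R(\mathbf{x}, \mathbf{a}) = \sum_j R_j(\mathbf{x}_j, \mathbf{a}_j)$ is an additive function of local reward functions defined on the subsets $\mathbf{X}_j$ and $\mathbf{A}_j$. In graphical models, the local functions can be described compactly by reward nodes $R_j$, which are conditioned on their parent sets $\mathrm{Par}(R_j) = \mathbf{X}_j \cup \mathbf{A}_j$. To allow this representation, we formally extend our DBN to an influence diagram (?). Note that the form of the reward functions $R_j(\mathbf{x}_j, \mathbf{a}_j)$ is not restricted.

Optimal value function and policy: The optimal policy $\pi^*$ can be defined greedily with respect to the optimal value function $V^*$, which is a fixed point of the Bellman equation:

$$\begin{aligned}
V^*(\mathbf{x}) &= \sup_{\mathbf{a}} \left[R(\mathbf{x}, \mathbf{a}) + \gamma E_{P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})}[V^*(\mathbf{x}')]\right] \\
&= \sup_{\mathbf{a}} \left[R(\mathbf{x}, \mathbf{a}) + \gamma \sum_{\mathbf{x}_D'} \int_{\mathbf{x}_C'} P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a}) V^*(\mathbf{x}') \, d\mathbf{x}_C'\right].
\end{aligned} \tag{12}$$

Accordingly, the hybrid Bellman operator $T^*$ is given by:

$$T^* V(\mathbf{x}) = \sup_{\mathbf{a}} \left[R(\mathbf{x}, \mathbf{a}) + \gamma E_{P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})}[V(\mathbf{x}')]\right]. \tag{13}$$

In the rest of the paper, we denote expectation terms over discrete and continuous variables in a unified form:

$$E_{P(\mathbf{x})}[f(\mathbf{x})] = \sum_{\mathbf{x}_D} \int_{\mathbf{x}_C} P(\mathbf{x}) f(\mathbf{x}) \, d\mathbf{x}_C. \tag{14}$$

###### Example 4 (?)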

Continuous-state network administration is a variation on Example 1, where the computer states are represented by continuous variables on the interval between 0 (being down) and 1 (running). At each time step, the administrator selects a single action from the set $A = \{a_1, \dots, a_{n+1}\}$. The action $a_i$ ($i \le n$) corresponds to rebooting the $i$-th computer. The last action $a_{n+1}$ is a dummy action. The transition model captures the propagation of failures in the network and is encoded locally by beta distributions:

$$P(X_i' = x \mid \mathrm{Par}(X_i')) = P_{\text{beta}}(x \mid \alpha, \beta), \qquad
\begin{cases}
\alpha = 20, \quad \beta = 2 & a = i \\
\alpha = 2 + 13x_i - 5x_i E[\mathrm{Par}(X_i')], \quad \beta = 10 - 2x_i - 6x_i E[\mathrm{Par}(X_i')] & a \ne i
\end{cases}$$

where the variables $x_i$ and $E[\mathrm{Par}(X_i')]$ denote the state of the $i$-th computer and the expected state of its parents. Note that this transition function is similar to Example 1. For instance, in the 4-ring topology, the modes of transition densities for continuous variables $X_1'$ and $X_2'$ after taking an action $a_1$ (Figure 3):

$$\begin{aligned}
\hat{P}(X_1' \mid a = a_1) &= 0.95 & \hat{P}(X_2' \mid X_2 = 1, X_1 = 0, a = a_1) &\approx 0.67 \\
\hat{P}(X_2' \mid X_2 = 0, a = a_1) &= 0.10 & \hat{P}(X_2' \mid X_2 = 1, X_1 = 1, a = a_1) &= 0.90
\end{aligned}$$

are equal to the expected values of their discrete counterparts (Figure 1b). The reward function is additive:

$$R(\mathbf{x}, a) = 2x_1^2 + \sum_{j=2}^{n} x_j^2$$

and establishes our preferences for maintaining the server $X_1$ and workstations $X_2, \dots, X_n$.
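The modes quoted in Example 4 can be checked from the beta parameters with the standard formula for the mode of a beta distribution, $(\alpha - 1)/(\alpha + \beta - 2)$, which holds for $\alpha, \beta > 1$. The sketch below reads $E[\mathrm{Par}(X_i')]$ as the state of the single ring neighbor, which is an interpretation that reproduces the listed numbers; the function names are hypothetical.

```python
def beta_params(x_i, parent_mean, rebooted):
    """Beta parameters of Example 4 for one computer."""
    if rebooted:                       # the action reboots this computer
        return 20.0, 2.0
    alpha = 2 + 13 * x_i - 5 * x_i * parent_mean
    beta = 10 - 2 * x_i - 6 * x_i * parent_mean
    return alpha, beta


def beta_mode(alpha, beta):
    """Mode of a beta density; valid when alpha > 1 and beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)


# Server X1 after being rebooted.
print(beta_mode(*beta_params(1.0, 1.0, rebooted=True)))
# Workstation X2 running while its neighbor X1 is down.
print(beta_mode(*beta_params(1.0, 0.0, rebooted=False)))
# Workstation X2 down (the neighbor state does not matter).
print(beta_mode(*beta_params(0.0, 0.0, rebooted=False)))
# Workstation X2 and its neighbor X1 both running.
print(beta_mode(*beta_params(1.0, 1.0, rebooted=False)))
```

The four cases give 19/20, 14/21, 1/10, and 9/10, matching the values 0.95, 0.67, 0.10, and 0.90 in the example.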

### 4.2 Solving Hybrid Factored MDPs

Value iteration, policy iteration, and linear programming are the most fundamental dynamic programming methods for solving MDPs (?, ?). Unfortunately, none of these techniques is suitable for solving hybrid factored MDPs. First, their complexity is exponential in the number of state variables if the variables are discrete. Second, the methods assume a finite support for the optimal value function or policy, which may not exist if continuous variables are present. Therefore, any feasible approach to solving arbitrary HMDPs is likely to be approximate. In the rest of the section, we review two major classes of methods for approximating value functions in hybrid domains.

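Grid-based approximation: Grid-based methods (?, ?) transform the initial state space $\mathbf{X}$ into a set of grid points $G = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}$. The points are used to estimate the optimal value function $V_G^*$ on the grid, which in turn approximates $V^*$. The Bellman operator on the grid is defined as (?):

$$T_G^* V(\mathbf{x}^{(i)}) = \max_{\mathbf{a}} \left[R(\mathbf{x}^{(i)}, \mathbf{a}) + \gamma \sum_{j=1}^{N} P_G(\mathbf{x}^{(j)} \mid \mathbf{x}^{(i)}, \mathbf{a}) V(\mathbf{x}^{(j)})\right], \tag{15}$$

where $P_G(\mathbf{x}^{(j)} \mid \mathbf{x}^{(i)}, \mathbf{a}) = \Psi_{\mathbf{a}}^{-1}(\mathbf{x}^{(i)}) P(\mathbf{x}^{(j)} \mid \mathbf{x}^{(i)}, \mathbf{a})$ is a transition function, which is normalized by the term $\Psi_{\mathbf{a}}(\mathbf{x}^{(i)}) = \sum_{j=1}^{N} P(\mathbf{x}^{(j)} \mid \mathbf{x}^{(i)}, \mathbf{a})$. The operator $T_G^*$ allows the computation of the value function $V_G^*$ by standard techniques for solving discrete-state MDPs.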
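A minimal sketch of the grid operator in Equation 15 for a single continuous state variable on $[0, 1]$ is shown below. The transition densities are evaluated at the grid points and renormalized so that each row sums to one, and the operator is then iterated as in a discrete MDP. The one-computer model, its parameters, the reboot cost, and all names are assumptions made up for this illustration.

```python
import math

GAMMA = 0.95
GRID = [(k + 0.5) / 10 for k in range(10)]   # N = 10 grid points in [0, 1]
ACTIONS = ("reboot", "wait")


def beta_pdf(x, alpha, beta):
    coef = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return coef * x ** (alpha - 1) * (1 - x) ** (beta - 1)


def density(x_next, x, a):
    """Hypothetical transition density of a single computer."""
    if a == "reboot":
        return beta_pdf(x_next, 20, 2)
    return beta_pdf(x_next, 2 + 8 * x, 10 - 8 * x)


def reward(x, a):
    return x ** 2 - (0.2 if a == "reboot" else 0.0)   # assumed reboot cost


def grid_transition(x, a):
    """P_G(. | x, a): densities at the grid points, normalized to sum to 1."""
    dens = [density(xj, x, a) for xj in GRID]
    total = sum(dens)
    return [d / total for d in dens]


P_G = {(i, a): grid_transition(x, a)
       for i, x in enumerate(GRID) for a in ACTIONS}


def grid_backup(V):
    """One application of the grid Bellman operator (Equation 15)."""
    return [max(reward(x, a)
                + GAMMA * sum(p * v for p, v in zip(P_G[(i, a)], V))
                for a in ACTIONS)
            for i, x in enumerate(GRID)]


V = [0.0] * len(GRID)
for _ in range(600):      # enough sweeps for the contraction to converge
    V = grid_backup(V)
```

The number of grid points grows exponentially with the number of continuous variables under uniform discretization, which is the blowup that the text discusses next.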

? (?) analyzed the convergence of these methods for random and pseudo-random samples. Clearly, a uniform discretization of increasing precision guarantees the convergence of $V_G^*$ to $V^*$ but causes an exponential blowup in the state space (?). To overcome this concern, ? (?) proposed an adaptive algorithm for non-uniform discretization based on the Kuhn triangulation. ? (?) analyzed metrics for aggregating states in continuous-state MDPs based on the notion of bisimulation. ? (?) used linear programming to solve low-dimensional problems with continuous variables. These continuous variables were discretized manually.

Parametric value function approximation: An alternative approach to solving factored MDPs with continuous-state components is the approximation of the optimal value function $V^*$ by some parameterized model (?, ?, ?). The parameters are typically optimized iteratively by applying the backup operator $T^*$ to a finite set of states. The least-squares error is a commonly minimized error metric (Figure 4). Online updating by gradient methods (?, ?) is another way of optimizing value functions. The limitation of these techniques is that their solutions are often unstable and may diverge (?). On the other hand, they generate high-quality approximations.

Parametric approximations often assume fixed value function models. However, in some cases, it is possible to derive flexible forms of the value function that combine well with the backup operator $T^*$. For instance, ? (?) showed that convex piecewise linear functions are sufficient to represent value functions and their DP backups in partially-observable MDPs (POMDPs) (?, ?). Based on this idea, ? (?) proposed a method for solving MDPs with continuous variables. To obtain full DP backups, the value function approximation is restricted to rectangular piecewise linear and convex (RPWLC) functions. Further restrictions are placed on the transition and reward models of MDPs. The advantage of the approach is its adaptivity. The major disadvantages are restrictions on solved MDPs and the complexity of RPWLC value functions, which may grow exponentially in the number of backups. As a result, without further modifications, this approach is less likely to succeed in solving high-dimensional and distributed decision problems.

## 5 Hybrid Approximate Linear Programming

To overcome the limitations of existing methods for solving HMDPs (Section 4.2), we extend the discrete-state ALP (Section 3.3) to hybrid state and action spaces. We refer to this novel framework as hybrid approximate linear programming (HALP).

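Similarly to the discrete-state ALP, HALP optimizes the linear value function approximation (Equation 5). Therefore, it transforms an initially intractable problem of computing $V^*$ in the hybrid state space $\mathbf{X}$ into a lower dimensional space of $\mathbf{w}$. The HALP formulation is given by a linear program⁴:

$$\begin{aligned}
\text{minimize}_{\mathbf{w}} \quad & \sum_i w_i \alpha_i \\
\text{subject to:} \quad & \sum_i w_i F_i(\mathbf{x}, \mathbf{a}) - R(\mathbf{x}, \mathbf{a}) \ge 0 \quad \forall\, \mathbf{x} \in \mathbf{X},\ \mathbf{a} \in \mathbf{A};
\end{aligned} \tag{16}$$

⁴ More precisely, the HALP formulation (16) is a linear semi-infinite optimization problem with an infinite number of constraints. The number of basis functions is finite. For brevity, we refer to this optimization problem as linear programming.

where $\mathbf{w}$ represents the variables in the LP, $\alpha_i$ denotes the basis function relevance weight:

$$\begin{aligned}
\alpha_i &= E_{\psi(\mathbf{x})}[f_i(\mathbf{x})] \\
&= \sum_{\mathbf{x}_D} \int_{\mathbf{x}_C} \psi(\mathbf{x}) f_i(\mathbf{x}) \, d\mathbf{x}_C,
\end{aligned} \tag{17}$$

$\psi(\mathbf{x}) \ge 0$ is a state relevance density function that weights the quality of the approximation, and $F_i(\mathbf{x}, \mathbf{a}) = f_i(\mathbf{x}) - \gamma g_i(\mathbf{x}, \mathbf{a})$ denotes the difference between the basis function $f_i(\mathbf{x})$ and its discounted backprojection:

$$\begin{aligned}
g_i(\mathbf{x}, \mathbf{a}) &= E_{P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})}[f_i(\mathbf{x}')] \\
&= \sum_{\mathbf{x}_D'} \int_{\mathbf{x}_C'} P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a}) f_i(\mathbf{x}') \, d\mathbf{x}_C'.
\end{aligned} \tag{18}$$

Vectors $\mathbf{x}_D$ ($\mathbf{x}_D'$) and $\mathbf{x}_C$ ($\mathbf{x}_C'$) are the discrete and continuous components of value assignments $\mathbf{x}$ ($\mathbf{x}'$) to all state variables $\mathbf{X}$ ($\mathbf{X}'$). The linear program can be rewritten compactly:

$$\begin{aligned}
\text{minimize}_{\mathbf{w}} \quad & E_\psi[V^{\mathbf{w}}] \\
\text{subject to:} \quad & V^{\mathbf{w}} - T^* V^{\mathbf{w}} \ge 0
\end{aligned} \tag{19}$$

by using the Bellman operator $T^*$.

The HALP formulation reduces to the discrete-state ALP (Section 3.3) if the state and action variables are discrete, and to the continuous-state ALP (?) if the state variables are continuous. The formulation is feasible if the set of basis functions contains a constant function $f_0(\mathbf{x}) \equiv 1$. We assume that such a basis function is present.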
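To make the terms in Equations 16 to 18 concrete, consider a single continuous state variable on $[0, 1]$ with monomial basis functions $f_k(x) = x^k$. This choice of basis, the uniform relevance density, the beta transition parameters, and the function names below are assumptions for illustration. Under a uniform $\psi$, the relevance weight of $f_k$ is $1/(k+1)$, and under a beta transition density the backprojection is the $k$-th raw moment of the beta distribution, $\prod_{j=0}^{k-1} (\alpha + j)/(\alpha + \beta + j)$, which is a standard closed form.

```python
def relevance_weight(k):
    """alpha_k = int_0^1 x^k dx for a uniform relevance density."""
    return 1.0 / (k + 1)


def beta_moment(k, alpha, beta):
    """g_k = E[x'^k] under Beta(alpha, beta): the k-th raw moment."""
    moment = 1.0
    for j in range(k):
        moment *= (alpha + j) / (alpha + beta + j)
    return moment


def constraint_lhs(w, x, alpha, beta, gamma=0.95):
    """sum_k w_k * F_k(x, a) with F_k = f_k(x) - gamma * g_k(x, a)."""
    return sum(wk * (x ** k - gamma * beta_moment(k, alpha, beta))
               for k, wk in enumerate(w))


# Objective coefficients of HALP for the basis {1, x, x^2}.
print([relevance_weight(k) for k in range(3)])
# Left-hand side of one constraint for weights w at state x = 0.3,
# with a hypothetical action whose transition density is Beta(2, 3).
print(constraint_lhs([1.0, 0.5, 0.25], 0.3, 2, 3))
```

Each state-action pair contributes one such linear constraint in $\mathbf{w}$; the constant basis function alone yields the value $1 - \gamma$ on the left-hand side, which is why it can always be scaled up to make the program feasible.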

In the rest of the paper, we address several concerns related to the HALP formulation. First, we analyze the quality of this approximation and relate it to the minimization of the max-norm error $\|V^* - V^{\mathbf{w}}\|_\infty$, which is a commonly used metric (Section 5.1). Second, we present rich classes of basis functions that lead to closed-form solutions to the expectation terms in the objective function and constraints (Equations 17 and 18). These terms involve sums and integrals over the complete state space $\mathbf{X}$ (Section 5.2), and therefore are hard to evaluate. Finally, we discuss approximations to the constraint space in HALP and introduce a framework for solving HALP formulations in a unified way (Section 6). Note that complete satisfaction of this constraint space may not be possible since every state-action pair $(\mathbf{x}, \mathbf{a})$ induces a constraint.

### 5.1 Error Bounds

The quality of the ALP approximation (Section 3.3) has been studied by ? (?). We follow up on their work and extend it to structured state and action spaces with continuous variables. Before we proceed, we demonstrate that a solution to the HALP formulation (16) constitutes an upper bound on the optimal value function $V^*$.

###### Proposition 1

Let $\tilde{\mathbf{w}}$ be a solution to the HALP formulation (16). Then $V^{\tilde{\mathbf{w}}} \ge V^*$.

This result allows us to restate the objective in HALP.

###### Proposition 2

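Vector $\tilde{\mathbf{w}}$ is a solution to the HALP formulation (16):

$$
\begin{aligned}
\text{minimize}_{\mathbf{w}} \quad & E_{\psi}[V^{\mathbf{w}}] \\
\text{subject to:} \quad & V^{\mathbf{w}} - T^* V^{\mathbf{w}} \ge 0
\end{aligned}
$$

if and only if it solves:

$$
\begin{aligned}
\text{minimize}_{\mathbf{w}} \quad & \|V^* - V^{\mathbf{w}}\|_{1,\psi} \\
\text{subject to:} \quad & V^{\mathbf{w}} - T^* V^{\mathbf{w}} \ge 0;
\end{aligned}
$$

where $\|\cdot\|_{1,\psi}$ is an $\mathcal{L}_1$-norm weighted by the state relevance density function $\psi$ and $T^*$ is the hybrid Bellman operator.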
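The reasoning behind Propositions 1 and 2 can be outlined as follows. This is a hedged sketch of the standard monotonicity argument for approximate linear programming, not a reproduction of the authors' proofs. It assumes that $T^*$ is monotone, that it is a contraction with the fixed point $V^*$, and that $E_{\psi}[V^*]$ is finite. For any feasible $\mathbf{w}$:

$$
V^{\mathbf{w}} \ge T^* V^{\mathbf{w}}
\;\Longrightarrow\;
V^{\mathbf{w}} \ge T^* V^{\mathbf{w}} \ge (T^*)^2 V^{\mathbf{w}} \ge \dots \ge \lim_{k \to \infty} (T^*)^k V^{\mathbf{w}} = V^*.
$$

Because $V^{\mathbf{w}} - V^*$ is nonnegative on the feasible set, the weighted norm loses its absolute value:

$$
\|V^* - V^{\mathbf{w}}\|_{1,\psi}
= E_{\psi}\big[\,|V^* - V^{\mathbf{w}}|\,\big]
= E_{\psi}[V^{\mathbf{w}}] - E_{\psi}[V^*].
$$

The term $E_{\psi}[V^*]$ does not depend on $\mathbf{w}$, so the two objectives differ by a constant on the feasible set and share the same minimizers.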

Based on Proposition 2, we conclude that HALP optimizes the linear value function approximation with respect to the reweighted $\mathcal{L}_1$-norm error $\|V^* - V^{\mathbf{w}}\|_{1,\psi}$. The following theorem draws a parallel between minimizing this objective and the max-norm error $\|V^* - V^{\mathbf{w}}\|_\infty$. More precisely, the theorem says that HALP yields a close approximation $V^{\tilde{\mathbf{w}}}$ to the optimal value function $V^*$ if $V^*$ is close to the span of basis functions $f_i(\mathbf{x})$.

###### Theorem 2

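Let $\tilde{\mathbf{w}}$ be an optimal solution to the HALP formulation (16). Then the expected error of the value function $V^{\tilde{\mathbf{w}}}$ can be bounded as:

$$
\left\| V^* - V^{\tilde{\mathbf{w}}} \right\|_{1,\psi} \le \frac{2}{1 - \gamma} \min_{\mathbf{w}} \left\| V^* - V^{\mathbf{w}} \right\|_\infty,
$$

where $\|\cdot\|_{1,\psi}$ is an $\mathcal{L}_1$-norm weighted by the state relevance density function $\psi$ and $\|\cdot\|_\infty$ is a max-norm.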

Unfortunately, Theorem 2 rarely yields a tight bound on $\|V^* - V^{\tilde{\mathbf{w}}}\|_{1,\psi}$. First, it is hard to guarantee a uniformly low max-norm error $\|V^* - V^{\mathbf{w}}\|_\infty$ if the dimensionality of a problem grows but the basis functions $f_i(\mathbf{x})$ are local. Second, the bound ignores the state relevance density function $\psi(\mathbf{x})$, although this function affects the quality of HALP solutions. To address these concerns, we introduce non-uniform weighting of the max-norm error in Theorem 3.

###### Theorem 3

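Let $\tilde{\mathbf{w}}$ be an optimal solution to the HALP formulation (16). Then the expected error of the value function $V^{\tilde{\mathbf{w}}}$ can be bounded as:

$$
\left\| V^* - V^{\tilde{\mathbf{w}}} \right\|_{1,\psi} \le \frac{2 E_{\psi}[L]}{1 - \kappa} \min_{\mathbf{w}} \left\| V^* - V^{\mathbf{w}} \right\|_{\infty, 1/L},
$$

where $\|\cdot\|_{1,\psi}$ is an $\mathcal{L}_1$-norm weighted by the state relevance density $\psi$, $L(\mathbf{x}) = \sum_i w_i^L f_i(\mathbf{x})$ is a Lyapunov function such that the inequality $\kappa L(\mathbf{x}) \ge \gamma \sup_{\mathbf{a}} E_{P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})}[L(\mathbf{x}')]$ holds, $\kappa \in [0, 1)$ denotes its contraction factor, and $\|\cdot\|_{\infty, 1/L}$ is a max-norm reweighted by the reciprocal $1/L$.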

Note that Theorem 2 is a special form of Theorem 3 when $L(\mathbf{x}) \equiv 1$ and $\kappa = \gamma$. Therefore, the Lyapunov function $L(\mathbf{x})$ permits bounds that are at least as good as those of Theorem 2. To make these bounds tight, the function $L(\mathbf{x})$ should return large values in the regions of the state space that are unimportant for modeling. In turn, the reciprocal $1/L(\mathbf{x})$ is close to zero in these undesirable regions, which makes their impact on the max-norm error $\|V^* - V^{\mathbf{w}}\|_{\infty, 1/L}$ less likely. Since the state relevance density function $\psi(\mathbf{x})$ reflects the importance of states, the term $E_{\psi}[L]$ should remain small. These two factors contribute to tighter bounds than those of Theorem 2.
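The Lyapunov condition of Theorem 3 is easy to check numerically on a small finite model. The sketch below is an illustration under stated assumptions, not part of the paper: it uses a hypothetical three-state, two-action discrete MDP, a hypothetical relevance distribution `PSI`, and hypothetical function names, and it replaces the supremum over actions and the expectation over hybrid states with a maximum and a finite sum.

```python
def contraction_factor(P, L, gamma):
    """Smallest kappa with kappa * L[x] >= gamma * max_a E[L(x') | x, a] for all x.

    P[a][x][y] is the probability of moving from x to y under action a,
    and L[x] > 0 is a candidate Lyapunov function.
    """
    kappa = 0.0
    for x in range(len(L)):
        for P_a in P:
            expected_L = sum(p * L[y] for y, p in enumerate(P_a[x]))
            kappa = max(kappa, gamma * expected_L / L[x])
    return kappa


def bound_constant(psi, L, kappa):
    """Leading factor 2 * E_psi[L] / (1 - kappa) of the bound in Theorem 3."""
    expected_L = sum(p * l for p, l in zip(psi, L))
    return 2.0 * expected_L / (1.0 - kappa)


# Hypothetical model: the dynamics drift toward state 0 under both actions.
GAMMA = 0.9
P = [
    [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.8, 0.0, 0.2]],  # action 0
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.5, 0.0, 0.5]],  # action 1
]
PSI = [0.8, 0.15, 0.05]   # relevance concentrated on state 0
L = [1.0, 2.0, 4.0]       # large values in the low-relevance states

kappa_uniform = contraction_factor(P, [1.0, 1.0, 1.0], GAMMA)
kappa_L = contraction_factor(P, L, GAMMA)
is_valid_lyapunov = kappa_L < 1.0
```

For the constant function the computed factor coincides with the discount factor, which matches the special case $L(\mathbf{x}) \equiv 1$, $\kappa = \gamma$. In this particular example the non-constant `L` is still a valid Lyapunov function, but its leading factor `bound_constant(PSI, L, kappa_L)` is larger than the factor $2 / (1 - \gamma)$ of Theorem 2, so any improvement in the overall bound has to come from the smaller reweighted max-norm error.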

Since the Lyapunov function $L(\mathbf{x})$ lies in the span of basis functions $f_i(\mathbf{x})$, Theorem 3 provides a recipe for achieving high-quality approximations. Intuitively, a good set of basis functions always involves two types of functions. The first type guarantees small errors in the important regions of the state space, where the state relevance density is high. The second type returns high values where the state relevance density is low, and vice versa. The latter functions allow the satisfaction of the constraint space in the unimportant regions of the state space without impacting the optimized objective function $\|V^* - V^{\mathbf{w}}\|_{1,\psi}$. Note that a trivial value function $V^{\mathbf{w}}(\mathbf{x}) = (1 - \gamma)^{-1} R_{\max}$ satisfies all constraints in any HALP but is unlikely to lead to good policies. For a comprehensive discussion on selecting appropriate $\psi(\mathbf{x})$ and $L(\mathbf{x})$, refer to the case studies of ? (?).

We conclude our discussion by clarifying the notion of the state relevance density $\psi(\mathbf{x})$. As demonstrated by Theorem 4, its choice is closely related to the quality of a greedy policy for the value function $V^{\tilde{\mathbf{w}}}$ (?).

###### Theorem 4

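Let $\tilde{\mathbf{w}}$ be an optimal solution to the HALP formulation (16). Then the expected error of a greedy policy:

$$
u(\mathbf{x}) = \arg\sup_{\mathbf{a}} \left[ R(\mathbf{x}, \mathbf{a}) + \gamma E_{P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})}\left[ V^{\tilde{\mathbf{w}}}(\mathbf{x}') \right] \right]
$$

can be bounded as:

$$
\left\| V^* - V^{u} \right\|_{1,\varphi} \le \frac{1}{1 - \gamma} \left\| V^* - V^{\tilde{\mathbf{w}}} \right\|_{1,\mu_{u,\varphi}},
$$

where $\|\cdot\|_{1,\varphi}$ and $\|\cdot\|_{1,\mu_{u,\varphi}}$ are weighted $\mathcal{L}_1$-norms, $V^{u}$ is a value function for the greedy policy $u$, and $\mu_{u,\varphi}$ is the expected frequency of state visits generated by following the policy $u$ given the initial state distribution $\varphi$.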
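For intuition, the greedy policy in Theorem 4 can be extracted from a value function as in the sketch below. It is a simplified illustration under stated assumptions: states and actions are finite, so the supremum becomes a maximum and the expectation a finite sum, and the two-state model, the value function `V`, and the name `greedy_policy` are hypothetical.

```python
def greedy_policy(R, P, V, gamma):
    """Return u[x] = argmax_a (R[x][a] + gamma * E[V(x') | x, a]).

    R[x][a] is the reward, P[a][x][y] the transition probability,
    and V[y] the value function being acted upon.
    """
    policy = []
    for x in range(len(V)):
        def q_value(a):
            expected_next = sum(p * V[y] for y, p in enumerate(P[a][x]))
            return R[x][a] + gamma * expected_next
        policy.append(max(range(len(P)), key=q_value))
    return policy


# Hypothetical two-state model: action 0 stays put, action 1 switches state.
GAMMA = 0.9
R = [[1.0, 0.0], [0.0, 0.0]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
]
V = [0.0, 10.0]

policy = greedy_policy(R, P, V, GAMMA)
```

In this example the value function favors state 1, so the greedy policy switches out of state 0 and stays in state 1, even though staying in state 0 yields an immediate reward.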

Based on Theorem 4, we may conclude that the expected error of greedy policies for HALP approximations is bounded when $\psi = \mu_{u,\varphi}$. Note that the distribution $\mu_{u,\varphi}$ is unknown when optimizing $V^{\tilde{\mathbf{w}}}$ because it is a function of the optimized quantity itself. To break this cycle, ? (?) suggested an iterative procedure that solves several LPs and adapts $\mu_{u,\varphi}$ accordingly. In addition, real-world control problems exhibit a lot of structure, which permits an informed guess of $\mu_{u,\varphi}$.

Finally, it is important to realize that although our bounds (Theorems 3 and 4) build a foundation for better HALP approximations, they can rarely be used in practice because the optimal value function $V^*$ is generally unknown. After all, if it were known, there would be no need to approximate it. Moreover, note that the optimization of $\|V^* - V^{\mathbf{w}}\|_{\infty, 1/L}$ (Theorem 3) is a hard problem and there are no methods that would minimize this error directly (?). Despite these facts, both bounds provide loose guidance for empirical choices of basis functions. In Section 7, we use this intuition and propose basis functions that should closely approximate unknown optimal value functions $V^*$.

### 5.2 Expectation Terms

Since our basis functions are often restricted to small subsets of state variables, expectation terms (Equations 17 and 18) in the HALP formulation (16) should be efficiently computable. To unify the analysis of these expectation terms, $E_{\psi(\mathbf{x})}[f_i(\mathbf{x})]$ and $E_{P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})}[f_i(\mathbf{x}')]$, we show that their evaluation constitutes the same computational problem $E_{P(\mathbf{x})}[f_i(\mathbf{x})]$, where $P(\mathbf{x})$ denotes some factored distribution.

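Before we discuss expectation terms in the constraints, note that the transition function $P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})$ is factored and its parameterization is determined by the state-action pair $(\mathbf{x}, \mathbf{a})$. We keep the pair $(\mathbf{x}, \mathbf{a})$ fixed in the rest of the section, which corresponds to choosing a single constraint $(\mathbf{x}, \mathbf{a})$. Based on this selection, we rewrite the expectation terms $E_{P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})}[f_i(\mathbf{x}')]$ in a simpler notation $E_{P(\mathbf{x}')}[f_i(\mathbf{x}')]$, where $P(\mathbf{x}') = P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})$ denotes a factored distribution with fixed parameters.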

We also assume that the state relevance density function $\psi(\mathbf{x})$ factors along $\mathbf{X}$ as:

$$
\psi(\mathbf{x}) = \prod_{i=1}^{n} \psi_i(x_i),
\tag{20}
$$

where $\psi_i(x_i)$ is a distribution over the random state variable $X_i$. Based on this assumption, we can rewrite the expectation terms $E_{\psi(\mathbf{x})}[f_i(\mathbf{x})]$ in the objective function in a new notation $E_{P(\mathbf{x})}[f_i(\mathbf{x})]$, where $P(\mathbf{x}) = \psi(\mathbf{x})$ denotes a factored distribution. In line with our discussion in the last two paragraphs, efficient solutions to the expectation terms in HALP are obtained by solving the generalized term $E_{P(\mathbf{x})}[f_i(\mathbf{x})]$ efficiently. We address this problem in the rest of the section.
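The benefit of a fully factored distribution, as in Equation 20, can be seen in a small discrete example. The sketch below is only an illustration: the four binary variables, their marginals, the basis function, and both function names are hypothetical, and continuous variables would replace the inner sums with integrals. It compares a brute-force expectation over all joint assignments with one that enumerates only the variables in the scope of the basis function.

```python
from itertools import product


def expectation_full(marginals, f, scope):
    """Brute force: enumerate every joint assignment of all variables."""
    total = 0.0
    for x in product(*(range(len(m)) for m in marginals)):
        prob = 1.0
        for i, x_i in enumerate(x):
            prob *= marginals[i][x_i]
        total += prob * f(tuple(x[i] for i in scope))
    return total


def expectation_scoped(marginals, f, scope):
    """Enumerate only the variables in the scope of f.

    Under a product distribution, the variables outside the scope
    sum (or integrate) to one and drop out of the expectation.
    """
    total = 0.0
    for x_scope in product(*(range(len(marginals[i])) for i in scope)):
        prob = 1.0
        for i, x_i in zip(scope, x_scope):
            prob *= marginals[i][x_i]
        total += prob * f(x_scope)
    return total


# Hypothetical factored distribution over four binary variables.
MARGINALS = [[0.3, 0.7], [0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
SCOPE = (0, 2)  # the basis function depends on variables 0 and 2 only


def basis(x_scope):
    """Hypothetical basis function defined on the scope (0, 2)."""
    return x_scope[0] + 2.0 * x_scope[1]


full = expectation_full(MARGINALS, basis, SCOPE)
scoped = expectation_scoped(MARGINALS, basis, SCOPE)
```

Both functions return the same value, but `expectation_full` visits a number of assignments that is exponential in the total number of state variables, whereas `expectation_scoped` is exponential only in the size of the scope.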

Before computing the expectation term $E_{P(\mathbf{x})}[f_i(\mathbf{x})]$ over the complete state space $\mathbf{X}$, we recall that the basis function $f_i(\mathbf{x})$ is defined on a subset of state variables $\mathbf{X}_i$. Therefore, we may concl