Prediction, Consistency, Curvature: Representation Learning for Locally-Linear Control

Abstract

Many real-world sequential decision-making problems can be formulated as optimal control with high-dimensional observations and unknown dynamics. A promising approach is to embed the high-dimensional observations into a lower-dimensional latent representation space, estimate the latent dynamics model, and then utilize this model for control in the latent space. An important open question is how to learn a representation that is amenable to existing control algorithms. In this paper, we focus on learning representations for locally-linear control algorithms, such as iterative LQR (iLQR). By formulating and analyzing the representation learning problem from an optimal control perspective, we establish three underlying principles that the learned representation should satisfy: 1) accurate prediction in the observation space, 2) consistency between latent and observation space dynamics, and 3) low curvature in the latent space transitions. These principles naturally correspond to a loss function that consists of three terms: prediction, consistency, and curvature (PCC). Crucially, to make PCC tractable, we derive an amortized variational bound for the PCC loss function. Extensive experiments on benchmark domains demonstrate that the new variational-PCC learning algorithm benefits from significantly more stable and reproducible training, and leads to superior control performance. Further ablation studies support the importance of all three PCC components for learning a good latent space for control.

1 Introduction

Decomposing the problem of decision-making in an unknown environment into estimating dynamics followed by planning provides a powerful framework for building intelligent agents. This decomposition confers several notable benefits. First, it enables the handling of sparse-reward environments by leveraging the dense signal of dynamics prediction. Second, once a dynamics model is learned, it can be shared across multiple tasks within the same environment. While the merits of this decomposition have been demonstrated in low-dimensional environments (Deisenroth and Rasmussen, 2011; Gal et al., 2016), scaling these methods to high-dimensional environments remains an open challenge.

The recent advancements in generative models have enabled successful dynamics estimation for high-dimensional decision processes (Watter et al., 2015; Ha and Schmidhuber, 2018; Kurutach et al., 2018). The learned dynamics can then be used in conjunction with a plethora of decision-making techniques, ranging from optimal control to reinforcement learning (RL) (Watter et al., 2015; Banijamali et al., 2018; Finn et al., 2016; Chua et al., 2018; Ha and Schmidhuber, 2018; Kaiser et al., 2019; Hafner et al., 2018; Zhang et al., 2019). One particularly promising line of work in this area focuses on learning the dynamics and conducting control in a low-dimensional latent embedding of the observation space, where the embedding itself is learned through this process (Watter et al., 2015; Banijamali et al., 2018; Hafner et al., 2018; Zhang et al., 2019). We refer to this approach as learning controllable embedding (LCE). There have been two main approaches to this problem: 1) start by defining a cost function in the high-dimensional observation space and learn the embedding space, its dynamics, and the reward function by interacting with the environment in an RL fashion (Hafner et al., 2018; Zhang et al., 2019), and 2) first learn the embedding space and its dynamics, and then define a cost function in this low-dimensional space and conduct the control (Watter et al., 2015; Banijamali et al., 2018). The latter can later be combined with RL for further fine-tuning of the model and control.

In this paper, we take the second approach and focus on the important question of which desirable traits the latent embedding should exhibit for it to be amenable to a specific class of control/learning algorithms, namely the widely used class of locally-linear control (LLC) algorithms. We argue from an optimal control standpoint that the latent space should exhibit three properties. The first is prediction: given the ability to encode to and decode from the latent space, we expect the process of encoding, transitioning via the latent dynamics, and then decoding to adhere to the true observation dynamics. The second is consistency: given the ability to encode an observation trajectory sampled from the true environment, we expect the latent dynamics to be consistent with the encoded trajectory. The third is curvature: in order to learn a latent space that is specifically amenable to LLC algorithms, we expect the (learned) latent dynamics to exhibit low curvature, so as to minimize the approximation error of the first-order Taylor expansion employed by LLC algorithms. Our contributions are thus as follows: (1) We propose the Prediction, Consistency, and Curvature (PCC) framework for learning a latent space that is amenable to LLC algorithms, and show that the elements of PCC arise systematically from bounding the suboptimality of the solution of the LLC algorithm in the latent space. (2) We design a latent variable model that adheres to the PCC framework and derive a tractable variational bound for training the model. (3) To the best of our knowledge, our proposed curvature loss for the (latent-space) transition dynamics is novel. We also propose a direct amortization of the Jacobian calculation in the curvature loss to make training with the curvature loss more efficient. (4) Through extensive experimental comparison, we show that the PCC model consistently outperforms E2C (Watter et al., 2015) and RCE (Banijamali et al., 2018) on a number of control-from-images tasks, and verify, via ablation studies, the importance of regularizing the model to have consistency and low curvature.

2 Problem Formulation

We are interested in controlling non-linear dynamical systems of the form , over the horizon . In this definition, and are the state and action of the system at time step , is the Gaussian system noise, and is a smooth non-linear system dynamics. We are particularly interested in the scenario in which we only have access to the high-dimensional observation of each state (). This scenario has applications in many real-world problems, such as visual-servoing (Espiau et al., 1992), in which we only observe high-dimensional images of the environment and not its underlying state. We further assume that the high-dimensional observations have been selected such that for any arbitrary control sequence , the observation sequence is generated by a stationary Markov process, i.e., .1

A common approach to control the above dynamical system is to solve the following stochastic optimal control (SOC) problem (Shapiro et al., 2009) that minimizes expected cumulative cost:

(SOC1)

where is the immediate cost function at time , is the terminal cost, and is the observation at the initial state . Note that all immediate costs are defined in the observation space , and are bounded by and Lipschitz with constant . For example, in visual-servoing, (SOC1) can be formulated as a goal tracking problem (Ebert et al., 2018), where we control the robot to reach the goal observation , and the objective is to compute a sequence of optimal open-loop actions that minimizes the cumulative tracking error .

Since the observations are high dimensional and the dynamics in the observation space is unknown, solving (SOC1) is often intractable. To address this issue, a class of algorithms has recently been developed that learns a low-dimensional latent (embedding) space () and latent state dynamics, and performs optimal control there. This class, which we refer to as learning controllable embedding (LCE) throughout the paper, includes recently developed algorithms such as E2C (Watter et al., 2015), RCE (Banijamali et al., 2018), and SOLAR (Zhang et al., 2019). The main idea behind the LCE approach is to learn a triplet: (i) an encoder ; (ii) a dynamics in the latent space ; and (iii) a decoder . These in turn can be thought of as defining a (stochastic) mapping of the form . We then wish to solve the SOC in latent space :

(SOC2)

such that the solution of (SOC2), , has similar performance to that of (SOC1), , i.e., . In (SOC2), is the initial latent state sampled from the encoder ; is the latent cost function defined as ; is a regularizer over the mapping ; and is the corresponding regularization parameter. We will define and more precisely in Section 3. Note that the expectation in (SOC2) is over the randomness generated by the (stochastic) encoder .
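To make the encoder-dynamics-decoder triplet concrete, the following minimal sketch (not the authors' implementation; the network sizes, diagonal-Gaussian parameterization, and dimensions are illustrative assumptions) shows how an observation can be encoded, transitioned through the latent dynamics, and decoded back to the next observation.

```python
# Minimal sketch of the LCE triplet: an encoder E(z|x), a latent dynamics model F(z'|z,u),
# and a decoder D(x|z). All sizes and the Gaussian parameterization are illustrative.
import torch
import torch.nn as nn

class GaussianMLP(nn.Module):
    """MLP that outputs the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * out_dim))

    def forward(self, x):
        mu, logvar = self.net(x).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, (0.5 * logvar).exp())

x_dim, z_dim, u_dim = 1600, 2, 1                  # e.g., 40x40 images, 2-D latent space
encoder  = GaussianMLP(x_dim, z_dim)              # E : X -> Z
dynamics = GaussianMLP(z_dim + u_dim, z_dim)      # F : Z x U -> Z
decoder  = GaussianMLP(z_dim, x_dim)              # D : Z -> X

# One stochastic pass x_t -> z_t -> z_{t+1} -> x_{t+1}
x_t, u_t = torch.rand(8, x_dim), torch.rand(8, u_dim)
z_t   = encoder(x_t).rsample()
z_tp1 = dynamics(torch.cat([z_t, u_t], dim=-1)).rsample()
x_tp1 = decoder(z_tp1).mean
```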

3 PCC Model: A Control Perspective

Figure 1: Evolution of the states: (a) (blue) in (SOC1) under the true observation dynamics, (b) (green) in (SOC2) under the latent dynamics, and (c) (red) in (SOC3) under the learned observation dynamics.

As described in Section 2, we are primarily interested in solving (SOC1), whose states evolve under dynamics , as shown in Figure 1(a) (blue). However, because of the difficulties in solving (SOC1), mainly due to the high dimension of the observations , LCE proposes to learn a mapping by solving (SOC2), which consists of a loss function, whose states evolve under dynamics (after an initial transition by the encoder ), as depicted in Figure 1(b), and a regularization term. The role of the regularizer is to account for the performance gap between (SOC1) and the loss function of (SOC2), due to the discrepancy between their evolution paths, shown in Figures 1(a) (blue) and 1(b) (green). The goal of LCE is to learn of the particular form , described in Section 2, such that the solution of (SOC2) has similar performance to that of (SOC1). In this section, we propose a principled way to select the regularizer to achieve this goal. Since the exact form of (SOC2) has a direct effect on learning , designing this regularization term, in turn, provides us with a recipe (loss function) for learning the latent (embedded) space . In the following subsections, we show that this loss function consists of three terms that correspond to prediction, consistency, and curvature, the three ingredients of our PCC model.

Note that these two SOCs evolve in two different spaces, one in the observation space under dynamics , and the other in the latent space (after an initial transition from to ) under dynamics . Unlike and , which each operate in a single space, and , respectively, can govern the evolution of the system in both and (see Figure 1(c)). Therefore, any recipe to learn , and as a result the latent space , should have at least two terms, to guarantee that the evolution paths resulting from in and are consistent with those generated by and . We derive these two terms, which are the prediction and consistency terms in the loss function used by our PCC model, in Sections 3.1 and 3.2, respectively. While these two terms result from learning for general SOC problems, in Section 3.3 we concentrate on the particular class of LLC algorithms (e.g., iLQR (Li and Todorov, 2004)) for solving the SOC, and add the third term, curvature, to our recipe for learning .

3.1 Prediction of the Next Observation

Figures 1(a)(blue) and 1(c)(red) show the transition in the observation space under and , where is the current observation, and and are the next observations under these two dynamics, respectively. Instead of learning a with minimum mismatch with in terms of some distribution norm, we propose to learn by solving the following SOC:

(SOC3)

whose loss function is the same as the one in (SOC1), with the true dynamics replaced by . In Lemma 3.1 (see Appendix A.1 for proof), we show how to set the regularization term in (SOC3) such that the control sequence resulting from solving (SOC3), , has similar performance to the solution of (SOC1), , i.e., .

Lemma 3.1.

Let be a solution to (SOC1) and be a solution to (SOC3) with

(1)

Then, we have .

In Eq. 1, the expectation is over the state-action stationary distribution of the policy used to generate the training samples (uniformly random policy in this work), and is the Lebesgue measure of .2

3.2 Consistency in Prediction of the Next Latent State

In Section 3.1, we provided a recipe for learning (in the form of ) by introducing an intermediate problem (SOC3) that evolves in the observation space according to dynamics . In this section, we first connect (SOC2), which operates in , with (SOC3), which operates in . For simplicity and without loss of generality, assume the initial cost is zero.3 Lemma 3.2 (see Appendix A.2 for proof) suggests how we should set the regularizer in (SOC2) such that its solution performs similarly to that of (SOC3), under their corresponding dynamics models.

Lemma 3.2.

Let be a solution to (SOC3) and be a solution to (SOC2) with

(2)

Then, we have .

Similar to Lemma 3.1, in Eq. 2 the expectation is over the state-action stationary distribution of the policy used to generate the training samples. Moreover, and are the probabilities over the next latent state , given the current observation and action , in (SOC2) and (SOC3) (see the paths and in Figures 1(b) (green) and 1(c) (red)). Therefore can be interpreted as a measure of discrepancy between these models, which we term the consistency loss.

Although Lemma 3.2 provides a recipe to learn by solving (SOC2) with the regularizer (2), unfortunately this regularizer cannot be computed from the data, which is of the form , because its first term requires marginalizing over the current and next latent states ( and in Figure 1(c)). To address this issue, we propose to use the (computable) regularizer

(3)

in which the expectation is over sampled from the training data. Corollary 3.2 (see Appendix A.3 for proof) bounds the performance loss resulting from using instead of , and shows that it can still be a reasonable choice.

Corollary 3.2.

Let be a solution to (SOC3) and be a solution to (SOC2) with and , where and are defined by (3) and (2). Then, we have .

Lemma 3.1 suggests a regularizer to connect the solutions of (SOC1) and (SOC3). Similarly, Corollary 3.2 shows that the regularizer in (3) establishes a connection between the solutions of (SOC3) and (SOC2). Putting these results together, we achieve our goal in Lemma 3.3 (see Appendix A.4 for proof): to design a regularizer for (SOC2) such that its solution performs similarly to that of (SOC1).

Lemma 3.3.

Let be a solution to (SOC1) and be a solution to (SOC2) with

(4)

where and are defined by (1) and (3). Then, we have

3.3 Locally-Linear Control in the Latent Space and Curvature Regularization

In Sections 3.1 and 3.2, we derived a loss function for learning the latent space . This loss function, which was motivated by the general SOC perspective, consists of two terms that encourage the latent space not only to predict the next observations accurately, but also to be suitable for control. In this section, we focus on the class of locally-linear control (LLC) algorithms (e.g., iLQR) for solving (SOC2), and show how this choice adds a third term, corresponding to curvature, to the regularizer of (SOC2), and as a result, to the loss function of our PCC model.

The main idea in LLC algorithms is to iteratively compute an action sequence that improves the current trajectory, by linearizing the dynamics around this trajectory, and then use this action sequence to generate the next trajectory (see Appendix B for more details about LLC and iLQR). This procedure implicitly assumes that the dynamics are approximately locally linear. To ensure this in (SOC2), we further restrict the dynamics and assume that it is not only of the form , but also that , the latent space dynamics, has low curvature. One way to ensure this in (SOC2) is to directly impose a penalty over the curvature of the latent space transition function . Assume , where is Gaussian noise. Consider the following SOC problem:

(SOC-LLC)

where is defined by (4); is optimized by an LLC algorithm, such as iLQR; and is given by

(5)

where is a tunable parameter that characterizes the “diameter” of the latent state-action space in which the latent dynamics model has low curvature. Here , where is the minimum non-zero measure of the sample distribution w.r.t. , and is a probability threshold. Lemma 3.4 (see Appendix A.5 for proof and a discussion of how affects LLC performance) shows that a solution of (SOC-LLC) has similar performance to a solution of (SOC1), and thus, (SOC-LLC) is a reasonable optimization problem for learning , and with it the latent space .

Lemma 3.4. Let be an LLC solution to (SOC-LLC) and be a solution to (SOC1). Suppose the nominal latent state-action trajectory satisfies the condition: , where is the optimal trajectory of (SOC2). Then with probability , we have .

In practice, instead of solving (SOC-LLC) jointly for and , we treat (SOC-LLC) as a bi-level optimization problem: first, we solve the inner optimization problem for , i.e.,

(PCC-LOSS)

where is the negative log-likelihood,4 and then solve the outer optimization problem, , where , to obtain the optimal control sequence . Solving (SOC-LLC) this way is, in general, an approximation, but it is justified when the regularization parameter is large. Note that we leave the regularization parameters as hyper-parameters of our algorithm and do not use those derived in the lemmas of this section. Since the loss for learning in (PCC-LOSS) enforces (i) prediction accuracy, (ii) consistency in latent state prediction, and (iii) low curvature over , through the regularizers , , and , respectively, we refer to it as the prediction-consistency-curvature (PCC) loss.
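To make the structure of (PCC-LOSS) concrete, the sketch below shows one stochastic-gradient step on the combined objective. The method names (prediction_bound, consistency_bound, curvature_loss) are hypothetical placeholders for the three regularizers instantiated in Section 4, and the lambda weights are the hyper-parameters mentioned above.

```python
# Hedged sketch of one training step on (PCC-LOSS); the three loss methods are assumed
# to be implemented as described in Section 4 and are not part of any specific library.
def pcc_training_step(model, optimizer, batch, lam_p=1.0, lam_c=1.0, lam_cur=1.0):
    x_t, u_t, x_tp1 = batch
    loss = (lam_p * model.prediction_bound(x_t, u_t, x_tp1)      # prediction term
            + lam_c * model.consistency_bound(x_t, u_t, x_tp1)   # consistency term
            + lam_cur * model.curvature_loss(x_t, u_t))          # curvature term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```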

4 Instantiating the PCC Model in Practice

The PCC model objective in (PCC-LOSS) introduces the optimization problem . To instantiate this model in practice, we describe as a latent variable model that factorizes as . In this section, we propose a variational approximation to the intractable negative log-likelihood and batch-consistency losses, and an efficient approximation of the curvature loss .

4.1 Variational PCC

The negative log-likelihood5 admits a variational bound via Jensen’s inequality,

(6)

which holds for any choice of recognition model . For simplicity, we assume the recognition model employs bottom-up inference and thus factorizes as . The main idea behind choosing a backward-facing model is to allow the model to learn to account for noise in the underlying dynamics. We estimate the expectations in (6) via Monte Carlo simulation. To reduce the variance of the estimator, we decompose further into

and note that the entropy and Kullback-Leibler terms are analytically tractable when is restricted to a suitably chosen variational family (i.e., in our experiments, and are factorized Gaussians). The derivation is provided in Appendix C.1.
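As a concrete illustration, the following sketch shows a one-sample Monte Carlo estimate of this variational bound, assuming (as in the earlier sketch) that the encoder, dynamics, decoder, and the backward recognition model q(z_t | z_{t+1}, x_t, u_t) are modules returning diagonal Gaussians; the module names and interfaces are illustrative assumptions, not the authors' code.

```python
# Sketch of the variational prediction bound (6): reconstruction and transition terms are
# estimated by sampling, while the entropy and KL terms are computed in closed form for
# factorized Gaussians. All module names/interfaces are assumptions for illustration.
import torch

def prediction_bound(encoder, dynamics, decoder, backward_q, x_t, u_t, x_tp1):
    q_ztp1 = encoder(x_tp1)                                   # q(z_{t+1} | x_{t+1})
    z_tp1 = q_ztp1.rsample()
    q_zt = backward_q(torch.cat([z_tp1, x_t, u_t], dim=-1))   # q(z_t | z_{t+1}, x_t, u_t)
    z_t = q_zt.rsample()

    recon = -decoder(z_tp1).log_prob(x_tp1).sum(-1)                          # -log D(x_{t+1}|z_{t+1})
    trans = -dynamics(torch.cat([z_t, u_t], dim=-1)).log_prob(z_tp1).sum(-1) # -log F(z_{t+1}|z_t,u_t)
    neg_ent = -q_ztp1.entropy().sum(-1)                                      # -H(q(z_{t+1}|x_{t+1}))
    kl = torch.distributions.kl_divergence(q_zt, encoder(x_t)).sum(-1)       # KL(q(z_t|.) || E(z_t|x_t))
    return (recon + trans + neg_ent + kl).mean()
```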

Interestingly, the consistency loss admits a similar treatment. We note that the consistency loss seeks to match the distribution of with , which we represent below as

Here, is intractable due to the marginalization of . We employ the same procedure as in (6) to construct a tractable variational bound

We now make the further simplifying assumption that . This allows us to rewrite the expression as

(7)

which is a subset of the terms in (6). See Appendix C.2 for a detailed derivation.
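Under the same illustrative assumptions as the prediction sketch above, the consistency bound can be estimated by simply dropping the reconstruction term:

```python
# Sketch of the variational consistency bound (7); it reuses the non-reconstruction terms
# of the prediction bound, following the simplifying assumption
# q(z_{t+1}|x_t,u_t,x_{t+1}) = q(z_{t+1}|x_{t+1}). Module names/interfaces are assumptions.
import torch

def consistency_bound(encoder, dynamics, backward_q, x_t, u_t, x_tp1):
    q_ztp1 = encoder(x_tp1)
    z_tp1 = q_ztp1.rsample()
    q_zt = backward_q(torch.cat([z_tp1, x_t, u_t], dim=-1))
    z_t = q_zt.rsample()

    trans = -dynamics(torch.cat([z_t, u_t], dim=-1)).log_prob(z_tp1).sum(-1)
    neg_ent = -q_ztp1.entropy().sum(-1)
    kl = torch.distributions.kl_divergence(q_zt, encoder(x_t)).sum(-1)
    return (trans + neg_ent + kl).mean()
```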

4.2 Curvature Regularization and Amortized Gradient

In practice we use a variant of the curvature loss where Taylor expansions and gradients are evaluated at and ,

(8)

When is large, evaluating and differentiating through the Jacobians can be slow. To circumvent this issue, the Jacobian evaluation can be amortized by treating the Jacobians as the coefficients of the best linear approximation at the evaluation point. This leads to a new amortized curvature loss

(9)

where and are function approximators to be optimized. Intuitively, for any given , the amortized curvature loss seeks the best choice of linear approximation induced by and such that the behavior of in the neighborhood of is approximately linear.
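A minimal sketch of the amortized curvature loss follows. Here f_mean is assumed to return the mean of the latent dynamics, A_net and B_net are the auxiliary function approximators producing per-sample linearization coefficients, and delta sets the scale of the sampled perturbations; all names and shapes are illustrative assumptions.

```python
# Hedged sketch of the amortized curvature loss (9): sample perturbations around (z, u)
# and penalize the difference between the perturbed dynamics and a learned linear model
# A(z,u) eps_z + B(z,u) eps_u, avoiding differentiation through Jacobians.
import torch

def amortized_curvature_loss(f_mean, A_net, B_net, z, u, delta=0.1):
    eps_z = delta * torch.randn_like(z)
    eps_u = delta * torch.randn_like(u)
    f0 = f_mean(z, u)
    f1 = f_mean(z + eps_z, u + eps_u)
    # Per-sample linearization coefficients: A is (batch, z_dim, z_dim), B is (batch, z_dim, u_dim).
    A = A_net(z, u).view(z.size(0), z.size(1), z.size(1))
    B = B_net(z, u).view(z.size(0), z.size(1), u.size(1))
    lin = (torch.bmm(A, eps_z.unsqueeze(-1)) + torch.bmm(B, eps_u.unsqueeze(-1))).squeeze(-1)
    return ((f1 - f0 - lin) ** 2).sum(-1).mean()
```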

5 Relation to Previous Embed-to-Control Approaches

In this section, we highlight the key differences between PCC and the closest previous works, namely E2C and RCE. A key distinguishing factor is PCC’s use of a nonlinear latent dynamics model paired with an explicit curvature loss. In comparison, E2C and RCE both employed “locally-linear dynamics” of the form , where and are auxiliary random variables meant to be perturbations of and . When contrasted with (9), it is clear that neither nor in the E2C/RCE formulation can be treated as the Jacobians of the dynamics, and hence the curvature of the dynamics is not controlled explicitly. Furthermore, since the locally-linear dynamics are wrapped inside the maximum-likelihood estimation, both E2C and RCE conflate the two key elements, prediction and curvature. This makes controlling the stability of training much more difficult. Not only does PCC explicitly separate these two components, but we are also the first to explicitly demonstrate, theoretically and empirically, that the curvature loss is important for iLQR.

Furthermore, RCE does not incorporate PCC’s consistency loss. Note that PCC, RCE, and E2C are all Markovian encoder-transition-decoder frameworks. Under such a framework, sole reliance on minimizing the prediction loss results in a discrepancy between how the model is trained (maximizing the likelihood induced by encoding-transitioning-decoding) and how it is used at test time for control (continual transitioning in the latent space without ever decoding). By explicitly minimizing the consistency loss, PCC reduces the discrepancy between how the model is trained and how it is used at test time for planning. Interestingly, E2C does include a regularization term that is akin to PCC’s consistency loss. However, as noted by the authors of RCE, E2C’s maximization of the pair-marginal log-likelihood of , as opposed to the conditional likelihood of given , means that E2C does not properly minimize the prediction loss prescribed by the PCC framework.

6 Experiments

Figure 2: Top: Planar latent representations; Bottom: Inverted Pendulum latent representations (randomly selected): left two: RCE, middle two: E2C, right two: PCC.

In this section, we compare the performance of PCC with two model-based control algorithm baselines: RCE6 (Banijamali et al., 2018) and E2C (Watter et al., 2015), and we run a thorough ablation study on the various components of PCC. The experiments are based on the following continuous control benchmark domains (see Appendix D for more descriptions): (i) Planar System, (ii) Inverted Pendulum, (iii) Cartpole, (iv) -link manipulator, and (v) the TORCS simulator7 (Wymann et al., 2000).

To generate our training and test sets, each consisting of triples , we: (1) sample an underlying state and generate its corresponding observation , (2) sample an action , and (3) obtain the next state according to the state transition dynamics, add to it zero-mean Gaussian noise with variance , and generate the corresponding observation . To ensure that the observation-action data is uniformly distributed (see Section 3), we sample the state-action pair uniformly from the state-action space. To understand the robustness of each model, we consider both deterministic () and stochastic scenarios. In the stochastic case, we add noise to the system with different values of and evaluate the models’ performance under various degrees of noise.
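The sketch below illustrates this data-generation procedure; the step and render functions and the bounds of the state and action spaces are hypothetical stand-ins for each domain's simulator.

```python
# Illustrative sketch of generating one training triple (x_t, u_t, x_{t+1}): sample the
# state and action uniformly, apply the (noisy) transition, and store observations only.
# `step` and `render` are hypothetical simulator hooks, not part of the released code.
import numpy as np

def sample_triple(step, render, state_low, state_high, action_low, action_high, sigma=0.0):
    s_t = np.random.uniform(state_low, state_high)    # underlying state, sampled uniformly
    u_t = np.random.uniform(action_low, action_high)  # action, sampled uniformly
    s_tp1 = step(s_t, u_t) + sigma * np.random.randn(*np.shape(s_t))  # noisy transition
    return render(s_t), u_t, render(s_tp1)            # only observations are kept
```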

Each task has underlying start and goal states that are unobservable to the algorithms; instead, the algorithms have access to the corresponding start and goal observations. We apply control using the iLQR algorithm (see Appendix B), with the same cost function that was used by RCE and E2C, namely, , and , where is obtained by encoding the goal observation, and , 8. Details of our implementations are specified in Appendix D.3. We report performance in the underlying system, specifically the percentage of time spent in the goal region9.
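For illustration, a sketch of a quadratic latent-space tracking cost of this kind is shown below; the quadratic form and the weight matrices Q and R are stated here as assumptions (their specific values, elided above, are hyper-parameters), and the goal latent state is taken to be the encoder mean at the goal observation.

```python
# Sketch of a quadratic latent-space cost for iLQR, assuming a goal-tracking form
# c(z, u) = (z - z_goal)^T Q (z - z_goal) + u^T R u. Q, R, and z_goal are assumptions.
import torch

def latent_cost(z, u, z_goal, Q, R):
    dz = z - z_goal
    return dz @ (Q @ dz) + u @ (R @ u)

# Example (illustrative values): z_goal = encoder(x_goal).mean
# Q = torch.eye(z_dim); R = 0.1 * torch.eye(u_dim)
```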

A Reproducible Experimental Pipeline In order to measure performance reproducibility, we perform the following two-step pipeline. For each control task and algorithm, we (1) train models independently, and (2) solve control tasks per model (we do not cherry-pick, but instead perform a total of control tasks). We report statistics averaged over all the tasks (in addition, we report the best performing model averaged over its tasks). By adopting a principled and statistically reliable evaluation pipeline, we also address a pitfall of the compared baselines, in which the best model was cherry-picked and training variance was not reported.

Results Table 1 shows how PCC outperforms the baseline algorithms in the noiseless dynamics case by comparing means and standard deviations of the means on the different control tasks (for the case of added noise to the dynamics, which exhibits similar behavior, refer to Appendix E.1). It is important to note that for each algorithm, the performance metric averaged over all models is drastically different from that of the best model, which justifies our rationale for using the reproducible evaluation pipeline and avoiding cherry-picking when reporting. Figure 2 depicts instances (randomly chosen from the trained models) of the learned latent space representations on the noiseless dynamics of the Planar and Inverted Pendulum tasks for the PCC, RCE, and E2C models (additional representations can be found in Appendix E.2). Representations were generated by encoding observations corresponding to a uniform grid over the state space. Generally, PCC has a more interpretable representation of both the Planar and Inverted Pendulum systems than the other baselines, for both the noiseless and the noisy dynamics cases. Finally, in terms of computation, PCC demonstrates faster training, with improvement over RCE and improvement over E2C.10

Domain | RCE (all) | E2C (all) | PCC (all) | RCE (top 1) | E2C (top 1) | PCC (top 1)
Planar | 2.1 ± 0.8 | 5.5 ± 1.7 | 35.7 ± 3.4 | 9.2 ± 1.4 | 36.5 ± 3.6 | 72.1 ± 0.4
Pendulum | 24.7 ± 3.1 | 46.8 ± 4.1 | 58.7 ± 3.7 | 68.8 ± 2.2 | 89.7 ± 0.5 | 90.3 ± 0.4
Cartpole | 59.5 ± 4.1 | 7.3 ± 1.5 | 54.3 ± 3.9 | 99.45 ± 0.1 | 40.2 ± 3.2 | 93.9 ± 1.7
3-link | 1.1 ± 0.4 | 4.7 ± 1.1 | 18.8 ± 2.1 | 10.6 ± 0.8 | 20.9 ± 0.8 | 47.2 ± 1.7
TORCS | 27.4 ± 1.8 | 28.2 ± 1.9 | 60.7 ± 1.1 | 39.9 ± 2.2 | 54.1 ± 2.3 | 68.6 ± 0.4
Table 1: Percentage of steps in the goal state (mean ± standard deviation of the mean), averaged over all models (left) and for the best model (right).
Domain | PCC | PCC no Con | PCC no Cur | PCC Amor
Planar | 35.7 ± 3.4 | 0.0 ± 0.0 | 29.6 ± 3.5 | 41.7 ± 3.7
Pendulum | 58.7 ± 3.7 | 52.3 ± 3.5 | 50.3 ± 3.3 | 54.2 ± 3.1
Cartpole | 54.3 ± 3.9 | 5.1 ± 0.4 | 17.4 ± 1.6 | 14.3 ± 1.2
3-link | 18.8 ± 2.1 | 9.1 ± 1.5 | 13.1 ± 1.9 | 11.5 ± 1.8
Table 2: Ablation analysis. Percentage of steps spent in the goal state (mean ± standard deviation of the mean). From left to right: PCC with all loss terms, excluding the consistency loss, excluding the curvature loss, and amortizing the curvature loss.

Ablation Analysis In addition to comparing the performance of PCC to the baselines, in order to understand the importance of each component of (PCC-LOSS), we also perform an ablation analysis on the consistency loss (with/without consistency loss) and the curvature loss (with/without curvature loss, and with/without amortization of the Jacobian terms). Table 2 shows the ablation analysis of PCC on the aforementioned tasks. From the numerical results, one can clearly see that when the consistency loss is omitted, the control performance degrades. This corroborates the theoretical results in Section 3.2, which relate the consistency loss to the estimation error between the next-latent dynamics prediction and the next-latent encoding. This further implies that as the consistency term vanishes, the gap between the control objective function and the model training loss widens, due to the accumulation of state estimation error. The control performance also decreases when one removes the curvature loss. This is mainly attributed to the error between the iLQR control algorithm and (SOC2). Although the latent state dynamics model is parameterized with neural networks, which are smooth, without enforcing the curvature loss term the norm of the Hessian (curvature) may still be high. This also agrees with the analysis in Section 3.3 relating sub-optimality to the curvature of the latent dynamics. Finally, we observe that the performance of models trained without the amortized curvature loss is slightly better than that of their amortized counterparts; however, since the amortized curvature loss does not require computing gradients of the latent dynamics (which means that in stochastic optimization one does not need to estimate its Hessian), we observe relative speed-ups in model training with the amortized version (speed-ups of , , and for the Planar System, Inverted Pendulum, and Cartpole, respectively).

7 Conclusion

In this paper, we argue from first principles that learning a latent representation for control should be guided by good prediction in the observation space and by consistency between the latent transitions and the embedded observations. Furthermore, if variants of iterative LQR are used as the controller, low-curvature dynamics are desirable. All three elements of our PCC model are critical to the stability of model training and to the performance of the in-latent-space controller. We hypothesize that each particular choice of controller will exert different requirements on the learned dynamics. A future direction is to identify and investigate the additional biases needed to learn an effective embedding and latent dynamics for other types of model-based control and planning methods.

Appendix A Technical Proofs of Section 3

a.1 Proof of Lemma 3.1

Following a derivation analogous to that of Lemma 11 in Petrik et al. (2016), for the case of finite-horizon MDPs one has the following chain of inequalities for any given control sequence and initial observation :

where is the total variation distance between two distributions. The first inequality is based on the result of the above lemma, the second inequality is based on Pinsker’s inequality (Ordentlich and Weinberger, 2005), and the third inequality is based on Jensen’s inequality (Boyd and Vandenberghe, 2004) applied to the square-root function.

Now consider the expected cumulative KL cost: with respect to some arbitrary control action sequence . Notice that this arbitrary action sequence can always be expressed in the form of a deterministic policy with some non-stationary state-action mapping . Therefore, this KL cost can be written as:

(10)

where the expectation is taken over the state-action occupation measure of the finite-horizon problem induced by the data-sampling policy . The first inequality is due to the change of measures in policy, and the last inequality is due to the facts that (i) is a deterministic policy, (ii) is a sampling policy with Lebesgue measure over all control actions, and (iii) the following bound for the importance sampling factor holds: .

To conclude the first part of the proof, combining all the above arguments we have the following inequality for any model and control sequence :

(11)

For the second part of the proof, consider the solution of (SOC3), namely . Using the optimality condition of this problem one obtains the following inequality:

(12)

Using the results in (11) and (12), one can then show the following chain of inequalities:

(13)

where is the optimizer of (SOC1) and is the optimizer of (SOC3).

Therefore, by letting and , and combining all of the above arguments, the proof of the lemma is complete.

a.2 Proof of Lemma 3.2

For the first part of the proof, at any time-step , for any arbitrary control action sequence , and any model , consider the following decomposition of the expected cost :