Interplanetary Transfers via Deep Representations of the Optimal Policy and/or of the Value Function
Abstract
A number of applications have been recently proposed based on deep networks and interplanetary trajectories. These approaches often rely on the availability of a large number of optimal trajectories to learn from. In this paper we introduce a new method to quickly create millions of optimal spacecraft trajectories from a single nominal trajectory. Apart from the generation of the nominal trajectory, no additional optimal control problems need to be solved as all the trajectories, by construction, satisfy Pontryagin’s minimum principle and the relevant transversality conditions. We then consider deep feed forward neural networks and benchmark three learning methods on the created dataset: policy imitation, value function learning and value function gradient learning. Our results are shown for the case of the interplanetary trajectory optimization problem of reaching Venus orbit, with the nominal trajectory starting from the Earth. We find that both policy imitation and value function gradient learning are able to learn the optimal state feedback, while in the case of value function learning the optimal policy is not captured, only the final value of the optimal propellant mass is.
Interplanetary Transfers via Deep Representations of the Optimal Policy and/or of the Value Function
Dario Izzo^{†}^{†}thanks: Dario Izzo is the corresponding author, Ekin Öztürk, Marcus Märtens Advanced Concepts Team European Space Agency Noordwijk, 2201 AZ Netherlands dario.izzo@esa.int ekin.ozturk@esa.int marcus.maertens@esa.int
Keywords artificial neural networks, optimization, deep learning, trajectory design
1 Introduction
The use of deep neural networks (DNNs) for the guidance navigation and control of space systems (spacecraft, landers, etc..) is a prolific area of research as witnessed by the increasing number of results that appeared recently on these topics [9, 11, 3, 12, 4, 6, 14]. Most of this body of work can be divided into DNNs approximating the optimal policy \mathbf{u} (e.g. the optimal thrust profile), and DNNs approximating the value function v (e.g. optimal propellant consumption). The final goal of this line of research is to develop DNNs that can be part of the onboard software to steer the spacecraft and substitute the guidance and control loops currently used. Furthermore, accurate predictions of important quantities (e.g. propellant mass) for optimal transfers can provide advantages during preliminary design phases of trajectories where many options are typically screened and evaluated.
While the deployment of DNNs for trajectory design is promising, it has been shown over a large range of different application domains that the performance of DNNs depends heavily on the availability of suitable datasets for training. While most research literature is concerned with improving network architectures (number of layers, regularization, activation functions, etc.) or training procedures, in practice, it is often simply the lack of data that puts a limit on what any DNN architecture is capable of achieving.
The computation of a largescale dataset of trajectorie, including the optimal control profile and the value function is challenging, as a huge number of optimal control problems need to be solved. Since numerical instabilities and convergence issues are dominant, solution procedures are generally hard to automate and demanding in terms of computational resources.
The contribution of this paper is twofold: First, we show how to create a large dataset of optimal trajectories without the need to solve more than one optimal control problem. This decreases the dataset generation time, in comparison to previous methods, by orders of magnitude. Second, we use the dataset to learn several control policies for an interplanetary mission to reach the orbit of Venus.
Building on the work of [9, 12, 10], we propose three different learning tasks, which we call policy imitation, value function learning and value function gradients learning. We are able to approximate, in the former case, the optimal policy (i.e. the optimal control profile) and, in the latter two cases, the value function (i.e. the solution to the corresponding Hamilton Jacobi Bellman equation). In the last case we find that the learned value function can also be satisfactorily used to compute the optimal policy.
This work is structured as follows: Section 2 provides the details of the nominal transfer to Venus designed as a massoptimal trajectory by solving the two point boundary value problem resulting from Pontryagin’s minimum principle. Based on this nominal trajectory, Section 3 describes the new method for the generation of a large dataset of different trajectories to the Venus orbit whose massoptimality is still ensured by Pontryagin’s minimum principle. Section 4 describes different strategies for the training of DNNs on such a dataset. The resulting controllers are evaluated experimentally in Section 5 and we conclude in Section 6.
2 Background
We consider, in the International Celestial Reference Frame (ICRF) heliocentric frame, the motion of a spacecraft of mass m whose velocity and position we indicate with \mathbf{r} and \mathbf{v}. The spacecraft is equipped with an ion thruster having specific impulse I_{sp} and capable of delivering a maximum thrust c_{1} regardless of the spacecraft distance from the Sun. We describe the spacecraft state via its mass and the modified equinoctial elements \mathbf{x}=[p,f,g,h,k,L]^{T} as originally defined by Walker et al. [13].
The set of differential equations describing the spacecraft motion are then:
\begin{array}[]{l}\dot{p}=\sqrt{\frac{p}{\mu}}\frac{2p}{w}f_{t}\\ \dot{f}=\sqrt{\frac{p}{\mu}}\left\{f_{r}\sin L+\left[(1+w)\cos L+f\right]\frac% {f_{t}}{w}(h\sin Lk\cos L)\frac{g\cdot f_{n}}{w}\right\}\\ \dot{g}=\sqrt{\frac{p}{\mu}}\left\{f_{r}\cos L+\left[(1+w)\sin L+g\right]% \frac{f_{t}}{w}+(h\sin Lk\cos L)\frac{f\cdot f_{n}}{w}\right\}\\ \dot{h}=\sqrt{\frac{p}{\mu}}\frac{s^{2}f_{n}}{2w}\cos L\\ \dot{k}=\sqrt{\frac{p}{\mu}}\frac{s^{2}f_{n}}{2w}\sin L\\ \dot{L}=\sqrt{\frac{p}{\mu}}\left\{\mu\left(\frac{w}{p}\right)^{2}+\frac{1}{w}% \left(h\sin Lk\cos L\right)f_{n}\right\}\\ \dot{m}=\frac{\sqrt{f_{r}^{2}+f_{t}^{2}+f_{n}^{2}}}{I_{sp}g_{0}}\end{array}  (1) 
where, w=1+f\cos L+g\sin L and s^{2}=1+h^{2}+k^{2} and f_{r},f_{t},f_{n} are the radial, tangential and normal components of the force generated by the spacecraft propulsion system. The gravity parameter is denoted with \mu and the gravitational acceleration at sea level with g_{0}.
It is useful to rewrite the equations above using a more concise notation. We thus introduce the matrices \mathbf{B} and \mathbf{D} defined as:
\sqrt{\frac{\mu}{p}}\mathbf{B}(\mathbf{x})=\left[\begin{array}[]{ccc}0&\frac{2% p}{w}&0\\ \sin L&[(1+w)\cos L+f]\frac{1}{w}&\frac{g}{w}(h\sin Lk\cos L)\\ \cos L&[(1+w)\sin L+g]\frac{1}{w}&\frac{f}{w}(h\sin Lk\cos L)\\ 0&0&\frac{1}{w}\frac{s^{2}}{2}\cos L\\ 0&0&\frac{1}{w}\frac{s^{2}}{2}\sin L\\ 0&0&\frac{1}{w}(h\sin Lk\cos L)\\ \end{array}\right]  (2) 
and
\mathbf{D}(\mathbf{x})=\left[\begin{array}[]{cccccc}0&0&0&0&0&\sqrt{\mu p}% \left(\frac{w}{p}\right)^{2}\end{array}\right]  (3) 
and rewrite the equations of motion in the form:
\left\{\begin{array}[]{l}\dot{\mathbf{x}}=\frac{c_{1}u(t)}{m}\mathbf{B}(% \mathbf{x})\mathbf{\hat{i}_{\tau}}+\mathbf{D}(\mathbf{x})\\ \dot{m}=c_{2}u(t)\end{array}\right.  (4) 
where the spacecraft thrust is now indicated by c_{1}u\mathbf{\hat{i}_{\tau}}=[f_{r},f_{t},f_{n}]^{T} and bounded by u(t)\leq 1 and \mathbf{\hat{i}_{\tau}}(t)=1. We will use the notation \mathbf{u}\in\mathcal{U} to indicate the feasible control space. Furthermore, we define c_{2}=1/(I_{sp}\ g_{0}). Note that in order to steer the spacecraft, at each instant we control the throttle magnitude u(t)\in[0,1] and the thrust direction \mathbf{\hat{i}_{\tau}}. We refer to these control variables also with the single symbol \mathbf{u}=[u,\mathbf{\hat{i}_{\tau}}].
2.1 The LowThrust Problem
We consider here a free time orbital transfer problem, that is finding the controls u(t) and \mathbf{\hat{i}_{\tau}}(t) defined in [0,t_{f}] and the transfer time t_{f} so that the functional:
J(u(t),t_{f})=\int_{0}^{t_{f}}\left\{u\epsilon\log{[u(1u)]}\right\}dt  (5) 
is minimized, and the spacecraft is steered from its initial mass m_{0} and some initial point \mathbf{x}_{0} to some final mass m_{f} and some final point \mathbf{x}_{f}\in\mathcal{S}_{f}\subset\mathbb{R}^{6}. The functional J, following the work of Bertrand and Epenoy [2], is parameterised by a continuation parameter \epsilon\in[0,1] which activates a logarithmic barrier smoothing the problem and ensuring that the constraint u(t)\in[0,1] is always satisfied. Clearly for \epsilon\rightarrow 0 the functional becomes J=(m_{0}m_{f})/c_{2}, equivalent to minimizing the propellant mass.
2.2 Consequences of Pontryagin’s Minimum Principle
Following the work of Pontryagin [8], we can infer the necessary conditions for an optimal solution of this problem by applying Pontryagin’s minimum principle.^{1}^{1}1Note that we have stated a minimization problem, hence the conditions are actually slightly different from the ones originally derived in Pontryagin’s work.
Let us introduce the costates \boldsymbol{\lambda},\lambda_{m} as continuous functions defined in [0,t_{f}] and define the Hamiltonian:
\mathcal{H}(\mathbf{x},m,\boldsymbol{\lambda},\lambda_{m},\mathbf{u})=\frac{c_% {1}u}{m}\boldsymbol{\lambda}^{T}\mathbf{B}(\mathbf{x})\mathbf{\hat{i}_{\tau}}+% \lambda_{L}\sqrt{\frac{\mu}{p^{3}}}w^{2}c_{2}\lambda_{m}u+\left\{u\epsilon% \log[u(1u)]\right\}  (6) 
and the system of equations:
\left\{\begin{array}[]{l}\dot{\mathbf{x}}=\frac{\partial\mathcal{H}}{% \boldsymbol{\lambda}}=\frac{c_{1}u(t)}{m}\mathbf{B}\mathbf{\hat{i}_{\tau}}(t)+% \mathbf{D}\\ \dot{m}=\frac{\partial\mathcal{H}}{\partial\lambda_{m}}=c_{2}u(t)\\ \dot{\boldsymbol{\lambda}}=\frac{\partial\mathcal{H}}{\partial\mathbf{x}}\\ \dot{\lambda}_{m}=\frac{\partial\mathcal{H}}{\partial m}\\ \end{array}\right.  (7) 
The explicit form of the various derivatives appearing in the equations above is reported in Appendix A.
Along an optimal trajectory, the Hamiltonian must be zero (free terminal time problem) and minimal w.r.t. the choices of u and \mathbf{\hat{i}_{\tau}}. For the optimal thrust direction \mathbf{\hat{i}_{\tau}}^{*} it follows that
\mathbf{\hat{i}_{\tau}}=\mathbf{\hat{i}_{\tau}}^{*}(t)=\frac{\mathbf{B}^{T}% \boldsymbol{\lambda}}{\mathbf{B}^{T}\boldsymbol{\lambda}}  (8) 
where the time dependence on the right hand side follows both from the costates and from \mathbf{B} and has not been indicated explicitly for brevity. While for the optimal throttle u^{*}, necessarily:
u(t)=u^{*}(t)=\frac{2\epsilon}{2\epsilon+SF(t)+\sqrt{4\epsilon^{2}+SF(t)^{2}}}  (9) 
where we introduce a switching function
SF(t)=1\frac{c_{1}}{m}\mathbf{B}^{T}\boldsymbol{\lambda}c_{2}\lambda_{m}.  (10) 
2.2.1 The two point boundary value problem
The test case considered in this work is that of a transfer from any \mathbf{x}_{0},m_{0} to Venus’ orbit (not a rendezvous). In this case the final value for the mass and the final value for the anomaly L are left free so that some transversality conditions apply (\lambda_{L_{f}}=0,\lambda_{m_{f}}=0). For any starting state \mathbf{x}_{0} and m_{0}, we need to choose the values for \boldsymbol{\lambda}_{0} and \lambda_{m_{0}} and the time t_{f} such that solving the initial value problem for Eq.(7) results in a final state that matches the arrival conditions at Venus orbit, the transversality conditions and the freetime condition on the Hamiltonian. Formally, we introduce the shooting function:
\phi(\boldsymbol{\lambda}_{0},\lambda_{m_{0}},t_{f})=[p_{f}p_{V},f_{f}f_{V},% g_{f}g_{V},h_{f}h_{V},k_{f}k_{V},\lambda_{L_{f}},\lambda_{m_{f}},\mathcal{H% }_{f}]  (11) 
where we indicate with a subscript V the modified equinoctial elements of Venus orbit and with the subscript f the final values of the modified equinoctial elements resulting from numerically integrating Eq.(7) for a time t_{f} and from the initial conditions \mathbf{x}_{0},m_{0},\boldsymbol{\lambda}_{0},\lambda_{m_{0}}. We have thus transformed our optimal control problem into the problem: \phi(\boldsymbol{\lambda}_{0},\lambda_{m_{0}},t_{f})=0.
2.2.2 The nominal trajectory
We consider a spacecraft with mass m_{0}=1500 [kg], a nuclear electric propulsion system specified by I_{sp}=3800 [sec] and c_{1}=0.3 [N]. We look for the optimal transfer from the Earth to Venus orbit with a launch on the 7th of May 2005. The planet ephemerides are computed using JPL lowprecision ephemerides. We solve the optimal control problem by solving the problem \phi(\boldsymbol{\lambda}_{0},\lambda_{m_{0}},t_{f})=0. Note that this is, essentially, a system of eight equations in eight unknowns and can be solved by root finding methods (e.g. Powell, LevenbergMarquardt) as well as by SQP or interior point methods (e.g. SNOPT or IPOPT). As it is well known (see [5] for example), the convergence radius for this problem can get rather small, to the point that if we were to try to directly solve the mass optimal problem (i.e. \epsilon=0) we would fail consistently as almost any initial guess on the costate would not converge. However, solving the problem for \epsilon=0.1 is reasonably simple as convergence is frequently achieved when starting with random costates (we sampled them from a uniform distribution with a standard deviation of 10). Note that we use non dimensional units for the state so that the astronomical unit AU is used for length, the spacecraft initial mass for mass, and the rest is set as to get \mu=1.. Gradually decreasing \epsilon from 0.1 down to 10^{6} (as visualized in Figure 1), allows us to obtain the final mass optimal trajectory which is visualized in Figure 2. We refer to the final trajectory (with \epsilon=10^{6}) as the nominal trajectory in the following. The nominal trajectory reaches the orbit of Venus after t^{*}_{f}=1.376 [years] and spends m_{p}^{*}=210.47 [kg] of propellant.
2.3 Consequences of Bellman’s Principle of Optimality
For \epsilon=0, we can apply Bellman’s principle of optimality [1] to the optimal control problem that we have stated in the previous section. We indicate with v(\mathbf{x},m) the value function, i.e. the optimal value of the functional defined by Eq.(5) for any spacecraft state \mathbf{x},m_{0}. Since the value function is timeindependent, the Hamilton Jacobi Bellman (HJB) equations can be written as:
\displaystyle 0=\min_{\mathbf{u}\in\mathcal{U}}(u+\nabla v\cdot\mathbf{f})  (12)  
\displaystyle\mathbf{u}=\operatorname*{arg\,min}_{\mathbf{u}\in\mathcal{U}}(u+% \nabla v\cdot\mathbf{f}).  (13) 
These equations hold in all points where v(\mathbf{x},m) is differentiable. We use \mathbf{f} to denote the right hand side of Eq.(4) including the mass equation. Comparing Eq.(12) and Eq.(6), it follows that \nabla v=[\boldsymbol{\lambda},\lambda_{m}] and thus \mathcal{H}=u+\nabla v\cdot\mathbf{f}. Thus, the costates of Pontryagin’s theory are the gradients of the value function introduced in Bellman’s theory. This fact, albeit rarely exploited in interplanetary trajectory optimization research, provides a convenient basis for the design of new learning procedures, which we introduce and evaluate later in Section 4.
3 Building a Dataset of Optimal Trajectories
3.1 Reducing the Cost of Computation
The availability of largescale datasets is essential to train DNNs. This section shows how to create a large number of optimal trajectories without the need to solve any particular optimal control problem except for the nominal trajectory. The bruteforce approach for computing a dataset of optimal trajectories is to decide on a number of interesting initial states and then solve the corresponding optimal control problem for each( e.g. following all the steps outlined in Section 2). This approach scales extremely poorly because of the known computational difficulties associated with solving optimal control problems, both using direct and indirect methods.
Previous work (e.g. Sanchez and Izzo [10, 9]) deployed a continuation (homotopy) approach to reduce some of the complexity involved by eliminating the need to search for a new initial guess all the time: by perturbing the initial state of a nominal trajectory, the unperturbed costate provides (most of the time) a good initial guess to solve the newly created two points boundary value problem. However, assuming that no convergence issues occur, it is still necessary to solve Eq.(11) for the new initial conditions considered, which incurs a significant computational cost.
In the following, we describe a more efficient way to obtain a similar result avoiding entirely convergence issues, guaranteeing optimality and reducing the computational costs down to the cost of a single numerical propagation of Eq.(7) for a new trajectory. The idea, surprisingly simple and applicable more generally than this Venus orbit acquisition task, is to perform a backward in time propagation of Eq.(7) starting from suitably perturbed final values of the state and costates of the nominal trajectory. This perturbation needs to be chosen such that the tranversality conditions and the condition on the Hamiltonian are still satisfied.
Formally, consider the nominal optimal trajectory and indicate with \mathbf{x}_{f},m_{f},\boldsymbol{\lambda}_{f},\lambda_{m_{f}} the final values (i.e. at t^{*}_{f}) of the states and the costates. Consider the new set of final costate values:
\boldsymbol{\lambda}^{\prime}_{f}=\boldsymbol{\lambda}_{f}+\delta\boldsymbol{\lambda}  (14) 
where the perturbation \delta\boldsymbol{\lambda} is in some ball B_{\rho}\in\mathbb{R}^{7} of size \rho such that the transversality condition on the anomaly (\delta\lambda_{L}=0,\ \delta\lambda_{m}=0) is satisfied. Note that the trajectory that results from propagating backward in time Eq.(7) from the new final conditions \mathbf{x}_{f},m_{f},\boldsymbol{\lambda}^{\prime}_{f},\lambda_{m_{f}} is fulfilling all of Pontryagin’s necessary conditions for optimality, except \mathcal{H}_{f}=0. It is possible to find new final values m^{\prime}_{f}=m_{f}+\delta m and L^{\prime}_{f}=L_{f}+\delta L (e.g. with a simple root finding algorithm) so that the condition on the Hamiltonian is fulfilled. Thus the new trajectory resulting from propagating backward in time Eq.(7) from the conditions \mathbf{x}^{\prime}_{f},m^{\prime}_{f},\boldsymbol{\lambda}^{\prime}_{f},% \lambda_{m_{f}} is now fulfilling all of Pontryagin’s necessary conditions for optimality. Such a trajectory will neither end where the nominal trajectory ends, nor will it start from where the nominal trajectory starts, but it is nevertheless optimal (with respect to Eq. 5) and represents a valid sample to learn from. This procedure reduces the cost of computing one more valid training sample to that of a backward in time integration of Eq.(7) plus the (negligible) cost of a single root finding call to solve \mathcal{H}_{f}(m^{\prime}_{f},L^{\prime}_{f})=0.
As we only modify m_{f} and L, our final state is always constrained to be on the orbit of Venus, and since our perturbations for the final costates are small the timing and direction of thrust maneuvers are close to the nominal trajectory during the backward integration. Together, these imply that it is unlikely, if not impossible, for the final state of the backward integration to go significantly beyond the orbit of Earth during an integration period of t^{*}_{f}=1.376\ [years]. This is confirmed when looking at the dataset in the next subsection.
3.2 The dataset
To build the particular dataset used in this work, we start from the nominal trajectory of Figure 1 and generate 500 000 perturbations of the arrival state \mathbf{x}^{\prime}_{f},m^{\prime}_{f},\boldsymbol{\lambda}^{\prime}_{f} using \rho=0.1 as a perturbation radius. Only perturbations for which the root finder for \mathcal{H}_{f}(L^{\prime}_{f},m^{\prime}_{f})=0 converged are kept. These valid final states are then integrated backwards in time for t_{f}=1.376\ [\textrm{years}]. At 100 equispaced points in time, the state \mathbf{x}^{\prime},m^{\prime},\boldsymbol{\lambda}^{\prime} and the optimal policy u*,\mathbf{i}_{\tau}^{*} are sampled and stored in the dataset. Figure 3 shows an overlapping plot of these trajectories.
During the dataset generation, 453 212 feasible perturbations were found. Thus the final dataset consists of 45 321 200 samples which took approx. 12 hours to generate using 20 threads and 25 000 perturbations per thread on an Intel Xeon E52650L v4 processor at 1.70 GHz. Thus, we were able to generate a new optimal trajectory approximately every 1.7 seconds. The bulk of the computational time is spent to integrate Eq.(7) numerically with a relative error tolerance of 10^{13} and an absolute error tolerance of 10^{13} using the LSODA integrator. In Figure 3 we visualize the trajectory dataset in its entirety. Note that all trajectories we will learn from are thus time free, optimal control trajectories.
4 Experiments
Using the dataset constructed in the previous section, our goal is to evaluate if it is possible to learn a control policy and/or the value function by a deep neural network.
4.1 Ideas Behind Our Experiments
Since this goal can take many forms, we are particularly interested in the following learning tasks:

Value function learning: In the second experiment we train a network \mathcal{N}_{v}(\mathbf{x},m) to predict the value function v(\mathbf{x},m)=m_{p}/c_{2} (i.e. the optimal propellant mass required to get to Venus’ orbit). The loss function used during training and evaluated on a batch of size b is \ell=\sum_{i=1}^{b}v(\mathbf{x}_{i},m_{i})\mathcal{N}_{v}(\mathbf{x}_{i},m_{% i})^{2}. Once the network is trained, the Hamilton Jacobi Bellman equation Eq.(13) is used to derive the policy \mathbf{u}(\mathbf{x},m). Besides providing a way to compute \mathbf{u}(\mathbf{x},m), this network can also be used for predicting the propellant consumption in preliminary design phases.

Value function gradients learning: In the third experiment we train a network \mathcal{N}_{\nabla v}(\mathbf{x},m) to predict the value function v(\mathbf{x},m) and enforce the constraint that \left[\partial_{\mathbf{x}}\mathcal{N}_{\nabla v},\partial_{m}\mathcal{N}_{% \nabla v}\right]=\nabla v=\left[\boldsymbol{\lambda},\lambda_{m}\right]. The idea being that gains are to be achieved, with respect to the previous case, as Eq.(13) contains the value function gradient which is here directly learned. The loss function used during training and evaluated on a batch of size b is thus:
\displaystyle\ell=\sum_{i=1}^{b}\left\{v(\mathbf{x}_{i},m_{i})\mathcal{N}_{% \nabla v}(\mathbf{x}_{i},m_{i})^{2}+\left\boldsymbol{\lambda}_{i}\frac{% \partial\mathcal{N}_{\nabla v}}{\partial\mathbf{x}}(\mathbf{x}_{i},m_{i})% \right^{2}+\left\lambda_{m_{i}}\frac{\partial\mathcal{N}_{\nabla v}}{% \partial m}(\mathbf{x}_{i},m_{i})\right^{2}\right\} (15)
4.2 Network Architecture and Training
The network architectures used for the three experiments are mostly similar except for the depth and the output layers. In all cases, the networks take as input the state \left[\mathbf{x},\ m\right] and consist of n hidden layers with softplus activations between each layer (following the results in [12]).
The policy imitation (experiment 1) network consists of n=4 hidden layers with 100 units. The output layer has a sigmoid activation and 4 units (throttle and thrust vector).
The value function networks (experiment 2 and 3) consist of n=9 hidden layers with 100 units. The output layer has a linear activation and 1 unit. The reason for a linear activation is to avoid normalizing the output, which would also effect the computed gradients used in the loss function.
The network output can then be differentiated with respect to its inputs to estimate the value function gradients necessary to derive the control policy from Eq.(13). In the case of the value function network (experiment 2), these gradients are only used after training to compute the policy. For the value function gradients network (experiment 3), the mean square error of the estimated value function gradients with respect to the costates is added to the lossfunction of the network. Thus the training for the value function gradient network is minimizing both the error of the valuefunction and its corresponding gradients.
In terms of the training set, the policy network was trained on a normalized dataset where the normalization consists of a standard scaling (i.e. removing the mean and scaling to unit variance). The targets u and \mathbf{i}_{\tau} have been scaled to [0,1] in order to match the output of the last sigmoid activations. During preliminary experiments, we found that for the value function network and the value function gradient network, a better performance was achieved by omitting any normalization of the data.
We split the data into a train, validation and test set by a ratio of 0.8/0.1/0.1. All networks were trained on minibatches of size b=8192 using the Amsgrad optimizer with an initial learning rate of 10^{5} for 300 epochs. For each experiment, we select the network with the lowest mean square error on the validation set for all further computations presented in this work.
5 Results
We evaluate the quality of the trained networks by two different methods. First, we test the network performance on the test set, containing optimal state control pairs from trajectories never seen during training. Second, and more importantly, we test how well the network performs when used to control the spacecraft dynamics, as computed by integrating the equations of motion in Eq.(4) forward in time starting from the initial state of the nominal trajectory (which was excluded from the training, validation and test sets) using the policies resulting from the network for \mathbf{u}.
5.1 Evaluation of Control Predictions on the Test Set
controller  u  i_{r}  i_{t}  i_{n} 

\mathcal{N}_{\mathbf{u}}  4.1\cdot 10^{3}\pm 1.6\cdot 10^{3}  9.4\cdot 10^{4}\pm 2.0\cdot 10^{6}  5.4\cdot 10^{4}\pm 8.8\cdot 10^{7}  2.1\cdot 10^{3}\pm 5.0\cdot 10^{5} 
\mathcal{N}_{v}  3.8\cdot 10^{1}\pm 2.3\cdot 10^{1}  1.8\cdot 10^{2}\pm 1.8\cdot 10^{4}  3.7\cdot 10^{2}\pm 7.3\cdot 10^{4}  1.6\cdot 10^{1}\pm 5.6\cdot 10^{3} 
\mathcal{N}_{\nabla v}  6.7\cdot 10^{2}\pm 6.1\cdot 10^{2}  3.0\cdot 10^{3}\pm 6.2\cdot 10^{6}  8.5\cdot 10^{4}\pm 7.3\cdot 10^{7}  2.6\cdot 10^{3}\pm 4.7\cdot 10^{6} 
For each sample of the test set we compute the optimal policy prediction from the networks (u and \mathbf{\hat{i}_{\tau}}) and thus the mean and standard deviation of the absolute error. In Table 1, the values for each of the control components are shown. The policy network displays the best performance in terms of the u, i_{r} and i_{t} components. On the other hand, the value function gradients network outperforms the policy network in terms of the i_{n} component with regards to the spread in errors across the test set despite an overall higher mean absolute error. The value function network without gradient learning, while approximating the value function reasonably well, fails to reconstruct the optimal policy. This suggests that learning the gradients in addition to the value function is necessary for the reconstruction of an optimal control profile.
A more detailed view of the error distribution is shown in Figure 6 for the throttle and Figure 7 for the thrust direction, confirming that the value function gradients network misthrottles (i.e. throttling when it should not and not throttling when it should) on average more frequently than the policy network. It is interesting to note, again, that in terms of the distribution of errors in the thrust direction, the value function gradients network achieves a narrower distribution of errors centered around 0 for the u_{n} component of the thrust.
5.2 Closing the Gap to Venus Orbit
controller  m_{f} [kg]  \Delta m [kg]  m_{f}+\Delta m [kg] 

optimal  210.4742  –  210.4742 (+ 0.0000) 
\mathcal{N}_{\mathbf{u}}  210.3197  0.1858  210.5055 (+ 0.0313) 
\mathcal{N}_{v}  151.1732  129.7239  280.8971 (+ 80.7503) 
\mathcal{N}_{\nabla v}  205.4894  5.6123  211.1017 (+ 0.6275) 
Next, we compare how well the trained neural networks are able to reproduce the nominal trajectory (which was not part of the dataset used for training) during a forward in time numerical integration of the equation of motion stated in Eq.(4). Starting from the nominal initial conditions, we thus predict the control actions using each network and integrate forward in time for a period of t_{f}^{*}=1.376 [years] (the time needed by the nominal trajectory to reach the orbit of Venus). Figure 8 shows the distance from the orbit of Venus obtained in the three cases. We observe that \mathcal{N}_{\mathbf{u}} and \mathcal{N}_{\nabla v} come closest to Venus orbit while \mathcal{N}_{v} is comparatively far away.
For a fair comparison between all three strategies with respect to the nominal trajectory, we correct all orbits (in a massoptimal way) to match the exact orbit of Venus and add the corresponding propellant mass (\Delta m) needed for this corrective maneuver to the propellant mass (m_{f}) spent by the controller. Consequently, we solve the optimal control problem for reaching Venus from each of the final orbits reached by the networks. The results are reported in Table 2. We note that the additional mass needed by \mathcal{N}_{\mathbf{u}} and \mathcal{N}_{\nabla v} to close the gap to Venus orbit is below 1 [kg].
6 Conclusions and Discussion
The results presented show that our strategy for the generation of optimal trajectories is able to provide datasets of suitable size for supervised training of deep neural networks. The major benefit of this technique is the low computational cost and its reliability, as the usage of complex solvers for optimal control problems can be avoided in favor of a much simpler numerical integration procedure. For the test case of an interplanetary transfer to Venus orbit, we find that the small disturbances \rho, which we applied to the final costates, allowed us to generate enough variety within the dataset to learn reliable control policies by imitation learning. While the policy network is structurally simple and comparatively small in size, the achieved precision is promising and provides a first step towards the goal of achieving onboard optimal guidance and control by a such a neural network.
Additionally to the prediction of the optimal control policy, the value networks are able to estimate the value function up to high precision and thus provide an approximation to the massbudget needed for optimal transfers from different starting points. If the error of the gradients of the value function is incorporated during training, a control profile equally good as the imitation learning can be obtained.
While these preliminary results are encouraging, we plan to further study the potential of this approach for designing more ambitious transfers. In particular, the perturbation parameter of the final costates \rho is the key for this approach as it significantly impacts the resulting trajectory dataset, its needed size and, subsequently, the quality and usability of the trained networks. Limiting \rho to a comparatively small value allowed the trajectories in our dataset to resemble the nominal trajectory closely. However, the investigation of larger values of \rho is of great interest and remains for future work. The dataset size and its relation to \rho is also of great interest, since in this work they were both selected in a somewhat arbitrary fashion.
Sampling of the trajectories at equally spaced points in time resulted in a bias of the dataset, as it leads to introduce more sample points at larger distances from the target orbit. Although this did not pose a problem during our experiments, a different sampling strategy could easily reduce the bias and would be most likely beneficial for training on challenging transfers. Furthermore, foregoing the computation of a nominal trajectory seems entirely possible and worth investigating, for example by generating the final costates at random (and not perturbing some initial values).
Lastly we note that for the value function gradient learning as presented here, the costates need to be known (i.e. part of the dataset). While our approach is able to deliver the costates for each trajectory, a direct method for solving the optimal control problem would not. In such cases we hypothesize that adding Eq.(12) to the loss function might improve the quality of the controller even if information on the costates is not available.
Appendix A Appendix
This appendix contains the explicit forms of all the derivatives necessary to write Eq.(7) explicitly.
The \dot{\lambda}_{p} equation
We get:
\displaystyle\dot{\lambda}_{p}  \displaystyle=\frac{c_{1}u}{m}\boldsymbol{\lambda}^{T}\frac{\partial\mathbf{B% }(\mathbf{x})}{\partial p}\mathbf{i}_{\tau}w^{2}\lambda_{L}\frac{\partial}{% \partial p}\sqrt{\frac{\mu}{p^{3}}}  (16)  
\displaystyle=\frac{c_{1}u}{m}\boldsymbol{\lambda}^{T}\frac{\partial\mathbf{B% }(\mathbf{x})}{\partial p}\mathbf{i}_{\tau}+\frac{3}{2}w^{2}\lambda_{L}\sqrt{% \frac{\mu}{p^{5}}} 
\displaystyle 2\sqrt{\mu p}\frac{\partial\mathbf{B}(\mathbf{x})}{\partial p}=% \\ \displaystyle=\left[\begin{array}[]{ccc}\begin{array}[]{ccc}0&\frac{6p}{w}&0\\ \sin L&[(1+w)\cos L+f]\frac{1}{w}&\frac{g}{w}(h\sin Lk\cos L)\\ \cos L&[(1+w)\sin L+g]\frac{1}{w}&\frac{f}{w}(h\sin Lk\cos L)\\ 0&0&\frac{1}{w}\frac{s^{2}}{2}\cos L\\ 0&0&\frac{1}{w}\frac{s^{2}}{2}\sin L\\ 0&0&\frac{1}{w}(h\sin Lk\cos L)\\ \end{array}\end{array}\right]  (17) 
The \dot{\lambda}_{f} equation
We get:
\displaystyle\dot{\lambda}_{f}  \displaystyle=\frac{c_{1}u}{m}\boldsymbol{\lambda}^{T}\frac{\partial\mathbf{B% }(\mathbf{x})}{\partial f}\mathbf{i}_{\tau}2\lambda_{L}w\sqrt{\frac{\mu}{p^{3% }}}\frac{\partial w}{\partial f}  (18)  
\displaystyle=\frac{c_{1}u}{m}\boldsymbol{\lambda}^{T}\frac{\partial\mathbf{B% }(\mathbf{x})}{\partial f}\mathbf{i}_{\tau}2\lambda_{L}w\sqrt{\frac{\mu}{p^{3% }}}\cos L 
\displaystyle w^{2}\sqrt{\frac{\mu}{p}}\frac{\partial\mathbf{B}(\mathbf{x})}{% \partial f}=\\ \displaystyle=\left[\begin{array}[]{ccc}\begin{array}[]{ccc}0&2p\cos L&0\\ 0&w(\cos L+f)\cos L&g\cos L(h\sin Lk\cos L)\\ 0&(\sin L+g)\cos L&(wf\cos L)(h\sin Lk\cos L)\\ 0&0&\frac{s^{2}}{2}\cos^{2}L\\ 0&0&\frac{s^{2}}{2}\sin L\cos L\\ 0&0&(h\sin Lk\cos L)\cos L\\ \end{array}\end{array}\right]  (19) 
The \dot{\lambda}_{g} equation
We get:
\displaystyle\dot{\lambda}_{g}  \displaystyle=\frac{c_{1}u}{m}\boldsymbol{\lambda}^{T}\frac{\partial\mathbf{B% }(\mathbf{x})}{\partial g}\mathbf{i}_{\tau}2\lambda_{L}w\sqrt{\frac{\mu}{p^{3% }}}\frac{\partial w}{\partial g}  (20)  
\displaystyle=\frac{c_{1}u}{m}\boldsymbol{\lambda}^{T}\frac{\partial\mathbf{B% }(\mathbf{x})}{\partial g}\mathbf{i}_{\tau}2\lambda_{L}w\sqrt{\frac{\mu}{p^{3% }}}\sin L 
\displaystyle w^{2}\sqrt{\frac{\mu}{p}}\frac{\partial\mathbf{B}(\mathbf{x})}{% \partial g}=\\ \displaystyle=\left[\begin{array}[]{ccc}0&{2p}\sin L&0\\ 0&(\cos L+f)\sin L&(wg\sin L)(h\sin Lk\cos L)\\ 0&w(\sin L+g)\sin L&f\sin L(h\sin Lk\cos L)\\ 0&0&\frac{s^{2}}{2}\cos L\sin L\\ 0&0&\frac{s^{2}}{2}\sin^{2}L\\ 0&0&(h\sin Lk\cos L)\sin L\\ \end{array}\right]  (21) 
The \dot{\lambda}_{h} equation
We get:
\displaystyle\dot{\lambda}_{h}  \displaystyle=\frac{c_{1}u}{m}\boldsymbol{\lambda}^{T}\frac{\partial\mathbf{B% }(\mathbf{x})}{\partial h}\mathbf{i}_{\tau}  (22) 
\frac{\partial\mathbf{B}(\mathbf{x})}{\partial h}=\sqrt{\frac{p}{\mu}}\left[% \begin{array}[]{ccc}0&0&0\\ 0&0&\frac{g}{w}\sin L\\ 0&0&\frac{f}{w}\sin L\\ 0&0&\frac{h}{w}\cos L\\ 0&0&\frac{h}{w}\sin L\\ 0&0&\frac{1}{w}\sin L\\ \end{array}\right] 
The \dot{\lambda}_{k} equation
We get:
\displaystyle\dot{\lambda}_{k}  \displaystyle=\frac{c_{1}u}{m}\boldsymbol{\lambda}^{T}\frac{\partial\mathbf{B% }(\mathbf{x})}{\partial k}\mathbf{i}_{\tau}  (23) 
\frac{\partial\mathbf{B}(\mathbf{x})}{\partial k}=\sqrt{\frac{p}{\mu}}\left[% \begin{array}[]{ccc}0&0&0\\ 0&0&\frac{g}{w}\cos L\\ 0&0&\frac{f}{w}\cos L\\ 0&0&\frac{k}{w}\cos L\\ 0&0&\frac{k}{w}\sin L\\ 0&0&\frac{1}{w}\cos L\\ \end{array}\right] 
The \dot{\lambda}_{L} equation
We get:
\dot{\lambda}_{L}=\frac{c_{1}u}{m}\boldsymbol{\lambda}^{T}\frac{\partial% \mathbf{B}(\mathbf{x})}{\partial L}\mathbf{i}_{\tau}2w\sqrt{\frac{\mu}{p^{3}}% }\lambda_{L}w_{L}  (24) 
where,
\displaystyle w^{2}\sqrt{\frac{\mu}{p}}\frac{\partial\mathbf{B}(\mathbf{x})}{% \partial L}=\\ \displaystyle\left[\begin{array}[]{ccc}0&2p(g\cos Lf\sin L)&0\\ w^{2}\cos L&(1+w)w\sin Lw_{L}(\cos L+f)&((wh+w_{L}k)\cos L+(wkw_{L}h)\sin L% )g\\ w^{2}\sin L&(1+w)w\cos Lw_{L}(\sin L+g)&((wh+w_{L}k)\cos L+(wkw_{L}h)\sin L)% f\\ 0&0&\frac{s^{2}}{2}(w\sin L+w_{L}\cos L)\\ 0&0&\frac{s^{2}}{2}(w\cos Lw_{L}\sin L)\\ 0&0&{(wh+w_{L}k)\cos L+(wkw_{L}h)\sin L}\\ \end{array}\right]  (25) 
where,
w_{L}=\frac{\partial w}{\partial L}=g\cos Lf\sin L  (26) 
The \dot{\lambda}_{m} equation
We get:
\displaystyle\dot{\lambda}_{m}  \displaystyle=\frac{c_{1}u}{m^{2}}\boldsymbol{\lambda}^{T}\mathbf{B}(\mathbf% {x})  (27) 
References
 [1] Bellman, R. Dynamic programming. Science 153, 3731 (1966), 34–37.
 [2] Bertrand, R., and Epenoy, R. New smoothing techniques for solving bang–bang optimal control problems–numerical results and statistical interpretation. Optimal Control Applications and Methods 23, 4 (2002), 171–197.
 [3] Cheng, L., Wang, Z., Song, Y., and Jiang, F. Realtime optimal control for irregular asteroid landings using deep neural networks. arXiv preprint arXiv:1901.02210 (2019).
 [4] Furfaro, R., Bloise, I., Orlandelli, M., and Di Lizia, P. Deep learning for autonomous lunar landing. In 2018 AAS/AIAA Astrodynamics Specialist Conference (2018), pp. 1–22.
 [5] Haberkorn, T., Martinon, P., and Gergaud, J. Low thrust minimumfuel orbital transfer: a homotopic approach. Journal of Guidance, Control, and Dynamics 27, 6 (2004), 1046–1060.
 [6] Izzo, D., Märtens, M., and Pan, B. A survey on artificial intelligence trends in spacecraft guidance dynamics and control. arXiv preprint arXiv:1812.02948 (2018).
 [7] Izzo, D., Sprague, C., and Tailor, D. Machine learning and evolutionary techniques in interplanetary trajectory design. arXiv preprint arXiv:1802.00180 (2018).
 [8] Pontryagin, L., Boltyanskii, V., Gamkrelidze, R., and Mishchenko, E. The Mathematical theory of optimal processes. Wiley & Sons, 1962.
 [9] SánchezSánchez, C., and Izzo, D. Realtime optimal control via deep neural networks: study on landing problems. Journal of Guidance, Control, and Dynamics 41, 5 (2018), 1122–1135.
 [10] SánchezSánchez, C., Izzo, D., and Hennes, D. Learning the optimal statefeedback using deep networks. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI) (2016), IEEE, pp. 1–8.
 [11] Sprague, C. I., Izzo, D., and Ögren, P. Learning a family of optimal state feedback controllers. arXiv preprint arXiv:1902.10139 (2019).
 [12] Tailor, D., and Izzo, D. Learning the optimal statefeedback via supervised imitation learning. arXiv preprint arXiv:1901.02369 (2019).
 [13] Walker, M. J. H., Ireland, B., and Owens, J. A set of modified equinoctial orbit elements. Celestial Mechanics 36 (Aug. 1985), 409–419.
 [14] Zhu, Y.h., and Luo, Y.z. Fast approximation of optimal perturbed manyrevolution multipleimpulse transfers via deep neural networks. arXiv preprint arXiv:1902.03741 (2019).