Inverse Optimal Control from Demonstration Segments


This paper develops an inverse optimal control method to learn an objective function from segments of demonstrations. Here, each segment is a part of an optimal trajectory within any time interval of the horizon. The unknown objective function is parameterized as a weighted sum of given features with unknown weights. The proposed method shows that each trajectory segment can be transformed into a linear constraint on the unknown weights, and all available segments are then incrementally incorporated to solve for the unknown weights. The effectiveness of the proposed method is shown on a simulated 2-link robot arm and a 6-DoF maneuvering quadrotor system, in each of which only segment data of the system’s trajectory are available.

I Introduction

With the capability of recovering an objective function of an optimal control system from observations of the system’s trajectories, inverse optimal control (IOC) has been widely applied in imitation learning [1], where a learner mimics an expert by learning the expert’s underlying objective function, autonomous vehicles [12], where human driver’s driving preference is learned and transferred to vehicle controllers, and human-robot interactions [16], where an objective function of human motor control is inferred to enable efficient prediction and coordination.

Existing IOC methods usually assume that the unknown objective function can be parameterized as a linear combination of selected features (or basis functions) [19, 18]. Here, each feature characterizes one aspect of the performance of the system operation, such as energy cost, time consumption, or risk level. The goal of IOC then becomes estimating the unknown weights for those features [8]. The authors of [22, 23, 25, 26, 10] have adopted a double-layer architecture, where the estimate of the weights is updated in an outer layer while the corresponding optimal trajectory is generated by solving the optimal control problem in an inner layer. Techniques based on the double-layer framework usually suffer from high computational cost, since optimal control problems need to be solved repeatedly [9]. More recent IOC techniques leverage the optimality conditions that the observed optimal trajectory must satisfy, so that the unknown weights can be obtained directly by solving the established optimality equations. Related work along this direction includes [11, 21, 7], where Karush-Kuhn-Tucker conditions are used, and [17], where Pontryagin’s maximum principle [20] is used.

Despite the significant progress described above, most existing IOC methods cannot learn the objective function unless a complete system trajectory over the entire time horizon is observed. This requirement limits their applicability in cases where only incomplete trajectory data, or even sparse data, is available, for example, due to limited sensing capability, sensor failures, or occlusion [5, 2]. In [2], given sparse corrections (demonstrations), the authors create an intended full-horizon trajectory from the sparse data using trajectory shaping/interpolation [6], in order to utilize the maximum margin IOC approach [22]. Although successful in learning from human corrections, the artificially created trajectory might not exactly reflect the actual trajectory of a human expert. In [4], the authors model the missing data using a probability distribution, and both the objective function and the missing part are then learned under the expectation-maximization framework. Besides its high computational cost, this work does not analyze how the percentage of missing information affects learning performance. In the recent work [8], the notion of a recovery matrix has been introduced to solve IOC using incomplete observations, but it still requires the observation data to be consecutive and long enough to satisfy the recovery condition.

In recognition of the above limitations, this paper aims to develop an approach that learns the objective function directly from available demonstration segments, without attempting to characterize the missing information. By demonstration segments, we refer to a collection of segments of the system’s trajectory of states and inputs in any time intervals of the horizon; we allow a segment to be a single data point, i.e., a state/input at a single time instant. Since each segment may not be sufficient to determine the objective function by itself, an incremental approach is developed to incorporate all available segments and obtain an estimate of the unknown weights of the objective function.


The column operator stacks its (vector) arguments into a column. denotes a stack of multiple from to (), that is, . (bold-type) denotes a block matrix. Given a vector function and a constant , denotes the Jacobian matrix with respect to evaluated at . Zero matrix/vector is denoted as , and identity matrix as , both with appropriate dimensions. denotes the transpose of matrix .

II Problem Statement

Consider an optimal control system with discrete-time dynamics and initial condition as follows:


where vector function is differentiable; denotes the system state; is the control input; and is the time step. Let


denote a trajectory of system states and inputs in a time horizon . Consider that the system trajectory is a result of optimizing the following objective function:


Here, is a vector of specified features (or basis functions), with each feature differentiable; and is the unknown weight vector with the th element being the weight for feature , .

In inverse optimal control, the goal is to learn the unknown weights for the given features from the full trajectory . Note that scaling by a non-zero constant does not affect the IOC problem because a scaled will result in the same trajectory . Without loss of generality, one can always scale such that its first entry is equal to 1, as adopted in [11], namely,


Suppose that one has access to a collection of trajectory segments, denoted by , which is a set of data segments of , and . A segment in is defined as a sequence of system states and inputs , where and denote the starting and end times of the segment, respectively, and . Thus,


where and are the starting and end times of the th available segment. It is worth noting that we do not put any restrictions on , which means that any segment in it can be the full trajectory or even a single input-state point at a single time instant in terms of . Different segments are also allowed to overlap. Here, denotes the number of segments currently available, and the total number of segments can be very large. Also, in the method developed below, we do not require knowledge of , i.e., the starting time of each segment relative to the starting time of the system trajectory.

Since each segment in may not be sufficient to determine by itself, the problem of interest is to develop an IOC algorithm that estimates by incrementally incorporating all segments in .

III The Proposed Approach

In this section, we first present the idea of how to establish a constraint on the feature weights from any available segment data, then develop the IOC approach.

III-A Key Idea to Utilize Any Trajectory Segment in IOC

Let be any segment of the full trajectory (2) with . Since the full trajectory is generated by the system (1) minimizing (3), there exists a sequence of Lagrange multipliers (or costates) such that the following Karush-Kuhn-Tucker (KKT) optimality conditions [3] hold for , that is,




is Lagrangian of the optimization (optimal control) problem. From (6), one has the following equations for any :


which can also be derived from Pontryagin’s maximum principle [20]. It follows that for any trajectory segment , by stacking (8)-(9) for one has






Dimensions of the above matrices are , , , , and , respectively. In (10), since is undefined when , we define .
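As a numerical sanity check of the kind of stationarity conditions stacked above, the following minimal sketch verifies a costate recursion and an input stationarity condition on a segment of an optimal trajectory. Everything in it is an assumption made for illustration: a hypothetical linear system `A`, `B`, quadratic features `[x1^2, x2^2, u^2]`, a finite-horizon cost with terminal term, and costates obtained from a Riccati recursion. It is not the paper's example, and its indexing conventions may differ from those of (8)-(9).

```python
import numpy as np

# Hypothetical linear system x_{k+1} = A x_k + B u_k (illustration only).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])

# Assumed features phi(x, u) = [x1^2, x2^2, u^2] with assumed weights omega.
omega = np.array([1.0, 2.0, 0.5])
Q = np.diag(omega[:2])
R = np.array([[omega[2]]])
T = 30

# Backward Riccati recursion for J = sum_{k<T} (x'Qx + u'Ru) + x_T' Q x_T.
P = [None] * (T + 1)
K = [None] * T
P[T] = Q
for k in range(T - 1, -1, -1):
    K[k] = np.linalg.solve(R + B.T @ P[k + 1] @ B, B.T @ P[k + 1] @ A)
    P[k] = Q + A.T @ P[k + 1] @ (A - B @ K[k])

# Forward rollout of the optimal states and inputs.
x = [np.array([1.0, -0.5])]
u = []
for k in range(T):
    u.append(-K[k] @ x[k])
    x.append(A @ x[k] + B @ u[k])

# Costates as gradients of the value function: lambda_k = 2 P_k x_k.
lam = [2.0 * P[k] @ x[k] for k in range(T + 1)]


def phi_x(xk):
    """Jacobian of the features with respect to x, shape (3, 2)."""
    return np.array([[2 * xk[0], 0.0], [0.0, 2 * xk[1]], [0.0, 0.0]])


def phi_u(uk):
    """Jacobian of the features with respect to u, shape (3, 1)."""
    return np.array([[0.0], [0.0], [2 * uk[0]]])


def segment_residual(t0, t1):
    """Largest violation of the two conditions over time steps t0..t1."""
    worst = 0.0
    for k in range(t0, t1 + 1):
        r_x = A.T @ lam[k + 1] + phi_x(x[k]).T @ omega - lam[k]
        r_u = B.T @ lam[k + 1] + phi_u(u[k]).T @ omega
        worst = max(worst, np.abs(r_x).max(), np.abs(r_u).max())
    return worst


print(segment_residual(5, 9))
```

Both conditions are linear in the weights and the costates, which is what allows a segment to be turned into a linear constraint once the costates are eliminated.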

Since the matrix is non-singular, one can eliminate by combining (10) and (11) and obtain




Note that (15) establishes a relation between any data segment , the unknown weights , and the costate . Note that is unknown and actually related to the value function of future information [9]. In order to further eliminate and measure the contribution of each data segment to solving , we introduce the following concept of data effectiveness for IOC problems.

Definition 1 (Effective Data for IOC).

Given system (1) and an arbitrary segment , , we say the segment is data effective if


where is as defined in (17).

It follows from Definition 1 that for any effective segment , the corresponding quantity is non-singular. Thus by multiplying to both sides of (15), one can solve


which together with (15) lead to




In summary, we have the following lemma.

Lemma 1.

For any segment that is data effective in the sense of Definition 1, must satisfy (20).

Lemma 1 builds a bridge between any data-effective segment and the unknown objective function weights; that is, any effective segment imposes a set of linear constraints on the weights . Thus, more data-effective segments result in more constraints for recovering .

Note that , defined in (17), is uniquely determined by , and , which rely only on the data in and system dynamics . Thus, whether a data segment is effective or not is independent of the choice of features , and is determined only by the data segment itself and the system model. Furthermore, the effective data condition (18) can be fulfilled efficiently by including additional state-input points in the current data segment, as suggested by the following analysis.

Lemma 2.

For any , one has


The proof of Lemma 2 is given in the Appendix. Lemma 2 implies that, when is non-singular,


The rank non-decreasing property of suggests that the more data points a segment contains, the more likely it is to be effective. Indeed, as we will show in the simulations, a segment is usually data-effective when


with being the ceiling operator, which is a necessary condition directly suggested by the size of . Specifically, if , even a single state/input point can be effective. Interestingly, from (22) we find that the definition of data-effectiveness is equivalent to the definition of controllability for linear systems. This means that for any controllable linear system, as long as the length of a segment satisfies (24), the segment is also data-effective.
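The linear-system analogy can be illustrated with the standard Kalman controllability matrix. The sketch below is only that analogy, under assumed names: it does not construct the paper's matrix from (17), and the systems `A_di`, `B_di`, and `B_full` are hypothetical. It shows that the rank of the windowed controllability matrix never decreases as steps are added, and it compares the smallest window reaching full rank with the size-based bound `ceil(n / m)`.

```python
import math

import numpy as np


def controllability_rank(A, B, steps):
    """Rank of the windowed controllability matrix [B, AB, ..., A^(steps-1) B]."""
    blocks, M = [], B
    for _ in range(steps):
        blocks.append(M)
        M = A @ M
    return int(np.linalg.matrix_rank(np.hstack(blocks)))


def min_steps_for_full_rank(A, B):
    """Smallest window with full rank n, or None if (A, B) is not controllable."""
    n = A.shape[0]
    for steps in range(1, n + 1):  # by Cayley-Hamilton, n steps suffice
        if controllability_rank(A, B, steps) == n:
            return steps
    return None


# Hypothetical double integrator with one input (n = 2, m = 1).
A_di = np.array([[1.0, 0.1], [0.0, 1.0]])
B_di = np.array([[0.0], [0.1]])

# Same dynamics with two independent inputs (n = 2, m = 2).
B_full = np.eye(2)

ranks = [controllability_rank(A_di, B_di, s) for s in range(1, 5)]
print("ranks over growing windows:", ranks)
print("single input:", min_steps_for_full_rank(A_di, B_di),
      "bound:", math.ceil(2 / 1))
print("two inputs:", min_steps_for_full_rank(A_di, B_full),
      "bound:", math.ceil(2 / 2))
```

With as many independent inputs as states, a single step already gives full rank, which mirrors the remark that a single state/input point can be effective in that case.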

III-B Incremental IOC from Demonstration Segments

Based on Lemma 1, given a collection of data segments in (5), one has


for each segment if it is effective, where is defined in (21). Then for all data-effective segments in , one has the linear equation of the weights:


Here is a stack of for which the corresponding segment is effective. With all data-effective segments stacked into the matrix , the following lemma provides a sufficient condition for successfully estimating the unknown weight vector .

Lemma 3.

Given in (26b), if


then any vector in the kernel of is a scaled version of the weight , i.e., where is a scalar.


Proof. Condition (27) indicates that the kernel space of is one dimensional. Since (26a) is a necessary condition for the true weight vector , any vector must satisfy where is a scalar.

Lemma 3 gives a sufficient condition that guarantees successful estimation of the true weight . Failure to satisfy this condition indicates that and thus has a kernel of dimension larger than one, which means that a vector in the kernel is not guaranteed to be a scaled version of the true weight. To cope with the rank deficiency, more data segments should be included in in order to fulfill condition (27). It should also be noted that a single segment, say , being data-effective, i.e., , does not imply that it satisfies the sufficient condition (27) and suffices for successful recovery; for example, it could be that even though . Recall that relies only on segment data and dynamics, while additionally relies on features, and the latter condition is thus stricter. We will illustrate this in the experiments below.
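The role of the kernel dimension can be sketched numerically. In the example below, the constraint matrices `H1` and `H2` are hypothetical stand-ins for per-segment constraint matrices (they are hand-picked so that an assumed true weight vector `[1, 2, 0.5]` lies in their kernels), and `recover_weights` is an illustrative helper, not a routine from the paper. A single constraint matrix leaves a kernel of dimension two and recovery is refused; stacking a second one makes the kernel one dimensional, and normalizing the first entry to 1 returns the weights.

```python
import numpy as np


def recover_weights(H, tol=1e-8):
    """Return the kernel vector of H scaled so its first entry is 1.

    Returns None when the kernel of H is not one dimensional, or when the
    kernel vector has a (numerically) zero first entry.
    """
    r = H.shape[1]
    _, s, Vt = np.linalg.svd(H)  # Vt is r x r (full_matrices=True by default)
    rank = int(np.sum(s > tol * s[0])) if s.size else 0
    if rank != r - 1:
        return None
    v = Vt[-1]  # right singular vector spanning the kernel
    if abs(v[0]) < tol:
        return None
    return v / v[0]


# Assumed ground truth and hand-picked constraints orthogonal to it.
omega_true = np.array([1.0, 2.0, 0.5])
H1 = np.array([[2.0, -1.0, 0.0]])   # 2*1 - 1*2 + 0*0.5 = 0
H2 = np.array([[1.0, 0.0, -2.0]])   # 1*1 + 0*2 - 2*0.5 = 0

single = recover_weights(H1)                    # kernel is 2-D: no estimate
stacked = recover_weights(np.vstack([H1, H2]))  # kernel is 1-D: estimate
print(single, stacked)
```

The sign and scale ambiguity of the singular vector is removed by the normalization of the first entry, matching the scaling convention adopted for the weights.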

In implementation, since observation noise and/or sub-optimality exist, directly computing the weights from (26a) may only lead to trivial solutions. Thus, as adopted in previous IOC methods [11, 21, 17, 7], one can instead obtain a least-squares estimate of the weights by solving the following optimization problem,

subject to

Here, stands for the norm; and is called a least-square estimate to the unknown weights .

Based on the formulation in (28a)-(28b), if we consider the segments in are given incrementally, i.e., one segment each time, the following lemma presents an incremental way to solve for the least square estimate .

Lemma 4.

Given the th segment , , let


with and defined in (20). Then the least-square estimate in (28) given previous segments is


A proof of the above lemma is given in the Appendix. Lemma 4 shows that the least-squares estimate of the weights in (28) can be obtained incrementally by adding the information of each new segment to the matrix . As is of fixed dimension, no additional memory is consumed as new data is included. Given previous segments, the least-squares estimate of the unknown weights is given by (30). Based on Lemma 4, we present the IOC algorithm using demonstration segments in Algorithm 1.
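The incremental idea can be sketched as follows, under stated assumptions. The class `IncrementalIOC`, the accumulation `W += H.T @ H`, and the closed-form solve are illustrative choices: the solve is the standard solution of a least-squares problem with the first weight fixed to 1, obtained by partitioning the accumulated matrix, and it may differ in form from (30). The constraint matrices `H1` and `H2` are hypothetical, and the caller is assumed to pass in only data-effective segments.

```python
import numpy as np


class IncrementalIOC:
    """Accumulates segment constraints in a fixed-size r x r matrix."""

    def __init__(self, r):
        self.W = np.zeros((r, r))

    def add_segment(self, H):
        """Fold one segment's constraint matrix (k x r) into W."""
        self.W += H.T @ H

    def estimate(self):
        """Minimize w' W w subject to w[0] = 1.

        With w = [1, z], setting the gradient to zero gives W22 z = -w12,
        solved here in the least-squares sense to tolerate a singular W22.
        """
        w12 = self.W[1:, 0]
        W22 = self.W[1:, 1:]
        z, *_ = np.linalg.lstsq(W22, -w12, rcond=None)
        return np.concatenate(([1.0], z))


# Hypothetical constraints orthogonal to an assumed true weight [1, 2, 0.5].
H1 = np.array([[2.0, -1.0, 0.0]])
H2 = np.array([[1.0, 0.0, -2.0]])

ioc = IncrementalIOC(r=3)
ioc.add_segment(H1)
ioc.add_segment(H2)
estimate = ioc.estimate()
print(estimate)
```

Only the r x r matrix is stored, so memory does not grow with the number of segments, and the result does not depend on the order in which segments arrive.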



IV Numerical Experiments

In this section, we evaluate the proposed method on a simulated robot arm and a 6-DoF quadrotor UAV system.

IV-A Two-link robot arm

Fig. 1: A simulated robot arm.

As shown in Fig. 1, we consider a two-link robot arm moving in the vertical plane, with continuous-time dynamics given by [24, p. 209]


where is the joint angle vector; is the inertia matrix; is the Coriolis matrix; is the gravity vector; and are the torques applied to each joint. The parameters used here follow [24, p. 209]: the link mass , the link length ; the distance from joint to center of mass (COM) , and the moment of inertia with respect to COM . By defining the states and control inputs of the robot arm system


respectively, one could write (31) in state-space representation and further approximate it by the following discrete-time form


where is the discretization interval. The motion of the robot arm is controlled to minimize the objective function (3), which here is set as a weighted distance to the goal state plus the control effort . The corresponding features and weights are defined as follows.


The initial condition of the robot arm is set as , and the time horizon is set as . We set the ground-truth weights as in (34), and the resulting optimal trajectory of states and inputs is plotted in Fig. 2.

Fig. 2: Optimal trajectory of the robot arm.

In the IOC task, we learn the weight vector from the segment data of the optimal trajectory in Fig. 2. As shown in Table I, we perform five trials, and for each trial we are given a collection of segments of the optimal trajectory, as indicated by the corresponding time intervals (second column). We apply Algorithm 1 to obtain the least-squares estimate for each trial, and show the estimation results in the last column of Table I.

Trial No. Intervals of segments Estimate
Trial 1
, ,
Trial 2
Trial 3
Trial 4
Trial 5
TABLE I: IOC results from data segments

As shown in Table I, the algorithm successfully recovers the feature weights in (34) in all trials except Trial 4. For all trials, we have randomly selected segments sparsely located in the time horizon. Note that although the first segment in Trial 1, the second segment in Trial 2, and the second segment in Trial 3 just reach the lower-bound length given by (24), i.e., , they all satisfy the effective data condition in (18). This indicates that the effective data condition is mild. Moreover, we have tested other data segments of the trajectory, and observed that most trajectory segments are effective as long as their length reaches the lower bound, except for those very near the end of the trajectory, such as the segments within the time interval . This is because, as the system trajectory in Fig. 2 finally converges to zero, the states and inputs at the end of the time horizon are very close to zero (low excitation), and such segments are thus likely to be non-effective.

In Table I, it is also worth noting that Trial 4 fails to recover the true weight vector. This is because, although the segment in Trial 4 is data-effective (i.e., ), one has and thus it does not meet the sufficient condition stated in Lemma 3 for successful estimation. To address this, we add another segment , as shown in Trial 5, in order to fulfill the sufficient condition in (27), and now . Therefore, Trial 5 successfully estimates the true weight vector .

The above results show that the data-effectiveness condition (18) is mild and easy to fulfill, for example, by letting the segment length reach the lower bound. Data-effectiveness is a precondition for a segment to be used for solving IOC problems. Even if a single segment is data-effective, it may not suffice for recovering the weights. The IOC sufficient condition (27) is stricter than the data-effectiveness condition (18), because relies only on segment data and dynamics, while additionally relies on features.

IV-B Quadrotor UAV

Next, we apply the proposed method to learn the objective function for a 6-DoF quadrotor UAV maneuvering system. Consider a quadrotor UAV with the following dynamics


Here, the subscripts and denote that a quantity is expressed in the body frame and the inertial (world) frame, respectively; and are the mass and the moment of inertia with respect to the body frame of the UAV, respectively; and are the position and velocity vectors of the UAV; is the angular velocity vector of the UAV; is the unit quaternion [13] that describes the attitude of the UAV with respect to the inertial frame; and is defined as:


is the torque applied to the UAV; is the force vector applied to the UAV center of mass. The total force magnitude (along the z-axis of the body frame) and the torque are generated by the thrusts of four rotating propellers ; their relationship can be expressed as:


where is the wing length of the UAV and is a fixed constant. Similar to (33), we discretize the above dynamics with a discretization interval of 0.1 s. The parameters of the dynamics are given in Table II.

Parameters Value Unit
0.4 m
1 kg
TABLE II: Dynamics parameters for the Quadrotor UAV

The state and input vectors of the UAV are defined as:


The control objective function of the UAV includes a carefully selected attitude error term. As used in [14], we define the attitude error between UAV’s current attitude and the goal attitude as:


where is the direction cosine matrix [13] directly corresponding to the quaternion . Other error terms that are included in the control objective function are simply the squared distances to their corresponding goals:


where and are the goal position and velocity states respectively.

We generate the UAV optimal trajectory by minimizing a given control objective function. The initial state is set as , and the goal state is set as . The control objective function is written as the weighted distance to the goal state plus the control effort , where the features and weights are defined as follows:


with corresponding weights:


The time horizon is set to .

Similar to the previous experiment, we set up four trials, and for each trial we observe different segments of the optimal state trajectories, as listed in the second column of Table III. The feature weight estimates are shown in the last column of Table III.

Trial No. Intervals of segments Estimate
Trial 1
Trial 2
, ,