Inverse Optimal Control from Demonstration Segments
Abstract
This paper develops an inverse optimal control method to learn an objective function from segments of demonstrations. Here, each segment is part of an optimal trajectory within any time interval of the horizon. The unknown objective function is parameterized as a weighted sum of given features with unknown weights. The proposed method shows that each trajectory segment can be transformed into a linear constraint on the unknown weights, and then all available segments are incrementally incorporated to solve for the unknown weights. The effectiveness of the proposed method is shown on a simulated two-link robot arm and a 6-DoF maneuvering quadrotor system, in each of which only segment data of the systems’ trajectories are available.
I Introduction
With the capability of recovering an objective function of an optimal control system from observations of the system’s trajectories, inverse optimal control (IOC) has been widely applied in imitation learning [1], where a learner mimics an expert by learning the expert’s underlying objective function; autonomous vehicles [12], where a human driver’s driving preference is learned and transferred to vehicle controllers; and human-robot interactions [16], where an objective function of human motor control is inferred to enable efficient prediction and coordination.
Existing IOC methods usually assume that the unknown objective function can be parameterized as a linear combination of selected features (or basis functions) [19, 18]. Here, each feature characterizes one aspect of the performance of the system operation, such as energy cost, time consumption, or risk level. The goal of IOC then becomes estimating the unknown weights for those features [8]. The authors of [22, 23, 25, 26, 10] have adopted a double-layer architecture, where the estimate of the weights is updated in an outer layer while the corresponding optimal trajectory is generated by solving the optimal control problem in an inner layer. Techniques based on the double-layer framework usually suffer from high computational cost, since optimal control problems need to be solved repeatedly [9]. More recent IOC techniques leverage optimality conditions that the observed optimal trajectory must satisfy, so that the unknown weights can be obtained directly by solving the resulting optimality equations. Related work along this direction includes [11, 21, 7], where the Karush-Kuhn-Tucker conditions are used, and [17], where Pontryagin’s maximum principle [20] is used.
Despite the significant progress described above, most existing IOC methods cannot learn the objective function unless a complete system trajectory over the entire time horizon is observed. This requirement limits their use in cases where only incomplete trajectory data, or even sparse data, is available, for example, due to limited sensing capability, sensor failures, or occlusion [5, 2]. In [2], given sparse corrections (demonstrations), the authors create an intended full-horizon trajectory from the sparse data using trajectory shaping/interpolation [6], in order to utilize the maximum margin IOC approach [22]. Although this approach succeeds in learning from human corrections, the artificially created trajectory might not exactly reflect the actual trajectory of a human expert. In [4], the authors model the missing data using a probability distribution, and then both the objective function and the missing part are learned under the expectation-maximization framework. Besides its high computational cost, this work does not analyze how the percentage of missing information affects learning performance. In the recent work [8], the notion of a recovery matrix has been introduced to solve IOC using incomplete observations, but it still requires the observation data to be consecutive and long enough to satisfy the recovery condition.
In recognition of the above limitations, this paper aims to develop an approach that learns the objective function directly from available demonstration segments, without attempting to characterize the missing information. By demonstration segments, we refer to a collection of segments of the system’s trajectory of states and inputs in any time intervals of the horizon; we allow a segment to be a single data point, i.e., a state/input at a single time instant. Since each segment may not be sufficient to determine the objective function by itself, an incremental approach will be developed to incorporate all available segments and obtain an estimate of the unknown weights of the objective function.
Notations
The column operator stacks its (vector) arguments into a column. denotes a stack of multiple from to (), that is, . (bold-type) denotes a block matrix. Given a vector function and a constant , denotes the Jacobian matrix with respect to evaluated at . Zero matrix/vector is denoted as , and identity matrix as , both with appropriate dimensions. denotes the transpose of matrix .
II Problem Statement
Consider an optimal control system with discrete-time dynamics and initial condition as follows
(1) 
where vector function is differentiable; denotes the system state; is the control input; and is the time step. Let
(2) 
denote a trajectory of system states and inputs in a time horizon . Consider that the system trajectory is a result of optimizing the following objective function:
(3) 
Here, is a vector of specified features (or basis functions), with each feature differentiable; and is the unknown weight vector with the th element being the weight for feature , .
In inverse optimal control, the goal is to learn the unknown weights for the given features from the full trajectory . Note that scaling by a nonzero constant does not affect the IOC problem because a scaled will result in the same trajectory . Without loss of generality, one can always scale such that its first entry is equal to 1, as adopted in [11], namely,
(4) 
Suppose that one has access to a collection of trajectory segments, denoted by , which is a set of data segments of , and . A segment in is defined as a sequence of system states and inputs , where and denote the starting and end time of the segment, respectively, and . Thus,
(5) 
where and are the starting and end time of the th available segment. It is worth noting that we do not put any restrictions on , which means that any segment in it can be the full trajectory or even a single input-state point at a single time instant in terms of . Different segments are also allowed to overlap. Here, denotes the number of segments that are currently available, and the total number of segments can be very large. Also, in the method developed below, we do not require knowledge of , i.e., the starting time of each segment relative to the starting time of the system trajectory.
Since each segment in may not be sufficient to determine by itself, the problem of interest is to develop an IOC algorithm that obtains an estimate of by incrementally incorporating all segments in .
III The Proposed Approach
In this section, we first present the idea of how to establish a constraint on the feature weights from any available segment data, then develop the IOC approach.
III-A Key Idea to Utilize Any Trajectory Segment in IOC
Let be any segment of the full trajectory (2) with . Since the full trajectory is generated by the system (1) minimizing (3), there exists a sequence of Lagrange multipliers (or costates) such that the following KKT optimality conditions [3] hold for , that is,
(6) 
where
(7) 
is the Lagrangian of the optimization (optimal control) problem. From (6), one has the following equations for any :
(8)  
(9) 
which can also be obtained from Pontryagin’s maximum principle [20]. It follows that for any trajectory segment , by stacking (8)-(9) for one has
(10)  
(11) 
with
(12)  
(13) 
and
(14) 
The dimensions of the above matrices are , , , , and , respectively. In (10) above, since is undefined when , we define .
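For orientation, the per-step conditions that are stacked above commonly take the form sketched below for a discrete-time problem with a feature-weighted cost. The symbols are assumed labels introduced only for this sketch: $x_k$ and $u_k$ for the state and input, $\lambda_k$ for the costate, $\phi$ for the feature vector, $\omega$ for the weight vector, $f$ for the dynamics, and a prime for the transpose. The exact sign and indexing conventions of (8)-(9) may differ.

```latex
% Standard discrete-time stationarity conditions (assumed notation):
% costate recursion from the derivative with respect to the state,
% and input stationarity from the derivative with respect to the input.
\begin{align*}
\lambda_k &= \frac{\partial f'}{\partial x_k}\,\lambda_{k+1}
           + \frac{\partial \phi'}{\partial x_k}\,\omega, \\
0 &= \frac{\partial f'}{\partial u_k}\,\lambda_{k+1}
   + \frac{\partial \phi'}{\partial u_k}\,\omega .
\end{align*}
```

Both relations are linear in the weights and the costates, which is what allows the costates inside a segment to be eliminated by linear algebra.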
Since the matrix is nonsingular, one can eliminate by combining (10) and (11) and obtain
(15) 
Here
(16)  
(17) 
Note that (15) establishes a relation between any data segment , the unknown weights , and the costate . Note that is unknown and actually related to the value function of future information [9]. In order to further eliminate and measure the contribution of each data segment to solving , we introduce the following concept of data effectiveness for IOC problems.
Definition 1 (Effective Data for IOC).
It follows from Definition 1 that for any effective segment , the corresponding quantity is nonsingular. Thus by multiplying to both sides of (15), one can solve
(19) 
which together with (15) leads to
(20) 
with
(21) 
In summary, we have the following lemma.
Lemma 1 links any data-effective segment to the unknown objective function weights; that is, any effective segment imposes a set of linear constraints on the weights . Thus, more data-effective segments yield more constraints for recovering .
Note that , defined in (17), is uniquely determined by , and , which only rely on the data in and system dynamics . Thus, whether a data segment is effective or not is independent of the choice of features , and is determined only by the data segment itself and the system model. Furthermore, the effective data condition (18) can be fulfilled efficiently by including additional state-input points in the current data segment, as suggested by the following analysis.
Lemma 2.
For any , one has
(22) 
The proof of Lemma 2 is given in the Appendix. Lemma 2 implies that when is nonsingular,
(23) 
The rank non-decreasing property of suggests that the more data points a segment contains, the more likely it is to be effective. Indeed, as we will show in later simulations, a segment is usually data-effective when
(24) 
with being the ceiling operator, which is a necessary condition directly suggested by the size of . Specifically, if , even a single state/input point can be effective. Interestingly, from (22) we find that the definition of data-effectiveness is equivalent to the definition of controllability for linear systems. This means that for any controllable linear system, as long as the length of a segment satisfies (24), the segment is also data-effective.
III-B Incremental IOC from Demonstration Segments
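The link to controllability can be illustrated with a minimal sketch for a linear system. We assume here that the lower bound in (24) is the ceiling of the state dimension divided by the input dimension; the system matrices `A` and `B` below are a hypothetical double integrator and are not taken from the experiments.

```python
import math
import numpy as np

def min_segment_length(n_state: int, n_input: int) -> int:
    """Assumed lower bound on segment length: ceil(n_state / n_input)."""
    return math.ceil(n_state / n_input)

def reachability_rank(A: np.ndarray, B: np.ndarray, length: int) -> int:
    """Rank of [B, AB, ..., A^(length-1) B] for x_{k+1} = A x_k + B u_k."""
    blocks = [B]
    for _ in range(length - 1):
        blocks.append(A @ blocks[-1])
    return int(np.linalg.matrix_rank(np.hstack(blocks)))

# Hypothetical double integrator: 2 states, 1 input.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])

l_min = min_segment_length(A.shape[0], B.shape[1])
rank_short = reachability_rank(A, B, 1)      # one point: too few columns
rank_min = reachability_rank(A, B, l_min)    # at the assumed lower bound
print(l_min, rank_short, rank_min)
```

For this controllable pair, full rank is reached exactly at the assumed lower bound, which mirrors the statement that a segment of that length is data-effective for a controllable linear system.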
Based on Lemma 1, given a collection of data segments in (5), one has
(25) 
for each segment if it is effective, where is defined in (21). Then for all dataeffective segments in , one has the linear equation of the weights:
(26a)  
with  
(26b)  
Here is a stack of for which the corresponding segment is effective. With all data-effective segments stacked into the matrix , the following lemma provides a sufficient condition for successfully estimating the unknown weight vector .
Lemma 3.
Given in (26b), if
(27) 
then any nonzero vector in the kernel of is a scaled version of the weight vector , i.e., where is a scalar.
Proof.
Lemma 3 gives a sufficient condition that guarantees successful estimation of the true weight . Failure to satisfy this condition indicates that and thus has a kernel with dimension larger than one, which means a vector in the kernel is not guaranteed to be a scaled version of the true weight. To cope with the rank deficiency, more data segments should be included in in order to fulfill the condition (27). It should also be noted that a single segment, say , even though it is data-effective, i.e., , does not necessarily satisfy the sufficient condition in (27) and suffice for successful recovery; for example, it could be even though . Recall that only relies on segment data and dynamics, while additionally relies on features, and thus is stricter. We will also illustrate this in the later experiment.
In implementation, because observation noise and/or suboptimality are present, directly computing the weights from (26a) may yield only trivial solutions. Thus, as adopted in previous IOC methods [11, 21, 17, 7], one can instead obtain a least-squares estimate of the weights by solving the following optimization,
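In the noise-free case, the recovery described by Lemma 3 can be sketched numerically as follows. The matrix `H` stands for the stacked constraint matrix in (26b). Its entries here are hypothetical and chosen so that the true weights are (1, 2, 3). We assume the rank condition in (27) requires a kernel of dimension exactly one, and we apply the normalization (4).

```python
import numpy as np

def weights_from_kernel(H, tol=1e-9):
    """Return (weights, rank); weights is None if the kernel is not 1-D."""
    H = np.atleast_2d(np.asarray(H, dtype=float))
    n_features = H.shape[1]
    _, s, vt = np.linalg.svd(H)           # vt is n_features x n_features
    rank = int(np.sum(s > tol * max(s.max(), 1.0)))
    if rank != n_features - 1:
        return None, rank                 # kernel dimension is not one
    v = vt[-1]                            # basis vector of the kernel
    if abs(v[0]) < tol:
        return None, rank                 # cannot normalize first entry
    return v / v[0], rank                 # scale so the first entry is 1

# Hypothetical constraints, each row orthogonal to (1, 2, 3).
H = np.array([[2.0, -1.0, 0.0],
              [3.0, 0.0, -1.0],
              [0.0, 3.0, -2.0]])
w_hat, rank_H = weights_from_kernel(H)

# A single row leaves a two-dimensional kernel, so recovery is refused.
w_single, rank_single = weights_from_kernel(H[:1])
print(w_hat, rank_H, w_single, rank_single)
```

The second call illustrates the rank-deficient situation discussed above: the constraint is valid, but it does not pin down the weights until more segments are added.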
(28a)  
subject to  
(28b) 
Here, stands for the norm; and is called a least-squares estimate of the unknown weights .
Based on the formulation in (28a)-(28b), if we consider that the segments in are given incrementally, i.e., one segment at a time, the following lemma presents an incremental way to solve for the least-squares estimate .
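A minimal sketch of such a constrained least-squares estimate is given below. We assume that the objective is the squared Euclidean norm of the stacked matrix times the weights, and that the constraint fixes the first weight to 1 as in (4). The matrix `H_noisy` is hypothetical: it is built from constraints consistent with the weights (1, 2, 3) plus small random perturbations.

```python
import numpy as np

def constrained_least_squares(H):
    """Minimize ||H w||^2 subject to w[0] = 1 (assumed form of (28))."""
    H = np.asarray(H, dtype=float)
    h1, H_rest = H[:, 0], H[:, 1:]
    # With w = [1, w_rest], the objective is ||h1 + H_rest w_rest||^2.
    w_rest, *_ = np.linalg.lstsq(H_rest, -h1, rcond=None)
    return np.concatenate(([1.0], w_rest))

rng = np.random.default_rng(0)
H_clean = np.array([[2.0, -1.0, 0.0],
                    [3.0, 0.0, -1.0],
                    [0.0, 3.0, -2.0]])
H_noisy = H_clean + 1e-4 * rng.standard_normal(H_clean.shape)

w_ls = constrained_least_squares(H_noisy)
print(w_ls)
```

Fixing the first entry removes the trivial zero solution, so the estimate stays well defined even when noise makes the stacked matrix full rank.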
Lemma 4.
A proof of the above lemma is given in the Appendix. Lemma 4 shows that the least-squares estimate of the weights in (28) can be obtained incrementally by adding the new segment information to the matrix . As is of fixed dimension, no additional memory is consumed as newly available data is included. Given the previous segments, the least-squares estimate of the unknown weights is solved by (30). Based on Lemma 4, we present the IOC algorithm using demonstration segments in Algorithm 1.
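One way to realize such an incremental update is sketched below. This is an illustrative accumulation of a fixed-size Gram matrix under the assumed constraint that the first weight equals 1, and it is not necessarily the exact recursion of Lemma 4. The class name `IncrementalIOC` and the per-segment matrices are hypothetical.

```python
import numpy as np

class IncrementalIOC:
    """Accumulates R = sum_i H_i^T H_i, whose size is fixed by the features."""

    def __init__(self, n_features: int):
        self.R = np.zeros((n_features, n_features))

    def add_segment(self, H_seg) -> None:
        """Fold the constraint matrix of one effective segment into R."""
        H_seg = np.atleast_2d(np.asarray(H_seg, dtype=float))
        self.R += H_seg.T @ H_seg

    def estimate(self) -> np.ndarray:
        """Minimize w^T R w subject to w[0] = 1."""
        r12 = self.R[1:, 0]
        R22 = self.R[1:, 1:]
        # Stationarity gives R22 w_rest = -r12; lstsq tolerates singular R22.
        w_rest, *_ = np.linalg.lstsq(R22, -r12, rcond=None)
        return np.concatenate(([1.0], w_rest))

ioc = IncrementalIOC(n_features=3)
ioc.add_segment([[2.0, -1.0, 0.0]])   # first segment: not yet sufficient
w_after_one = ioc.estimate()
ioc.add_segment([[3.0, 0.0, -1.0]])   # second segment completes the rank
w_after_two = ioc.estimate()
print(w_after_one, w_after_two)
```

Because only the accumulated matrix is stored, memory use does not grow with the number of segments, which matches the fixed-dimension property noted above.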
IV Numerical Experiments
In this section, we evaluate the proposed method on a simulated robot arm and a 6-DoF quadrotor UAV system.
IV-A Two-link robot arm
As shown in Fig. 1, we consider a two-link robot arm that moves in the vertical plane with continuous dynamics given by [24, p. 209]
(31) 
where is the joint angle vector; is the inertia matrix; is the Coriolis matrix; is the gravity vector; and are the torques applied to each joint. The parameters used here follow [24, p. 209]: the link mass , the link length ; the distance from joint to center of mass (COM) , and the moment of inertia with respect to COM . By defining the states and control inputs of the robot arm system
(32) 
respectively, one can write (31) in state-space representation and further approximate it by the following discrete-time form
(33) 
where is the discretization interval. The motion of the robot arm is controlled to minimize the objective function (3), which here is set as a weighted distance to the goal state plus the control effort . The corresponding features and weights are defined as follows.
(34) 
The initial condition of the robot arm is set as , and the time horizon is set as . We set the ground-truth weights as in (34), and the resulting optimal trajectory of states and inputs is plotted in Fig. 2.
In the IOC task, we learn the weight vector from the segment data of the optimal trajectory in Fig. 2. As shown in Table I, we perform five trials, and for each trial we are given a collection of segments of the optimal trajectory, as indicated by the corresponding time intervals (second column). We apply Algorithm 1 to obtain the least-squares estimate for each trial, and show the estimation results in the last column of Table I.
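The discretization step used to obtain a discrete-time model such as (33) can be sketched as follows. We assume a forward-Euler scheme here. The continuous-time model `pendulum` is a hypothetical toy system used only to keep the example short, and it is not the two-link arm dynamics (31).

```python
import numpy as np

def euler_discretize(f_cont, dt: float):
    """Return a step function x_next = x + dt * f_cont(x, u) (forward Euler)."""
    def step(x, u):
        x = np.asarray(x, dtype=float)
        u = np.asarray(u, dtype=float)
        return x + dt * f_cont(x, u)
    return step

def pendulum(x, u):
    """Hypothetical toy dynamics: state is (angle, angular velocity)."""
    theta, omega = x
    return np.array([omega, -9.81 * np.sin(theta) + u[0]])

step = euler_discretize(pendulum, dt=0.01)
x_next = step([0.0, 1.0], [0.0])
print(x_next)
```

Any differentiable one-step map could be substituted for `step`, since the method only needs the discrete-time dynamics and their Jacobians along the observed data.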
Trial No.  Intervals of segments  Estimate
Trial 1
Trial 2
Trial 3
Trial 4
Trial 5
As shown in Table I, the algorithm successfully recovers the feature weights in (34) in all trials except Trial 4. For all trials, we have randomly selected segments sparsely located in the time horizon. Note that the first segment in Trial 1, the second segment in Trial 2, and the second segment in Trial 3 just reach the lower bound on length given by (24), i.e., , and they all satisfy the effective data condition in (18). This indicates that the effective data condition is mild. Moreover, we have tested other data segments of the trajectory and observed that most trajectory segments are effective as long as their length reaches the lower bound, except for those very near the end of the trajectory, such as the segments within the time interval . This is because, as the system trajectory in Fig. 2 finally converges to zero, the states and inputs at the end of the time horizon are very close to zero (low excitation), and such segments are thus likely to be non-effective.
In Table I, it is also worth noting that Trial 4 fails to recover the true weight vector. This is because, although the segment in Trial 4 is data-effective (i.e., ), one has and thus the segment does not meet the sufficient condition stated in Lemma 3 for successful estimation. To address this, we add another segment , as shown in Trial 5, in order to fulfill the sufficient condition in (27), and now . Therefore, Trial 5 successfully estimates the true weight vector .
The above results show that the data-effectiveness condition (18) is mild and easy to fulfill, for example, by letting the length of a segment reach the lower bound. Data-effectiveness is a precondition for a segment to be used for solving IOC problems. Even when a single segment is data-effective, it may not suffice for recovering the weights. The IOC sufficient condition (27) is stricter than the data-effectiveness condition (18), because only relies on segment data and dynamics, while additionally relies on features.
IV-B Quadrotor UAV
Next, we apply the proposed method to learn the objective function for a 6-DoF quadrotor UAV maneuvering system. Consider a quadrotor UAV with the following dynamics
(35)  
Here, the subscripts and denote that a quantity is expressed in the body frame and the inertial (world) frame, respectively; and are the mass and the moment of inertia with respect to the body frame of the UAV, respectively. and are the position and velocity vectors of the UAV; is the angular velocity vector of the UAV; is the unit quaternion [13] that describes the attitude of the UAV with respect to the inertial frame; is defined as:
(36) 
is the torque applied to the UAV; is the force vector applied to the UAV center of mass. The total force magnitude (along the z-axis of the body frame) and the torque are generated by the thrusts of the four rotating propellers ; their relationship can be expressed as:
(37) 
where is the wing length of the UAV and is a fixed constant. Similar to (33), we discretize the above dynamics with a discretization interval of 0.1 s. The parameters of the dynamics are given in Table II.
Parameters  Value  Unit 

diag([1,1,5])  
10  
0.4  m  
0.01  
1  kg 
The state and input vectors of the UAV are defined as:
(38)  
The control objective function of the UAV includes a carefully selected attitude error term. As used in [14], we define the attitude error between the UAV’s current attitude and the goal attitude as:
(39) 
where is the direction cosine matrix [13] directly corresponding to the quaternion . Other error terms that are included in the control objective function are simply the squared distances to their corresponding goals:
(40)  
where and are the goal position and velocity states, respectively.
We generate the UAV optimal trajectory by minimizing a given control objective function. The initial state is set as , and the goal state is set as . The control objective function is written as the weighted distance to the goal state plus the control effort , where the features and weights are defined as follows:
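The attitude error term can be sketched numerically as follows. We assume a scalar-first unit quaternion and an error of the form one half of the trace of the identity minus the product of the transposed goal rotation and the current rotation, which is a common choice in geometric attitude control. The exact expression in (39) may differ, and the function names are hypothetical.

```python
import numpy as np

def quat_to_dcm(q) -> np.ndarray:
    """Direction cosine matrix of a quaternion q = (w, x, y, z)."""
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)    # guard against drift from unit norm
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def attitude_error(q, q_goal) -> float:
    """Assumed error: 0.5 * trace(I - R(q_goal)^T R(q)); zero when aligned."""
    R, R_goal = quat_to_dcm(q), quat_to_dcm(q_goal)
    return 0.5 * float(np.trace(np.eye(3) - R_goal.T @ R))

identity = [1.0, 0.0, 0.0, 0.0]
half_turn_z = [0.0, 0.0, 0.0, 1.0]        # 180 degree rotation about z

err_same = attitude_error(identity, identity)
err_flip = attitude_error(half_turn_z, identity)
print(err_same, err_flip)
```

This error is differentiable in the quaternion and does not depend on whether the rotation is read as body-to-world or world-to-body, since transposing both matrices leaves the trace unchanged.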
(41) 
with corresponding weights:
(42) 
The time horizon is set to .
Similar to the previous experiment, we set up four trials, and for each trial we observe different segments of the optimal state trajectories, as listed in the second column of Table III. The resulting estimates of the feature weights are shown in the last column of Table III.
Trial No.  Intervals of segments  Estimate  

Trial 1  
Trial 2 
