# Analysis of a Nonsmooth Optimization Approach to Robust Estimation

###### Abstract

In this paper, we consider the problem of identifying a linear map from measurements which are subject to intermittent and arbitrarily large errors. This is a fundamental problem in many estimation-related applications such as fault detection, state estimation in lossy networks, hybrid system identification, robust estimation, etc. The problem is hard because it exhibits some intrinsic combinatorial features. Therefore, obtaining an effective solution necessitates relaxations that are both solvable at a reasonable cost and effective in the sense that they can return the true parameter vector. The current paper discusses a nonsmooth convex optimization approach and provides a new analysis of its behavior. In particular, it is shown that under appropriate conditions on the data, an exact estimate can be recovered from data corrupted by a large (even infinite) number of gross errors.

Laurent Bako

Laboratoire Ampère – Ecole Centrale de Lyon – Université de Lyon, France

Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, CA, USA

Key words:  robust estimation, outliers, system identification, nonsmooth optimization.

Footnote: This paper was not presented at any IFAC meeting. Corresponding author: L. Bako. Tel.: +33 472 186 452.

## 1 Introduction

### 1.1 Problem and motivations

We consider a linear measurement model of the form

 $y_t = x_t^\top \theta_o + f_t + e_t$  (1)

where $y_t$ is the measured signal, $x_t \in \mathbb{R}^n$ the regression vector, $(e_t)$ a sequence of zero-mean and bounded errors (e.g., measurement noise, model mismatch, uncertainties, etc.) and $(f_t)$ a sequence of intermittent and arbitrarily large errors. Assume that we observe the sequences $(y_t)$ and $(x_t)$ and would like to compute the parameter vector $\theta_o$ from these observations. We are interested in doing so without knowing any of the sequences $(f_t)$ and $(e_t)$. We do however make the following assumptions:

• $(e_t)$ is a bounded sequence.

• $(f_t)$ is a sequence containing zeros and intermittent gross errors with (possibly) arbitrarily large magnitudes.

This is an important estimation problem arising in many situations such as fault detection [30, 11], hybrid system identification [16], subspace clustering [40, 2], and error correction in communication networks [7]. The case when $(f_t)$ is zero and $(e_t)$ is a Gaussian process has been well studied in linear system identification theory (see, e.g., the textbooks [23, 38]). A less studied, but very relevant scenario in the system identification community, is when the additional perturbation $(f_t)$ in (1) is nonzero and contains intermittent and arbitrarily large errors [7, 37, 26, 42]. It is worth noticing the difference with the problem studied in the field of compressive sensing [7, 13, 10]. In compressive sensing, the sought parameter vector is assumed sparse and the measurement noise is often Gaussian or bounded. Here, no assumptions are made concerning sparsity of $\theta_o$. We will, in this contribution, study essentially the case when the data is noise-free (i.e., $e_t = 0$ for all $t$) and $(f_t)$ is a sequence with intermittent gross errors. We will derive conditions for perfect recovery and point to effective algorithms for computing $\theta_o$. In the second part of the paper, the model assumption is relaxed to allow both $(e_t)$ and $(f_t)$ to be simultaneously nonzero. Note that this might be a more realistic scenario since most applications have measurement noise.
For illustrative purposes, let us briefly discuss some applications where a model of the form (1) is of interest.

#### 1.1.0.1 Switched linear system identification

A discrete-time Multi-Input Single-Output (MISO) Switched Linear System (SLS) can be written in the form

 $y_t = x_t^\top \theta^o_{\sigma_t} + e_t,$  (2)

where $x_t$ is the regressor at time $t$ defined by

 $x_t = [\,y_{t-1} \ \cdots \ y_{t-n_a} \ \ u_t^\top \ u_{t-1}^\top \ \cdots \ u_{t-n_b}^\top\,]^\top,$  (3)

where $u_t$ and $y_t$ denote respectively the input and the output of the system. The integers $n_a$ and $n_b$ in (3) are the maximum output and input lags (also called the orders of the system). $\sigma_t$ is the discrete mode (or discrete state) indexing the active subsystem at time $t$; it is in general assumed unobserved. $\theta^o_j$, $j = 1, \ldots, s$, is the parameter vector (PV) associated with the mode $j$. By singling out one of the PVs, say $\theta_o = \theta^o_1$, the Switched Auto-Regressive eXogenous (SARX) model (2) can be written in the form (1), with unknown $f_t$ of the following structure: $f_t = x_t^\top(\theta^o_{\sigma_t} - \theta_o)$, which vanishes whenever $\sigma_t = 1$. For a background on hybrid system identification, we refer to the references [32, 16, 41, 1, 25, 31, 28].

#### 1.1.0.2 Identification from faulty data

A model of the form (1) also arises when one has to identify a linear dynamic system which is subject to intermittent sensor faults. This is the case in general when the data are transmitted over a communication network [7, 30]. Model (1) is suitable for such situations and the sequence $(f_t)$ then models occasional data packet losses or potential outliers. More precisely, a dynamic MISO system with process faults can be directly written in the form (1). In the case of sensor faults, the faulty model might be defined by

 $\begin{cases} \bar y_t = \bar x_t^\top \theta_o + e_t \\ y_t = \bar y_t + w_t \end{cases}$

where $y_t$ is the observed output, affected by the fault $w_t$ (assumed to be nonzero only occasionally); $\bar x_t$ is defined as in (3) from the known input and the unobserved output $\bar y_t$. We can rewrite the faulty model exactly in the form (1), with $f_t$ depending linearly on $w_t$ and its past values. Sparsity of $(w_t)$ induces sparsity of $(f_t)$, but to a lesser extent.

#### 1.1.0.3 State estimation in the presence of intermittent errors

Considering a MISO dynamic system with state dynamics described by $z_{t+1} = A z_t$ and observation equation $y_t = c^\top z_t + f_t + e_t$, with $A$ and $c$ being known matrices of appropriate dimensions, and $(f_t)$ a sparse sequence of possibly very large errors, the finite-horizon state estimation problem reduces to the estimation of the initial state $z_0$. We get a model of the form (1) by setting $\theta_o = z_0$ and $x_t = (A^t)^\top c$, for $t = 0, \ldots, N-1$. Examples of relevant works are those reported in [3, 15]. In this latter application, it can however be noted that the dataset may not be generic enough. (Footnote: In this paper, the term genericity for a dataset characterizes a notion of linear independence. For example, a set of data points in general linear position in $\mathbb{R}^n$ is more generic than a set of data points contained in one subspace. We will introduce different quantitative measures of data genericity in the sequel; see Definition 2 and Theorem 11.)

#### 1.1.0.4 Connection to robust statistics

Indeed, the problem of identifying the parameter vector $\theta_o$ from model (1) under the announced assumptions can be viewed as a robust regression problem where the nonzero elements in the sequence $(f_t)$ are termed outliers. As such, it has received a lot of attention in the robust statistics literature (see, e.g., [21, 35, 24] for an overview). Examples of methods to tackle the robust estimation problem include the least absolute deviation [20], the least median of squares [34], the least trimmed squares [35], the M-estimator [21], etc. Most of these estimators come with an analysis in terms of the breakdown point [19, 36], a measure of the (asymptotic) minimum proportion of points which cause an estimator to be unbounded if they were to be arbitrarily corrupted by gross errors. The current paper focuses on the analysis of a nonsmooth convex optimization approach which includes the least absolute deviation method as the particular case corresponding to the situation when the output $y_t$ in (1) is a scalar. The analysis approach taken in the current paper differs in the following sense.

• In robust statistics the quality of an estimator is measured by its breakdown point. The higher the breakdown point, the better. The available analysis is therefore directed to determining a sort of absolute robustness: how many outliers (expressed in proportion of the total number of samples) cause the estimator to become unbounded.

• Here, the question of robust performance of the estimator is posed differently. We are interested in estimating the maximum number of outliers that a nonsmooth-optimization-based estimator can accommodate while still returning the exact value one would obtain in the absence of any outlier. This is more related to the traditional view developed in compressive sensing.

#### 1.1.0.5 Contributions of this paper

One promising method for estimating model (1) is by nonsmooth convex optimization, as suggested in [7, 37, 1, 26, 42]. More precisely, inspired by the recent theory of compressed sensing [7, 13, 10], the idea is to minimize a nonsmooth (and non-differentiable) sum-of-norms objective function involving the fitting errors. Under a noise-free assumption, such a cost function has the nice property that it can provide the true parameter vector in the presence of arbitrarily large errors, provided that the number of nonzero errors is small in some sense. Of course, when the data are corrupted simultaneously by the noise $(e_t)$ and the gross errors $(f_t)$, the recovery cannot be exact any more. It is however expected (as Proposition 17 and the simulations tend to suggest) that the estimate will still be close to the true parameter vector.
The current paper intends to present a new analysis of the nonsmooth optimization approach and to provide some elements for further understanding its behavior. The line of analysis goes from a full characterization of the nonsmooth-optimization-based estimator (both for SISO and MIMO systems) to the study of robustness to outliers, including in the presence of dense noise. With respect to the relevant works [7, 37, 1, 26, 42], we derive new bounds on the number of outliers (in the least favorable situations) that the estimator is capable of accommodating. It is emphasized that a quite broad spectrum of such bounds can be derived from the basic characterization of the nonsmooth identifier. Note however that evaluating numerically the tightest of these bounds is computationally expensive, while less tight bounds have a more affordable complexity. Some of the bounds developed in this contribution meet both a relative tightness requirement and computability in polynomial time (see Theorem 11). Finally, the paper shows how the results derived in the first part for the $\ell_1$-norm estimator, when applied to the estimation of SISO systems, generalize to multivariable systems.

#### 1.1.0.6 Outline of this paper

The outline of the paper is as follows. We start in Section 2.1 by viewing the nonsmooth optimization problem as the convex relaxation of an (ideal) combinatorial $\ell_0$-"norm" formulation. We then derive in Section 2.2 necessary and sufficient conditions for optimality. Based on those conditions we establish in Section 2.4 new sufficient conditions for exact recovery of the true parameter vector $\theta_o$ in (1). The noisy case is treated in Section 3.2. Section 4 presents a generalization of the earlier discussions to multi-output systems. Finally, numerical experiments are described in Section 5 and concluding remarks are given in Section 6.

### 1.2 Notations

Let $I = \{1, \ldots, N\}$ be the index set of the measurements. For any $\theta \in \mathbb{R}^n$, define a partition of the set of indices by $I_+(\theta) = \{t \in I : y_t - x_t^\top\theta > 0\}$, $I_-(\theta) = \{t \in I : y_t - x_t^\top\theta < 0\}$, $I_0(\theta) = \{t \in I : y_t - x_t^\top\theta = 0\}$. Cardinality of a finite set. Throughout the paper, whenever $S$ is a finite set, the notation $|S|$ will refer to the cardinality of $S$. However, for a real number $a$, $|a|$ will denote the absolute value of $a$.
Submatrices and subvectors. Let $X = [x_1 \ \cdots \ x_N] \in \mathbb{R}^{n \times N}$ be the matrix formed with the available regressors. If $S \subset I$, the notation $X_S$ denotes a matrix in $\mathbb{R}^{n \times |S|}$ formed with the columns of $X$ indexed by $S$. Likewise, with $v \in \mathbb{R}^N$, $v_S$ is the vector in $\mathbb{R}^{|S|}$ formed with the entries of $v$ indexed by $S$. We will use the convention that $X_S = 0$ (resp. $v_S = 0$) when the index set $S$ is empty.
Vector norms. $\|\cdot\|_p$, $1 \le p \le \infty$, denote the usual $\ell_p$-norms for vectors, defined for any vector $v = [v_1 \ \cdots \ v_N]^\top$ by $\|v\|_p = \big(\sum_{t=1}^N |v_t|^p\big)^{1/p}$ for $p < \infty$ and $\|v\|_\infty = \max_t |v_t|$. The $\ell_0$-"norm" of $v$ is defined to be the number of nonzero entries in $v$, i.e., $\|v\|_0 = |\{t \in I : v_t \neq 0\}|$.
Matrix norms. The following matrix norms will be used: the induced norms $\|\cdot\|_p$ and the columnwise norm $\|\cdot\|_{2,\mathrm{col}}$. They are defined as follows: for a matrix $A = [a_1 \ \cdots \ a_N] \in \mathbb{R}^{n \times N}$,

 $\|A\|_p = \sup_{x \in \mathbb{R}^N, \|x\|_p = 1} \|Ax\|_p, \qquad \|A\|_{2,\mathrm{col}} = \sum_{i=1}^N \|a_i\|_2,$

## 2 Nonsmooth optimization for the estimation problem

### 2.1 Sparse optimization

The main idea for identifying the parameter vector $\theta_o$ from (1) is by solving a sparse optimization problem, that is, a problem which involves the minimization of the number of nonzero entries in the error vector. To be more specific, assume for the time being that the noise sequence $(e_t)$ is identically equal to zero. Consider a candidate parameter vector $\theta$ and let

 $\phi(\theta) = y - X^\top \theta,$

where $y = [y_1 \ \cdots \ y_N]^\top$ and $X = [x_1 \ \cdots \ x_N]$, $\phi(\theta)$ being the fitting error vector induced by $\theta$ on the experimental data. Then the vector $\theta_o$ can naturally be searched for by minimizing the objective function

 $\min_{\theta \in \mathbb{R}^n} \|\phi(\theta)\|_0$  (4)

where $\|\cdot\|_0$ denotes the pseudo-norm which counts the number of nonzero entries. Because problem (4) aims at making the error sparse by minimizing the number of nonzero elements (or maximizing the number of zeros), it is sometimes called a sparse optimization problem [1]. As can be intuitively guessed, the recoverability of the true parameter vector $\theta_o$ from (4) depends naturally on some properties of the available data. This is outlined by the following lemma.

###### Lemma 1 (A sufficient condition for ℓ0 recovery)

Assume that $(e_t)$ is identically equal to zero and let $n_0 = |I_0(\theta_o)|$ denote the number of zero elements of $(f_t)$. Assume that for any $S \subset I$ with $|S| > n_0$, $y_S \notin \mathcal{R}(X_S^\top)$, with $\mathcal{R}$ referring here to range space. Then it holds that

 $\theta_o \in \arg\min_\theta \|\phi(\theta)\|_0.$  (5)

PROOF. We proceed by contradiction. Assume that (5) does not hold, i.e., $\min_\theta \|\phi(\theta)\|_0 < \|\phi(\theta_o)\|_0$. Then, by letting $\bar\theta$ be any vector in $\arg\min_\theta \|\phi(\theta)\|_0$, the above inequality translates into $\|\phi(\bar\theta)\|_0 < \|\phi(\theta_o)\|_0 = N - n_0$, where $n_0 = |I_0(\theta_o)|$. It follows that $|I_0(\bar\theta)| > n_0$ because $n_0$ is the exact (largest) number of zero elements in the sequence $(\phi_t(\theta_o))$. Note also that we have necessarily $\bar\theta \neq \theta_o$. On the other hand, with $S = I_0(\bar\theta)$, it can be seen that $y_S = X_S^\top \bar\theta$, i.e., $y_S \in \mathcal{R}(X_S^\top)$. This, together with $|S| > n_0$, constitutes a contradiction to the assumption of the Lemma. Hence, (5) holds as claimed.

Lemma 1 specifies a condition involving both $X$ and $y$ under which $\theta_o$ lies in the solution set of (4), but it does not ensure that $\theta_o$ will be recovered uniquely from the data. Before proceeding further, we recall from [1] a sufficient condition under which $\theta_o$ is the unique solution to (4).

###### Definition 2 ([1] An integer measure of genericity)

Let $X \in \mathbb{R}^{n \times N}$ be a data matrix satisfying $\mathrm{rank}(X) = n$. The $n$-genericity index of $X$, denoted $\nu_n(X)$, is defined as the minimum integer $m$ such that any submatrix $X_S$ of $X$ with $|S| = m$ columns has rank $n$,

 $\nu_n(X) = \min\{m : \forall S \subset I \ \text{with} \ |S| = m, \ \mathrm{rank}(X_S) = n\}.$  (6)
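For small datasets, $\nu_n(X)$ can be computed directly from Definition 2 by enumerating column subsets. The following sketch is our own illustration (not from the paper); the function name is hypothetical and the cost is exponential in $N$, so it is only meant to make the definition concrete:

```python
from itertools import combinations

import numpy as np

def genericity_index(X):
    """Smallest m such that EVERY m-column submatrix of X has rank n.

    Brute-force implementation of Definition 2; exponential in N,
    intended only to illustrate the definition on small examples.
    """
    n, N = X.shape
    assert np.linalg.matrix_rank(X) == n, "X must have full row rank"
    for m in range(n, N + 1):
        if all(np.linalg.matrix_rank(X[:, list(S)]) == n
               for S in combinations(range(N), m)):
            return m
    return N

# Two collinear columns force nu_2(X) = 3: the pair containing both
# of them is rank deficient, but every triple of columns has rank 2.
X = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
```

A perfectly generic $X$ (columns in general linear position) attains the lower bound $\nu_n(X) = n$, which by Theorem 3 below maximizes the number of recoverable outliers.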
###### Theorem 3 ([1] Sufficient condition for ℓ0 recovery)

Assume that the sequence $(e_t)$ in (1) is identically equal to zero. If the sequence $(f_t)$ in (1) contains enough zero values in the sense that

 $|I_0(\theta_o)| = |\{t \in I : f_t = 0\}| \ \ge\ \dfrac{N + \nu_n(X)}{2},$  (7)

then $\theta_o$ is the unique solution to the $\ell_0$-norm minimization problem (4).

In other words, if the number of nonzero gross errors affecting the data generated by (1) does not exceed the threshold $(N - \nu_n(X))/2$, then $\theta_o$ can be exactly recovered by solving (4). Unfortunately, this problem is a hard combinatorial optimization problem. A tractable solution can be obtained by relaxing the $\ell_0$-norm into its best convex approximant, the $\ell_1$-norm. Doing this substitution in (4) gives

 $\min_{\theta \in \mathbb{R}^n} \|\phi(\theta)\|_1$  (8)

with $\|\phi(\theta)\|_1 = \sum_{t=1}^N |y_t - x_t^\top\theta|$. The latter problem is termed a nonsmooth convex optimization problem [27, Chap. 3] because the objective function is convex but non-differentiable. Compared to (4), problem (8) has the advantage of being convex and can hence be efficiently solved by many existing numerical solvers, e.g., [18]. Note further that it can be written as a linear programming problem. This relaxation has been used intensively in the compressed sensing literature [14] for approaching the sparsest solution of an underdetermined set of linear equations. In the robust statistics literature surveyed above, (8) corresponds to a well-known estimator referred to as the least absolute deviation estimator [34]. As will be shown next, the underlying reason why problem (8) can recover the true parameter vector despite the presence of gross perturbations is related to its nonsmoothness.
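To make the relaxation concrete, here is a minimal sketch (our own illustration, not the authors' code; `lad_estimate` is a hypothetical name) of problem (8) written as a linear program, using the standard trick of replacing each absolute residual by a slack variable:

```python
import numpy as np
from scipy.optimize import linprog

def lad_estimate(A, y):
    """Least absolute deviation: argmin_theta ||y - A @ theta||_1.

    LP reformulation: minimize sum(u) subject to
    -u <= y - A @ theta <= u, with theta free and u >= 0.
    """
    N, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(N)])   # cost: sum of slacks u
    I = np.eye(N)
    # A @ theta - u <= y   and   -A @ theta - u <= -y
    A_ub = np.block([[A, -I], [-A, -I]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * n + [(0, None)] * N   # theta free, u >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n]

# With a constant regressor (n = 1), the LAD estimate is the sample
# median, which ignores the gross error 100 entirely.
theta = lad_estimate(np.ones((3, 1)), np.array([1.0, 2.0, 100.0]))
```

Here `theta` recovers the median value 2 exactly, whereas the least-squares estimate would be pulled toward the mean of about 34.3; this insensitivity to a single arbitrarily large error is precisely the phenomenon analyzed in the sequel.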

### 2.2 Solution to the ℓ1 problem

There is a wealth of analysis in the literature of compressed sensing investigating under which conditions problems of similar structure as (4) and (8) can yield the same solution. (Footnote: Those problems look for the sparsest solution to an underdetermined set of linear equations. As such they are similar to, but different from, the problem studied in the current paper. Note that the process of converting problems (4) and (8) into the format treated in compressed sensing yields a system of linear equations which is much less underdetermined.) This analysis is mainly based on the concepts of mutual coherence [14] and the Restricted Isometry Property [8]. Here, we shall propose a parallel but different analysis for the robust estimation problem. We start by characterizing the solution to the $\ell_1$-norm problem (8).

###### Theorem 4 (Solution to the ℓ1 problem)

A vector $\theta^\star$ solves the $\ell_1$-norm problem (8) if and only if any of the following equivalent statements holds:

1. There exist some numbers $\lambda_t \in [-1, 1]$, $t \in I_0(\theta^\star)$, such that (Footnote: Eq. (9) should be understood here with the implicit convention that any of the three terms is equal to zero whenever the corresponding index set is empty.)

 $\sum_{t \in I_+(\theta^\star)} x_t - \sum_{t \in I_-(\theta^\star)} x_t = \sum_{t \in I_0(\theta^\star)} \lambda_t x_t.$  (9)
2. For any $\eta \in \mathbb{R}^n$,

 $\Big|\sum_{t \in I_+(\theta^\star)} \eta^\top x_t - \sum_{t \in I_-(\theta^\star)} \eta^\top x_t\Big| \ \le\ \sum_{t \in I_0(\theta^\star)} |\eta^\top x_t|.$  (10)
3. The optimal value of the optimization problem

 $\min_\alpha \|\alpha\|_\infty \ \ \text{subject to} \ \ z = X_{I_0(\theta^\star)} \alpha,$  (11)

where $z = \sum_{t \in I_+(\theta^\star)} x_t - \sum_{t \in I_-(\theta^\star)} x_t$ and $\alpha \in \mathbb{R}^{|I_0(\theta^\star)|}$, is less than or equal to $1$.

Moreover, the solution $\theta^\star$ is unique if and only if any of the following statements is true:

1. (9) holds and $\mathrm{rank}(X_S) = n$, where

 $S = \{t \in I_0(\theta^\star) : |\lambda_t| < 1\}.$
2. (10) holds with a strict inequality symbol for all $\eta \in \mathbb{R}^n$, $\eta \neq 0$.

PROOF.

#### 2.2.0.1 Proof of S1

Since $\theta \mapsto \|\phi(\theta)\|_1$ is a proper convex function, it has a nonempty subdifferential [33]. The necessary and sufficient condition for $\theta^\star$ to be a solution of (8) is then

 $0 \in \partial\|\phi(\theta^\star)\|_1,$  (12)

where the notation $\partial$ refers to the subdifferential with respect to $\theta$. Indeed, using additivity of subdifferentials, it is straightforward to write

 $\partial\|\phi(\theta^\star)\|_1 = -\sum_{t \in I_+(\theta^\star)} x_t + \sum_{t \in I_-(\theta^\star)} x_t + \sum_{t \in I_0(\theta^\star)} \mathrm{conv}\{-x_t, x_t\},$  (13)

where $\mathrm{conv}$ refers to the convex hull. Here, the addition symbol is meant in the Minkowski-sum sense. It follows that $0 \in \partial\|\phi(\theta^\star)\|_1$ is equivalent to the existence of a set of numbers $\lambda_t$ in $[-1, 1]$, $t \in I_0(\theta^\star)$, such that (9) holds.

#### 2.2.0.2 Proof of S2

Define two functions by $q(\theta) = \sum_{t \notin I_0(\theta^\star)} |y_t - x_t^\top\theta|$ and $h(\theta) = \sum_{t \in I_0(\theta^\star)} |y_t - x_t^\top\theta|$. Then $\|\phi(\theta)\|_1 = q(\theta) + h(\theta)$ and $q$ is differentiable at $\theta^\star$. It follows that $\partial\|\phi(\theta^\star)\|_1 = \nabla q(\theta^\star) + \partial h(\theta^\star)$, where $\nabla q(\theta^\star)$ is the gradient of $q$ at $\theta^\star$. We can hence write

 $\theta^\star \ \text{minimizes} \ \|\phi(\theta)\|_1 \ \Leftrightarrow\ 0 \in \partial\|\phi(\theta^\star)\|_1 \ \Leftrightarrow\ -\nabla q(\theta^\star) \in \partial h(\theta^\star).$

Note from (13) that $\partial h(\theta^\star) = \sum_{t \in I_0(\theta^\star)} \mathrm{conv}\{-x_t, x_t\}$, so that $-\nabla q(\theta^\star) \in \partial h(\theta^\star)$ if and only if $-\nabla q(\theta^\star) = \sum_{t \in I_0(\theta^\star)} \lambda_t x_t$ for some numbers $\lambda_t \in [-1, 1]$, and this in turn is equivalent to $|\nabla q(\theta^\star)^\top \eta| \le \sum_{t \in I_0(\theta^\star)} |\eta^\top x_t|$, for all $\eta \in \mathbb{R}^n$. It follows that $\theta^\star$ minimizes $\|\phi(\theta)\|_1$ if and only if

 $|\nabla q(\theta^\star)^\top(\theta - \theta^\star)| \ \le\ h(\theta) - h(\theta^\star) = \sum_{t \in I_0(\theta^\star)} |(\theta - \theta^\star)^\top x_t|$  (14)

for all $\theta \in \mathbb{R}^n$. The last equality is obtained by using the fact that $y_t = x_t^\top\theta^\star$ for all $t$ in $I_0(\theta^\star)$. Finally, the result follows by setting $\eta = \theta - \theta^\star$ and noting that $\nabla q(\theta^\star) = -\sum_{t \in I_+(\theta^\star)} x_t + \sum_{t \in I_-(\theta^\star)} x_t$.

#### 2.2.0.3 S1 ⇔ S3

The proof of the last equivalence is immediate.

#### 2.2.0.4 Uniqueness

For convenience, we first prove S2'. Along the lines of the proof of S2 (see Eq. (14) and the preceding arguments), we can see that strict inequality in (10) is equivalent to the strict inequality $h(\theta) - h(\theta^\star) > -\nabla q(\theta^\star)^\top(\theta - \theta^\star)$ for all $\theta \neq \theta^\star$. On the other hand, convexity of $q$ gives $q(\theta) - q(\theta^\star) \ge \nabla q(\theta^\star)^\top(\theta - \theta^\star)$. Summing the two yields

 $\|\phi(\theta)\|_1 = q(\theta) + h(\theta) > q(\theta^\star) + h(\theta^\star) = \|\phi(\theta^\star)\|_1 \quad \text{for all} \ \theta \neq \theta^\star.$

Hence S2’ is proved.
For the proof of S1’, we proceed in two steps.
Sufficiency. Assume $\mathrm{rank}(X_S) = n$. Then for any nonzero vector $\eta$ there is at least one $t \in S$ such that $\eta^\top x_t \neq 0$. Recall that $|\lambda_t| < 1$ for $t \in S$ by definition of $S$. It follows that multiplying (9) on the left by $\eta^\top$, with $\eta$ an arbitrary nonzero vector, and taking the absolute value yields (10) with a strict inequality symbol. We can therefore apply the proof of S2' to conclude that $\theta^\star$ is unique.
Necessity. Assume $\mathrm{rank}(X_S) < n$. Then pick any nonzero vector $\eta$ in the kernel of $X_S^\top$. Set $\theta = \theta^\star + \epsilon\eta$ with $\epsilon > 0$. Indeed, $\epsilon$ can be chosen sufficiently small such that $y_t - x_t^\top\theta$ has the same sign as $y_t - x_t^\top\theta^\star$ for $t \in I_+(\theta^\star) \cup I_-(\theta^\star)$. For such values of $\epsilon$ we have $I_+(\theta^\star) \subset I_+(\theta)$ and $I_-(\theta^\star) \subset I_-(\theta)$. Moreover, since $\eta^\top x_t = 0$ for $t \in S$, $x_t^\top\theta = x_t^\top\theta^\star$ for all $t \in S$, so that $S \subset I_0(\theta)$. Finally, it remains to re-assign the indices contained in $I_0(\theta^\star) \setminus S$, for which $|\lambda_t| = 1$. We get the partition $I_+(\theta) = I_+(\theta^\star) \cup \{t \in I_0(\theta^\star) \setminus S : y_t - x_t^\top\theta > 0\}$, $I_-(\theta) = I_-(\theta^\star) \cup \{t \in I_0(\theta^\star) \setminus S : y_t - x_t^\top\theta < 0\}$, $I_0(\theta) = S$. It can then be checked that $\theta \neq \theta^\star$ also satisfies (9) with the sequence $(\lambda_t)_{t \in S}$ and is therefore a minimizer. In conclusion, if $\mathrm{rank}(X_S) < n$, the minimizer cannot be unique.

• One first consequence of the theorem is that $\theta_o$ can be computed exactly from a finite set of erroneous data (by solving problem (8)) provided it satisfies the conditions S1' or S2' of the theorem. Note that there is no explicit boundedness condition imposed on the error sequence $(f_t)$. Hence the nonzero errors in this sequence can have arbitrarily large magnitudes as long as the optimization problem makes sense, i.e., provided each $f_t$ remains finite.

• Second, the true parameter vector $\theta_o$ can be exactly recovered in the presence of, say, infinitely many nonzero errors (see also Proposition 6). For example, if the regressors satisfy

 $\sum_{t \in I_+(\theta_o)} x_t - \sum_{t \in I_-(\theta_o)} x_t = 0,$

and $\mathrm{rank}(X_{I_0(\theta_o)}) = n$, then by condition S2' $\theta_o$ is the unique solution to problem (8), regardless of the number of errors affecting the data.

• Third, if problem (8) admits a solution $\theta^\star$ that satisfies $y_t \neq x_t^\top\theta^\star$ for all $t \in I$, then $\theta^\star$ is non-unique. In effect, in this case $I_0(\theta^\star) = \emptyset$ and so $S = \emptyset$, which, by Theorem 4, implies non-uniqueness. Indeed this is typically the case whenever the noise $(e_t)$ is nonzero.
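Statement S3 of Theorem 4 is itself a small linear program, so the $\ell_1$-optimality of a candidate $\theta^\star$ can be certified numerically. The following sketch is our own illustration under the noise-free setting; the function name is hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

def s3_certificate(X, y, theta, tol=1e-9):
    """Optimal value of (11): min ||alpha||_inf s.t. z = X_I0 @ alpha.

    theta solves the ell_1 problem (8) iff this value is <= 1
    (returns np.inf when the equality system is infeasible).
    """
    r = y - X.T @ theta                       # residuals
    z = X[:, r > tol].sum(axis=1) - X[:, r < -tol].sum(axis=1)
    X0 = X[:, np.abs(r) <= tol]               # columns indexed by I0(theta)
    m = X0.shape[1]
    # variables [alpha (free); t >= 0], minimize t with |alpha_i| <= t
    c = np.concatenate([np.zeros(m), [1.0]])
    A_ub = np.block([[np.eye(m), -np.ones((m, 1))],
                     [-np.eye(m), -np.ones((m, 1))]])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * m),
                  A_eq=np.hstack([X0, np.zeros((X0.shape[0], 1))]),
                  b_eq=z, bounds=[(None, None)] * m + [(0, None)],
                  method="highs")
    return res.fun if res.success else np.inf

# One gross error among five constant regressors: the certificate
# value is 0.25 <= 1, so theta = 3 is ell_1-optimal.
X = np.ones((1, 5))
y = np.array([8.0, 3.0, 3.0, 3.0, 3.0])     # theta_o = 3, one outlier f = 5
val = s3_certificate(X, y, np.array([3.0]))
```

Here $z = x_1 = 1$ and the four zero-residual columns share the load as $\alpha_i = 1/4$, so the certificate evaluates to $0.25 \le 1$, confirming optimality of $\theta = 3$ (the median of $y$).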

Another immediate consequence of Theorem 4 can be stated as follows.

###### Corollary 5 (On the special case of affine model)

If the model (1) is affine in the sense that the regressor has the form $x_t = [\,\tilde x_t^\top \ \ 1\,]^\top$, with $\tilde x_t \in \mathbb{R}^{n-1}$, then a necessary condition for $\theta^\star$ to solve problem (8) is that

 $\big|\,|I_+(\theta^\star)| - |I_-(\theta^\star)|\,\big| \ \le\ |I_0(\theta^\star)|.$  (15)

Here, the outer bars refer to absolute value while the inner ones which apply to sets refer to cardinality.

PROOF. The proof is immediate by considering the condition (10) and taking $\eta = [0 \ \cdots \ 0 \ \ 1]^\top$.

Eq. (15) implies that if the measurement model is affine and all the nonzero $f_t$'s have the same sign, i.e., if one of the cardinalities $|I_+(\theta_o)|$ or $|I_-(\theta_o)|$ is equal to zero, then problem (8) cannot find the true $\theta_o$ whenever more than $N/2$ of the elements of the sequence $(f_t)$ are nonzero.
Next, we discuss a special case in which the true parameter vector $\theta_o$ in (1) can, in principle, be obtained asymptotically in the presence of an infinite number of nonzero errors $f_t$.

###### Proposition 6 (Infinite number of outliers)

Assume that the noise sequence $(e_t)$ in (1) is identically equal to zero. Assume further that the data are generated such that:

• There is a set $I_f \subset I$, of possibly unbounded cardinality, such that $f_t \neq 0$ for $t \in I_f$ and $f_t = 0$ for $t \notin I_f$.

• For any $t \in I_f$, $f_t$ is sampled, independently of $(x_t)$, from a distribution which is symmetric around zero.

• The regression vector sequence $(x_t)$ is drawn from a probability distribution having a finite first moment.

Then

 $\lim_{N \to \infty} \arg\min_{\theta \in \mathbb{R}^n} \frac{1}{N}\sum_{t=1}^N |y_t - x_t^\top\theta| = \{\theta_o\}$  (16)

with probability one.

PROOF. Under the conditions of the proposition, we have $\mathbb{P}(f_t > 0) = \mathbb{P}(f_t < 0)$ for $t \in I_f$, where $\mathbb{P}$ denotes the probability measure. It follows that $|I_+(\theta_o)|$ and $|I_-(\theta_o)|$ go jointly to infinity as the total number $N$ of samples tends to infinity. Hence, the expressions $\frac{1}{|I_+(\theta_o)|}\sum_{t \in I_+(\theta_o)} x_t$ and $\frac{1}{|I_-(\theta_o)|}\sum_{t \in I_-(\theta_o)} x_t$ are both sample estimates for the expectation of the process $(x_t)$. By the law of large numbers, as $N \to \infty$, the two quantities converge to the true expectation of the process with probability one, so that

 $\lim_{N \to \infty}\Big[\frac{1}{|I_+(\theta_o)|}\sum_{t \in I_+(\theta_o)} x_t - \frac{1}{|I_-(\theta_o)|}\sum_{t \in I_-(\theta_o)} x_t\Big] = 0.$

As a consequence, $\theta_o$ satisfies condition S1' of Theorem 4 asymptotically, with $\lambda_t = 0$ for any $t \in I_0(\theta_o)$. Hence the solution of the minimization problem tends to $\theta_o$ with probability one as the number of samples approaches infinity.
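Proposition 6 is easy to visualize in the scalar case $x_t \equiv 1$, where the $\ell_1$ estimator reduces to the sample median. A quick simulation (our own illustration, not from the paper's experiments) with roughly 60% of the samples hit by large symmetric-sign outliers:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_o, N = 3.0, 1001
f = np.zeros(N)
outliers = rng.random(N) < 0.6               # ~60% gross errors
# symmetric around zero, arbitrarily large magnitudes
signs = rng.choice([-1.0, 1.0], outliers.sum())
f[outliers] = signs * rng.uniform(5, 50, outliers.sum())
y = theta_o + f                               # noise-free model (1) with x_t = 1
theta_hat = np.median(y)                      # = argmin_theta sum_t |y_t - theta|
```

Because the outlier signs are balanced, the clean samples (whose residuals are exactly zero) sit in the middle of the sorted data, and in this run the median returns $\theta_o = 3$ exactly even though well over half of the measurements are grossly corrupted.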

### 2.3 Worst-case necessary and sufficient conditions

The conditions (9)-(11) derived in Theorem 4 completely characterize the solution to the $\ell_1$ problem. However such conditions depend on which data points are affected by the gross errors and on the signs of the corresponding residuals. We wish now to find necessary and sufficient conditions that depend solely on the number of gross errors (or, equivalently, on the number of zero elements in the sequence $(f_t)$).

###### Corollary 7 (Necessary and sufficient conditions)

Let $d \le N$ be an integer. Then the following statements are equivalent:

1.  $\forall \theta \in \mathbb{R}^n, \ \forall y \in \mathbb{R}^N: \quad |I_0(\theta)| \ge d \ \Rightarrow\ \theta \in \arg\min_{w \in \mathbb{R}^n} \|\phi(w)\|_1$  (17)
2.  $\max_{(I, I^c): |I| = d} \ \max_{\eta \in \mathbb{R}^n: \|X^\top\eta\|_1 = 1} \|X_{I^c}^\top \eta\|_1 \ \le\ \dfrac{1}{2}$  (18)
3.  $\max_{(I, I^c): |I| = d} \ \max_{h \in \{\pm 1\}^{|I^c|}} \ \min_{\alpha \in \mathbb{R}^{|I|}} \{\|\alpha\|_\infty \ \ \text{s.t.} \ \ X_{I^c} h = X_I \alpha\} \ \le\ 1$  (19)

In (18)-(19) and similar equations in the paper, the leftmost maximum is taken over the set of those partitions $(I, I^c)$ of the index set that satisfy $|I| = d$. Eq. (19) should be read with the implicit assumption that the inequality fails to hold whenever the optimization problem is not feasible.

PROOF. [of Corollary 7] That (ii) and (iii) are equivalent results directly from the equivalence of (10) and (11) in Theorem 4. To see this, let $\theta^\star$ be a solution to problem (8) and set $h \in \{\pm 1\}^N$ such that $h_t = 1$ if $t \in I_+(\theta^\star)$ and $h_t = -1$ if $t \in I_-(\theta^\star)$. Then, with $I = I_0(\theta^\star)$, Eq. (10) can be written as

 $|\eta^\top X_{I^c} h_{I^c}| \ \le\ \|X_I^\top \eta\|_1 \quad \forall \eta \in \mathbb{R}^n$  (20)

 $\Leftrightarrow \quad \min_{\alpha_I}\{\|\alpha_I\|_\infty \ \ \text{s.t.} \ \ X_{I^c} h_{I^c} = X_I \alpha_I\} \ \le\ 1.$  (21)

The equivalence (ii) $\Leftrightarrow$ (iii) then follows by applying the chains of maxima to each of the equations (20) and (21) and noting that $\max_h |\eta^\top X_{I^c} h_{I^c}| = \|X_{I^c}^\top \eta\|_1$.

We shall now establish the equivalence (i) $\Leftrightarrow$ (ii). Let $\theta$ and $y$ be any vectors such that $|I_0(\theta)| \ge d$. The so-defined $I_0(\theta)$ can be any subset of the index set satisfying this cardinality constraint. Hence any such $\theta$ solves problem (8) if and only if (20) holds for any partition $(I, I^c)$ with $|I| \ge d$ and for any $h_{I^c} \in \{\pm 1\}^{|I^c|}$. This is equivalent to Eq. (18).

Finally, let us observe that

 $\max_{(I, I^c): |I| = d} \ \max_{\eta \in \mathbb{R}^n}\{\|X_{I^c}^\top \eta\|_1 \ \ \text{s.t.} \ \ \|X^\top \eta\|_1 = 1\}$

is a decreasing function of $d$, so that if (18) holds for some $d$, it also holds for any $d' \ge d$. It follows that (i) $\Leftrightarrow$ (ii), which completes the proof.

It should be mentioned that the equivalence (i) $\Leftrightarrow$ (ii) was also obtained in earlier papers, see e.g., [42, 43]. Uniqueness of the solution follows in a similar way as in the proof of Corollary 7 by invoking conditions S1' and S2' of Theorem 4.

###### Corollary 8 (Uniqueness)

Let $d \le N$ be an integer. Then the following statements are equivalent.

1.  $\forall \theta \in \mathbb{R}^n, \ \forall y \in \mathbb{R}^N: \quad |I_0(\theta)| \ge d \ \Rightarrow\ \arg\min_{w \in \mathbb{R}^n} \|\phi(w)\|_1 = \{\theta\}$  (22)
2. Eq. (18) holds with strict inequality.

3.  $\max_{(I, I^c): |I| = d} \ \max_{h \in \{\pm 1\}^{|I^c|}} \ \min_\alpha \{\|\alpha\|_\infty \ \ \text{s.t.} \ \ X_{I^c} h = X_I \alpha, \ \exists S \subset I, \ \mathrm{rank}(X_S) = n, \ \|\alpha_S\|_\infty < 1\} \ \le\ 1$  (23)
###### Remark 9

Corollaries 7 and 8 imply the following. If there exists an integer $d$ such that (18) or (19) holds and $|I_0(\theta_o)| \ge d$, then $\theta_o \in \arg\min_w \|\phi(w)\|_1$. It follows under these conditions that whenever $\theta$ solves the problem (4), it also solves the problem (8). In particular, $\arg\min_w \|\phi(w)\|_0 \subset \arg\min_w \|\phi(w)\|_1$.

It should be noted that when the data are noise-free, there always exists a $d$ such that (17)-(19) hold. For example, $d = N$ is the maximum possible value that satisfies these conditions. Let us denote by $\pi_o(X)$ the minimum integer $d$ such that the conditions (17)-(19) hold, that is,

 $\pi_o(X) = \min\{d \in I \ \ \text{s.t. Eq. (17) is true}\}.$  (24)

Such a number depends only on the matrix $X$. It can be viewed as a measure of the richness properties of the regressor matrix $X$. Recoverability of the true parameter vector by the least $\ell_1$-norm estimator (8) in the face of gross errors is enhanced when $\pi_o(X)$ is small. We may hence say that the smaller $\pi_o(X)$, the richer (or more generic) the regressors in $X$ are.

Computing $\pi_o(X)$ directly from the definition (24) is a hard combinatorial problem with a complexity comparable to that of the problem (4). An algorithm of slightly reduced, but still combinatorial, complexity has been derived in [37] for this purpose. Here, we ask whether $\pi_o(X)$ can be more cheaply estimated in a somewhat efficient way. Such estimates are most likely over-estimates and lead to sufficient conditions for exact recoverability of the parameter vector $\theta_o$ in the presence of the gross error sequence $(f_t)$.
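For intuition, $\pi_o(X)$ can be evaluated on toy problems by enumerating the partitions and sign vectors in (19) and solving each inner problem as a small LP. The sketch below is our own illustration (function names hypothetical); its cost is exponential in $N$ and it is in no way a substitute for the polynomial-time bounds developed next:

```python
from itertools import combinations, product

import numpy as np
from scipy.optimize import linprog

def holds_for_d(X, d):
    """Check condition (19) for a given d by exhaustive enumeration."""
    n, N = X.shape
    for I in combinations(range(N), d):
        Ic = [t for t in range(N) if t not in I]
        XI, XIc = X[:, list(I)], X[:, Ic]
        for h in product([-1.0, 1.0], repeat=len(Ic)):
            z = XIc @ np.array(h)
            # min ||alpha||_inf s.t. XI @ alpha = z  (variables [alpha; t])
            c = np.concatenate([np.zeros(d), [1.0]])
            A_ub = np.block([[np.eye(d), -np.ones((d, 1))],
                             [-np.eye(d), -np.ones((d, 1))]])
            res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * d),
                          A_eq=np.hstack([XI, np.zeros((n, 1))]), b_eq=z,
                          bounds=[(None, None)] * d + [(0, None)],
                          method="highs")
            # infeasible counts as failure, per the convention after (19)
            if not res.success or res.fun > 1 + 1e-9:
                return False
    return True

def pi_o(X):
    """pi_o(X) from (24): smallest d for which condition (19) holds."""
    N = X.shape[1]
    return next(d for d in range(1, N + 1) if holds_for_d(X, d))
```

For example, with $X = [1 \ 1 \ 1]$ (the scalar median case), the inner optimum is $|\sum_t h_t|/d$, so condition (19) first holds at $d = 2$: the median is guaranteed optimal as soon as two of the three residuals vanish.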

### 2.4 Sufficient conditions of recoverability by convex optimization

We start by introducing the following notations:

 $v_1(k) = \max_{(I, I^c): |I| = k \ge \nu_n(X)} \|X_I^\top (X_I X_I^\top)^{-1} X_{I^c}\|_\infty$  (25)

 $v_2(k) = \max_{(I, I^c): |I| = k} \|X_{I^c}^\top (X X^\top)^{-1} X\|$