# Instance Optimal Decoding and the Restricted Isometry Property

Nicolas Keriven          Rémi Gribonval
###### Abstract

In this paper, we address the question of information preservation in ill-posed, non-linear inverse problems, assuming that the measured data is close to a low-dimensional model set. We provide necessary and sufficient conditions for the existence of a so-called instance optimal decoder, i.e., one that is robust to noise and modelling error. Inspired by existing results in compressive sensing, our analysis is based on a (Lower) Restricted Isometry Property (LRIP), formulated in a non-linear fashion. We also provide sufficient conditions for non-uniform recovery with random measurement operators, with a new formulation of the LRIP. We finish by describing typical strategies to prove the LRIP in both linear and non-linear cases, and illustrate our results by studying the invertibility of a one-layer neural net with random weights.

• École Normale Supérieure. 45 rue d’Ulm, 75005 Paris, France.

• E-mail: nicolas.keriven@ens.fr

• Université Rennes 1, Inria, CNRS, IRISA. F-35000 Rennes, France.

• E-mail: remi.gribonval@inria.fr

## 1 Introduction

Inverse problems are ubiquitous in all areas of data science. While linear inverse problems have been arguably far more studied in the literature, some frameworks are intrinsically non-linear [14]. In this paper, we aim at giving a characterization of the preservation of information in ill-posed inverse problems, regularized by the introduction of a “low-dimensional” model set close to which the data of interest is assumed to live. We consider a very general context that includes, in particular, measurement operators that are possibly non-linear, and an ambient space that can be any pseudometric set. Our main results show that the existence of a decoder that is robust to noise and modelling error is equivalent to a modified Restricted Isometry Property (RIP), which is a classical property in compressive sensing [9]. We thus outline the fundamental nature of the RIP in settings that are more general than previously studied.

The problem is formulated as follows. Let $(E, d_E)$ be a set equipped with a pseudometric $d_E$ (a pseudometric satisfies all the requirements of a metric except that $d_E(x, x') = 0$ does not imply $x = x'$), the set of data, and $(F, \|\cdot\|_F)$ a seminormed vector space (similarly, a seminorm satisfies the requirements of a norm except that $\|y\|_F = 0$ does not imply $y = 0$), the space of measurements. Consider a (possibly non-linear) measurement map $\Psi: E \to F$. The measured vector is:

$$y = \Psi(x^\star) + e \tag{1}$$

where $e \in F$ is measurement noise and $x^\star \in E$ is the true signal. Our goal is to characterize the existence of any procedure that would allow us to approximately recover the data $x^\star$ from $y$.

#### Regularization.

In most interesting problems, the “dimension” of the space $F$ is far lower than that of the set $E$ (in a loose sense: we recall that neither is required to be finite-dimensional, and $E$ is not necessarily a vector space), which makes the problem ill-posed, meaning that there are information-theoretic limits that prevent us from recovering the underlying signal from the measurements, even in the noiseless case. A classical regularization technique is to introduce prior knowledge about the true signal $x^\star$. Here we consider a model set $S \subset E$ of “simple” signals, such that $x^\star$ is likely to be close to $S$. For instance, sparsity, the assumption that the true signal is a linear combination of a few elements in a well-chosen dictionary, is a heavily studied prior in modern signal processing, in particular in compressive sensing [15].

#### Instance Optimal Decoding.

Ideally, a decoder $\Delta$ must be able to exactly retrieve $x^\star$ from $y$ when the modelling is exact ($x^\star \in S$) and the noise is zero ($e = 0$). However, as these conditions are highly unrealistic in practice, it is desirable for this decoding process to be both robust to noise and stable to modelling error. In the literature, such a decoder is said to be instance optimal [12] (see Def. 1). In this paper, our goal is to characterize necessary and sufficient conditions for the existence of an instance optimal decoder for the problem (1). Note that we will not study the existence of efficient algorithms to solve (1), which is another significant achievement of compressive sensing [15], but only the preservation of information of the encoding process.

In [12, 8], the authors outlined the crucial role played by the Restricted Isometry Property (RIP), and more precisely by the Lower-RIP (LRIP), for the existence of instance optimal decoders in the linear case. In this paper, we extend these results to the non-linear case and to non-uniform probabilistic recovery.

#### Outline of the paper.

The structure of the paper is as follows. In Section 2 we briefly outline some relevant references, keeping in mind that the field is large and we do not pretend to be exhaustive in this short paper. In Section 3 we state our main results relating instance optimal decoders (Def. 1) and the LRIP (Def. 2). In Section 4 we outline how one might typically prove the LRIP by extending a classical proof [2], and illustrate it on a simple example.

## 2 Related Work

#### Classical Compressive Sensing: the linear case.

Instance optimal decoding and the RIP are well-known notions in compressive sensing [11, 10, 13]. We refer to the book of Foucart and Rauhut [15] for a review of the field, in particular to Chapters 6 and 11 for the topics of interest here. The interplay between the two notions was in particular studied in [12] in the finite-dimensional case. These results were later extended to more general models in [21], and to any linear measurement operators in [8], which is the main inspiration behind the present work.

#### Non-uniform decoding.

In compressed sensing the measurement operator is often designed at random. Typical recovery results are therefore given with high probability. One can then distinguish between uniform guarantees, meaning that with high probability on the draw of $\Psi$ all signals close to $S$ can be stably recovered, and non-uniform guarantees, meaning that for one fixed signal close to $S$, with high probability on the draw of $\Psi$, the decoding is successful. In [12] the authors study non-uniform instance optimality, but only in the light of the classical uniform RIP. In this paper we introduce a non-uniform version of the LRIP and prove that it is sufficient for non-uniform instance optimality.

#### Non-linear inverse problems.

Non-linear inverse problems can be found in many areas of signal processing, see e.g. [14] for a review of some applications. They have also been considered by the compressive sensing community, often when quantization occurs [19, 7], in the so-called “1-bit” compressed sensing line of work [6]. Another focus is the development of efficient algorithms inspired by the linear case [3, 4]. In [4], the author assumes that a locally linearized version of $\Psi$ satisfies the classical RIP. In this paper we consider a different, “fully” non-linear RIP. We note that one notion does not imply the other.

## 3 Equivalence between IOP and LRIP

In this section we state our main results on instance optimal decoders and the LRIP. We distinguish the case where the operator $\Psi$ is deterministic, or, equivalently, when it is random but one seeks so-called uniform recovery guarantees, and the case of non-uniform recovery.

### 3.1 Deterministic operator

Recall that we consider a pseudometric set $(E, d_E)$, a seminormed vector space $(F, \|\cdot\|_F)$, and measurements of the form $y = \Psi(x^\star) + e$ where $\Psi: E \to F$. We consider a model set $S \subset E$, and are interested in characterizing the existence of a good decoder $\Delta$ that takes $\Psi$ and $y$ as inputs and returns a signal that is close to $x^\star$. We want this decoder to be stable to modelling error and robust to noise, which is characterized by the notion of instance optimality.

###### Definition 1 (Instance Optimality Property (IOP)).

A decoder $\Delta$ satisfies the Instance Optimality Property for the operator $\Psi$ and model $S$ with constants $A, B > 0$, pseudometrics $d_E, d'_E$ on $E$ and error $\eta \geq 0$ if: for all signals $x^\star \in E$ and noise $e \in F$, denoting $\tilde{x} = \Delta(\Psi, \Psi(x^\star) + e)$ the recovered signal, it holds that:

$$d_E(x^\star, \tilde{x}) \leq A\, d'_E(x^\star, S) + B \|e\|_F + \eta \tag{2}$$

where $d'_E(x^\star, S) = \inf_{x \in S} d'_E(x^\star, x)$.

As indicated by the r.h.s. of (2), the decoding error between the true signal and the recovered one is bounded by the amplitude of the noise and the distance from $x^\star$ to the model set, which indicates how well $x^\star$ is modelled by $S$. An instance optimal decoder is therefore robust to noise and stable even if $x^\star$ is not exactly in the model set. We also include a possible fixed additional error $\eta$, which may be unavoidable in some cases (due to algorithmic precision for instance). Ideally, one has $\eta = 0$.

Let us now turn to the proposed non-linear version of the LRIP. As described in [8], the LRIP is just one side of the classical RIP, which states that the measurement operator approximately preserves distances between elements of the model $S$.

###### Definition 2 (Lower Restricted Isometry Property (LRIP)).

The operator $\Psi$ satisfies the Lower Restricted Isometry Property for the model $S$ with constant $\alpha > 0$, pseudometric $d_E$ and error $\eta \geq 0$ if: for all $x, x' \in S$ it holds that

$$d_E(x, x') \leq \alpha \|\Psi(x) - \Psi(x')\|_F + \eta. \tag{3}$$

The LRIP expresses the fact that $\Psi$ must not collapse two elements of the model together. Like the IOP, we allow for a possible additional fixed error $\eta$ in the LRIP. Note that this type of error is often considered when introducing quantization [7, 19]. Ideally, one has $\eta = 0$; however, in some cases it can be considerably simpler to prove that the LRIP holds with a non-zero $\eta$ [20]. The reader will note that the classical RIP is often expressed with a constant $\alpha = 1/\sqrt{1-\delta}$, where $\delta$ is as small as possible.
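As a concrete reading of Definition 2, the sketch below computes, by brute force over a finite model, the smallest constant $\alpha$ for which inequality (3) holds with a given error $\eta$. The toy model, the maps `psi` and `psi_collapsing`, and the function name `empirical_lrip_constant` are illustrative assumptions, not taken from the paper; the Euclidean distance stands in for both $d_E$ and $\|\cdot\|_F$.

```python
import math
from itertools import combinations


def empirical_lrip_constant(model, psi, eta=0.0):
    """Smallest alpha such that, for every pair (x, x') of the finite model,
    dist(x, x') <= alpha * dist(psi(x), psi(x')) + eta.
    Returns math.inf when psi collapses a pair that is more than eta apart."""
    alpha = 0.0
    for x, xp in combinations(model, 2):
        excess = math.dist(x, xp) - eta       # part of d_E not absorbed by eta
        if excess <= 0.0:
            continue                          # inequality holds for any alpha
        measured = math.dist(psi(x), psi(xp))
        if measured == 0.0:
            return math.inf                   # collapsed pair: no LRIP
        alpha = max(alpha, excess / measured)
    return alpha


# Hypothetical toy model: the origin and the three canonical basis vectors of R^3.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]


def psi(x):
    """Illustrative non-linear map from R^3 to R^2."""
    return (x[0] ** 3 + x[1], x[1] + x[2] ** 3)


def psi_collapsing(x):
    """Map that sends all three basis vectors to the same measurement."""
    return (x[0] + x[1] + x[2],)


alpha = empirical_lrip_constant(model, psi)
alpha_bad = empirical_lrip_constant(model, psi_collapsing)
print(alpha, alpha_bad)
```

A finite constant means no two model elements are merged by the map, while an infinite one flags a pair that becomes indistinguishable after measurement.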

We now state our main result. The proof, rather direct, can be found in Appendix A.

###### Theorem 1 (Equivalence between IOP and LRIP.).

Consider an operator $\Psi: E \to F$ and a model $S \subset E$.

1. If there exists a decoder $\Delta$ which satisfies the Instance Optimality Property for $\Psi$ and $S$ with constants $A, B > 0$, pseudometrics $d_E, d'_E$ and error $\eta \geq 0$, then the operator $\Psi$ satisfies the LRIP for $S$ with constant $\alpha = B$, pseudometric $d_E$ and error $2\eta$.

2. If the operator $\Psi$ satisfies the LRIP for the model $S$ with constant $\alpha > 0$, pseudometric $d_E$ and error $\eta \geq 0$, then the decoder defined as

$$\Delta(\Psi, y) = \arg\min_{x \in S} \|\Psi(x) - y\|_F \tag{4}$$

satisfies the Instance Optimality Property for the operator $\Psi$ and model $S$ with constants $A = 1$ (on the distance to the model) and $B = 2\alpha$ (on the noise), pseudometrics $d_E$ and $d'_E$ where $d'_E$ is defined by $d'_E(x, x') = d_E(x, x') + 2\alpha \|\Psi(x) - \Psi(x')\|_F$, and error $\eta$.

(In this paper we assume, for simplicity, that the minimization problem (4) has at least one solution; ties can be broken arbitrarily. When this is not the case, it is possible to consider a decoder that returns any element that approaches the infimum with a fixed precision, at the expense of having this precision in the decoding error $\eta$, as in [8].)

Theorem 1 states that if the LRIP is satisfied, then the decoder that returns the element in the model that best matches the measurement is instance optimal, with a special metric $d'_E$. On the other hand, if some instance optimal decoder exists, then the LRIP must be satisfied. In other words, when the LRIP is satisfied, then we know that a negligible amount of information is lost when encoding a signal well-modeled by $S$. Conversely, if the LRIP is not satisfied, one has no hope of deriving an instance optimal decoder.
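To make part 2 of Theorem 1 concrete, the following sketch implements the decoder (4) by exhaustive search over a finite model and checks, on random signals, the bound $d_E(x^\star, \tilde{x}) \leq d'_E(x^\star, S) + 2\alpha\|e\|_F + \eta$ obtained in Appendix A. The toy model, the map `psi`, the noise level and all function names are assumptions made for illustration; Euclidean distances play the roles of $d_E$ and $\|\cdot\|_F$, and $\eta = 0$.

```python
import math
import random
from itertools import combinations

# Hypothetical finite model: the origin and the canonical basis vectors of R^3.
MODEL = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]


def psi(x):
    """Illustrative non-linear measurement map from R^3 to R^2."""
    return (x[0] ** 3 + x[1], x[1] + x[2] ** 3)


def decode(y, model=MODEL):
    """Decoder (4): model element whose measurement is closest to y."""
    return min(model, key=lambda x: math.dist(psi(x), y))


# LRIP constant of psi on the finite model (error eta = 0), by brute force.
ALPHA = max(math.dist(x, xp) / math.dist(psi(x), psi(xp))
            for x, xp in combinations(MODEL, 2))


def iop_bound(x_star, e):
    """d'_E(x*, S) + 2 * alpha * ||e||, with
    d'_E(x, x') = ||x - x'|| + 2 * alpha * ||psi(x) - psi(x')||."""
    d_prime = min(math.dist(x_star, xs)
                  + 2 * ALPHA * math.dist(psi(x_star), psi(xs))
                  for xs in MODEL)
    return d_prime + 2 * ALPHA * math.hypot(*e)


rng = random.Random(0)
worst_slack = math.inf
for _ in range(1000):
    x_star = tuple(rng.uniform(-1.0, 1.0) for _ in range(3))   # arbitrary signal
    e = tuple(rng.gauss(0.0, 0.1) for _ in range(2))           # measurement noise
    y = tuple(a + b for a, b in zip(psi(x_star), e))
    x_hat = decode(y)
    worst_slack = min(worst_slack,
                      iop_bound(x_star, e) - math.dist(x_star, x_hat))

print("smallest slack in the bound:", worst_slack)
```

Exhaustive search is only viable for a tiny finite model; it is used here because the theorem is about the existence of the decoder, not about its computational cost.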

### 3.2 Random operator, from uniform recovery to non-uniform recovery

In the vast majority of the compressive sensing literature, the measurement process is drawn at random: for instance, in the finite dimensional case, it is an open problem to find deterministic matrices that satisfy the RIP with an optimal number of measurements ([15], p. 27), while on the contrary many classes of random matrices satisfy the RIP with high probability [2].

A well-studied concept is that of uniform recovery guarantees, where one shows that, with high probability on the draw of $\Psi$, the LRIP holds. It follows by Theorem 1 that there is a decoder such that, with high probability on the draw of $\Psi$, all signals from $E$ can be stably recovered. There is also a notion of non-uniform recovery, where one considers a decoder and asks whether, given an arbitrary signal $x^\star$ close to $S$, this signal is stably recovered (with high probability on the draw of $\Psi$) from $y$. In this section we introduce a non-uniform version of the LRIP, and show that it is a sufficient condition for the existence of a non-uniform instance optimal decoder. We start by discussing a notion of projection on the model.

###### Remark 1 (Approximate projection.).

As we will see, in non-uniform recovery the distance from $x^\star$ to $S$ is replaced by the distance from $x^\star$ to a particular element $P(x^\star) \in S$, where $P: E \to S$ is a “projection” function with respect to some metric $d$. In full generality, it is not guaranteed that there exists $P$ such that $d(x, P(x)) = d(x, S)$, but one can always define it such that for all $x \in E$, $d(x, P(x)) \leq d(x, S) + \epsilon$ for an arbitrarily small $\epsilon > 0$.

Let us now introduce the proposed non-uniform IOP and LRIP.

###### Definition 3 (Non-uniform IOP).

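A decoder $\Delta$ satisfies the non-uniform Instance Optimality Property for the (random) mapping $\Psi$, model $S$ and projection function $P$, with constants $A, B > 0$, pseudometrics $d_E, d'_E$, probability $1 - \rho$ and error $\eta \geq 0$ if:

$$\forall x^\star \in E,\quad \mathbb{P}_\Psi\Big[\forall e \in F,\ d_E\big(x^\star, \Delta(\Psi, \Psi(x^\star) + e)\big) \leq A\, d'_E(x^\star, x_S) + B\|e\|_F + \eta\Big] \geq 1 - \rho \tag{5}$$

where $P(x^\star)$ is denoted by $x_S$.

Note that in this definition the IOP is non-uniform with respect to the data $x^\star$ but uniform with respect to the noise $e$, meaning that with high probability on the draw of $\Psi$ the (fixed) data can be stably recovered from a measurement vector with any additive noise.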

###### Definition 4 (Non-uniform LRIP).

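The operator $\Psi$ satisfies the non-uniform LRIP for the model $S$ with constant $\alpha > 0$, pseudometric $d_E$, probability $1 - \rho$ and error $\eta \geq 0$ if:

$$\forall x_S \in S,\quad \mathbb{P}_\Psi\Big[\forall x \in S,\ d_E(x_S, x) \leq \alpha\|\Psi(x_S) - \Psi(x)\|_F + \eta\Big] \geq 1 - \rho. \tag{6}$$

This LRIP is in fact “semi”-uniform: it is non-uniform with respect to one element $x_S$ but uniform with respect to the other element $x$. A “fully” non-uniform LRIP would, in fact, be almost always valid for many operators (see Section 4), and thus probably too weak to yield recovery guarantees.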

Before stating our result, let us remark that the definition of the metric $d'_E$ in Theorem 1 involves the operator $\Psi$, which is potentially problematic when it is random. To solve this, [12] introduces a so-called Boundedness Property (BP) in the classical sparse setting in finite dimension. We extend this notion to the context considered here.

###### Definition 5 (Boundedness property (BP)).

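The operator $\Psi$ satisfies the Boundedness Property with constant $\beta > 0$, pseudometric $d_G$ and probability $1 - \rho$ if:

$$\forall x \in E,\ \forall x_S \in S,\quad \mathbb{P}_\Psi\big[\|\Psi(x) - \Psi(x_S)\|_F \leq \beta\, d_G(x, x_S)\big] \geq 1 - \rho. \tag{7}$$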

We then have the following result, proved in Appendix B.

###### Theorem 2 (The non-uniform LRIP and BP implies the non-uniform IOP).

Consider a random operator $\Psi$. Assume that:

1. the operator $\Psi$ satisfies the non-uniform LRIP for the model $S$ with constant $\alpha$, pseudometric $d_E$, probability $1 - \rho_1$ and error $\eta$;

2. the operator $\Psi$ satisfies the non-uniform Boundedness Property with constant $\beta$, pseudometric $d_G$ and probability $1 - \rho_2$.

Then, the decoder $\Delta$ defined by (4) satisfies the non-uniform Instance Optimality Property for the operator $\Psi$, model $S$ and any projection function with constants $A = 1$ (on the distance to the model), $B = 2\alpha$ (on the noise), pseudometrics $d_E$ and $d'_E = d_E + 2\alpha\beta\, d_G$, probability $1 - \rho_1 - \rho_2$ and error $\eta$.

Compared with the result in [12], which proves non-uniform recovery under a uniform LRIP and the BP in the finite-dimensional case, our result holds under weaker hypotheses.

For the converse implication of Theorem 2, unlike Theorem 1, the non-uniform IOP does not seem to directly imply the non-uniform LRIP.

## 4 A typical proof of the LRIP

In this section, we outline a possible strategy to prove the LRIP, inspired by the proof for random matrices in [2]. This relatively simple proof has two steps: first, a pointwise concentration result, and second, an extension by covering arguments. For a set $S$ and a metric $d$, we denote by $N(S, d, \delta)$ the minimum number of balls of radius $\delta$, with centers that belong to $S$, required to cover $S$.
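For a finite set of points, a covering with centers inside the set can be built greedily, which gives an upper bound on the covering number at the chosen radius (the minimum itself is harder to compute). The sketch below is a minimal illustration under that caveat; the function name `greedy_cover` and the example points are hypothetical.

```python
import math


def greedy_cover(points, dist, radius):
    """Return centers, chosen among `points`, such that every point lies
    within `radius` of some center. The number of centers is an upper bound
    on the covering number of the finite set `points` at scale `radius`."""
    centers = []
    for p in points:
        if all(dist(p, c) > radius for c in centers):
            centers.append(p)     # p is not covered yet: make it a center
    return centers


# Hypothetical example: 11 equispaced points of [0, 1], covered at radius 0.25.
points = [(i / 10,) for i in range(11)]
centers = greedy_cover(points, math.dist, 0.25)
print(len(centers), centers)
```

The selected centers are pairwise more than `radius` apart, so the same routine also produces a packing, which is the usual way covering and packing numbers are compared.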

### 4.1 Linear case

We start with the linear case, which follows closely the proof in [2]. We treat the uniform case (Def. 2), with no error ($\eta = 0$). Assume $E$ and $F$ are both vector spaces, and that we have a random linear operator $\Psi: E \to F$ such that the following concentration result holds:

 (8)

for an increasing concentration function $c(t)$. Typically, the “bigger” the space $F$ is (that is, the more measurements we collect), the higher the concentration function is: often, for $m$ measurements ($F = \mathbb{R}^m$ or $\mathbb{C}^m$), classical concentration inequalities yield a concentration function $c(t)$ proportional to $m t^2$.
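The effect of the number of measurements on pointwise concentration can be observed numerically. The sketch below estimates, for a Gaussian random matrix with i.i.d. $\mathcal{N}(0, 1/m)$ entries (a standard example, chosen here as an assumption), how often $\|\Psi x\|_2$ falls below $(1-t)\|x\|_2$ for one fixed unit vector $x$. The function name and all parameter values are hypothetical.

```python
import math
import random


def lower_tail_frequency(m, t, dim=5, trials=1000, seed=0):
    """Fraction of draws of an m x dim Gaussian matrix Psi (entries N(0, 1/m))
    for which ||Psi x||_2 <= (1 - t) * ||x||_2, for one fixed unit vector x."""
    rng = random.Random(seed)
    x = [1.0 / math.sqrt(dim)] * dim          # fixed unit-norm signal
    sigma = 1.0 / math.sqrt(m)
    failures = 0
    for _ in range(trials):
        sq_norm = 0.0
        for _ in range(m):                    # one row of Psi at a time
            row = [rng.gauss(0.0, sigma) for _ in range(dim)]
            sq_norm += sum(r * v for r, v in zip(row, x)) ** 2
        if math.sqrt(sq_norm) <= 1.0 - t:
            failures += 1
    return failures / trials


freq_small = lower_tail_frequency(m=8, t=0.3)
freq_large = lower_tail_frequency(m=64, t=0.3)
print(freq_small, freq_large)
```

The empirical failure frequency drops sharply when going from 8 to 64 measurements, in line with a concentration function that grows with $m$.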

This property proves a “pointwise” (or “fully” non-uniform) LRIP: for two given $x, x' \in S$, the quantity $\|\Psi(x) - \Psi(x')\|_F$ is a good approximation of $d_E(x, x')$ with high probability. We now invert the quantifiers by covering arguments. From the formulation of the concentration property (8) we see that a particular set of interest is the so-called normalized secant set [22]:

$$\mathcal{S} = \left\{ \frac{x - x'}{d_E(x, x')} \;\middle|\; x, x' \in S,\ d_E(x, x') > 0 \right\} \subset E \tag{9}$$

The proof of the following result is in Appendix C.

###### Proposition 1.

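Consider $0 < t < 1$. Assume that the concentration property (8) holds, that the normalized secant set $\mathcal{S}$ has finite covering numbers, and that for any draw of $\Psi$ and any $s, s' \in \mathcal{S}$ we have $\|\Psi(s - s')\|_F \leq C\, d_E(s, s')$ for some constant $C > 0$. Set $\delta = t/(2C)$. Define the probability of failure

$$\rho = N(\mathcal{S}, d_E, \delta) \cdot e^{-c(t/2)}.$$

Then the operator $\Psi$ satisfies the uniform LRIP for the model $S$ with constant $\alpha = 1/(1-t)$, metric $d_E$, probability $1 - \rho$ and error $\eta = 0$.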
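To get a feel for the failure probability in Proposition 1, the sketch below evaluates $\rho = N \cdot e^{-c(t/2)}$ and computes the smallest number of measurements $m$ for which $\rho$ falls below a target, under the purely illustrative assumption that the concentration function has the form $c(t) = c_0\, m\, t^2$. The constant `c0`, the covering number and the function names are hypothetical and not taken from the paper.

```python
import math


def failure_probability(log_cover, m, t, c0=0.25):
    """rho = N * exp(-c(t/2)) with the assumed form c(t) = c0 * m * t**2.
    `log_cover` is the natural logarithm of the covering number N."""
    return math.exp(log_cover - c0 * m * (t / 2.0) ** 2)


def measurements_needed(log_cover, t, rho_target, c0=0.25):
    """Smallest integer m with failure_probability(log_cover, m, t, c0) <= rho_target."""
    m = math.ceil((log_cover - math.log(rho_target)) / (c0 * (t / 2.0) ** 2))
    return max(m, 1)


log_cover = 50.0 * math.log(10.0)     # hypothetical covering number N = 10**50
m_min = measurements_needed(log_cover, t=0.5, rho_target=0.01)
print(m_min, failure_probability(log_cover, m_min, t=0.5))
```

Under this assumed form, the required $m$ scales like $t^{-2}\,(\log N + \log(1/\rho))$: the covering number enters only through its logarithm.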

This proof of the RIP has been used for instance in classical compressive sensing [2] or for random linear embeddings of Radon measures [18]. It is also used in a constructive manner to build appropriate operators in [22].

### 4.2 Non-linear case

It is possible to adapt the previous proof to non-linear operators, by distinguishing the case where $x$ and $x'$ are “close”, for which we resort to a linearization of $\Psi$ and properties of the normalized secant set, and the case where $x$ and $x'$ are distant from each other, for which we directly use the covering numbers of the model. We treat here the non-uniform case (Def. 4).

Assume again that $E$ and $F$ are vector spaces. Assume that we have a random map $\Psi$ such that the concentration property (8) holds. Next, suppose that there exist constants $C_1, C_2, C_3 > 0$ such that for any fixed $x_S \in S$:

1. for all and any draw of ,

2. the model has finite covering numbers with respect to , and in particular it also has finite diameter .

3. for all , the following version of the normalized secant set has finite covering numbers.

4. for all such that , and any draw of , we have where is a linear map such that for all , .

The following result is proved in Appendix D.

###### Proposition 2.

Assume that the properties above are satisfied. Set , and . Define the probability of failure

Then the operator satisfies the non-uniform LRIP for the model with constant , metric , probability and error .

### 4.3 Illustration

In this section we illustrate the non-linear LRIP on a simple example: that of recovering a vector from a random features embedding, which is a random map initially designed for kernel approximation, see [23, 24]. Such a random embedding can be seen as a one-layer neural network with random weights, for which invertibility and preservation of information have recently been topics of interest [16, 17].

Consider and define to be a Union of Subspaces, which is a popular model in compressed sensing [5], with controlled norm: where and each is an -dimensional subspace of . As in [18], we choose a sampling that is a reweighted version of the original Fourier sampling for kernel approximation [23], for the Gaussian kernel with bandwidth . It is defined as follows: for a number of measurements , the measurements space is , the random map is defined as where and are drawn from (where is a Gaussian), with and . One can verify that , is a valid probability distribution. The metric is here the kernel metric associated to the Gaussian kernel with bandwidth :

$$d_E(x, x') := \sqrt{2\left(1 - \exp\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right)\right)}$$
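The link between a random features map and the kernel metric can be checked numerically. The sketch below uses the plain, unweighted random Fourier features of [23], with frequencies drawn from $\mathcal{N}(0, \sigma^{-2} I)$, and not the reweighted sampling considered in this section; it is only meant to illustrate how $\|\Psi(x) - \Psi(x')\|_2$ tracks $d_E(x, x')$ when $m$ is large. It assumes the square-root form $d_E(x, x') = \sqrt{2(1 - k(x, x'))}$ of the kernel metric, where $k$ is the Gaussian kernel; all function names and parameter values are hypothetical.

```python
import cmath
import math
import random


def kernel_metric(x, xp, sigma):
    """Kernel metric of the Gaussian kernel with bandwidth sigma (square-root form)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.sqrt(2.0 * (1.0 - math.exp(-sq / (2.0 * sigma ** 2))))


def draw_frequencies(m, dim, sigma, seed=0):
    """m frequency vectors with i.i.d. N(0, 1/sigma**2) coordinates."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)] for _ in range(m)]


def random_features(x, freqs):
    """Psi(x) = m**-0.5 * (exp(i * <omega_j, x>))_j, a vector of C^m."""
    scale = 1.0 / math.sqrt(len(freqs))
    return [scale * cmath.exp(1j * sum(w * v for w, v in zip(omega, x)))
            for omega in freqs]


def feature_distance(x, xp, freqs):
    """Euclidean distance between the random features of x and xp."""
    fx, fxp = random_features(x, freqs), random_features(xp, freqs)
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fx, fxp)))


sigma = 1.0
x, xp = (0.3, -0.2, 0.5), (-0.4, 0.1, 0.0)
freqs = draw_frequencies(m=4000, dim=3, sigma=sigma)
ratio = feature_distance(x, xp, freqs) / kernel_metric(x, xp, sigma)
print(ratio)
```

With this normalization the squared feature distance is an unbiased estimate of $d_E(x, x')^2$, so the printed ratio should be close to 1 for large $m$.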

We have the following result, whose proof is in Appendix E.

###### Proposition 3.

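If the number of measurements $m$ is such that

$$m \gtrsim t^{-2} \cdot \left( s \cdot \log\left(\frac{Md}{\sigma t}\right) + \log(N) + \log\left(\frac{1}{\rho}\right) \right)$$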

Then the operator satisfies the non-uniform LRIP for the model with constant , metric , probability and error , as well as the BP (Def. 5) with constant , metric and probability .

Hence, using Theorem 2, we have shown that a reduced number of random features preserves all information when encoding signals that are (in this case) well-modeled by a union of subspaces, with respect to the associated kernel metric. This preliminary analysis may have consequences for classical random feature bounds in a learning context [1, 25].

## 5 Conclusion

In this paper we generalized a classical property, the equivalence between the existence of an instance-optimal decoder and the LRIP, to non-linear inverse problems with possible quantization error or limited algorithmic precision, and data that live in any pseudometric set. We also formulated a version of the result for non-uniform recovery, by introducing a non-uniform version of the LRIP. To further illustrate this principle, we provided a typical proof strategy for the LRIP that one might use in practice, and gave an example of non-linear LRIP on random features for kernel approximation.

Although relatively simple in their proofs, these results may have important consequences for a large class of linear or non-linear inverse problems, where one seeks stable and robust recovery. Naturally, once the LRIP guarantees (or disproves) the existence of an instance optimal decoder, an outstanding question is the existence of efficient algorithms that provide equivalent guarantees, as in classical compressive sensing [15] or some of its recent extensions [26].

## References

• [1] Francis Bach. On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions. Journal of Machine Learning Research, pages 1–38, 2017.
• [2] Richard Baraniuk, Mark Davenport, Ronald A Devore, and Michael Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.
• [3] Amir Beck and Yonina C. Eldar. Sparsity Constrained Nonlinear Optimization: Optimality Conditions and Algorithms. SIAM Journal on Optimization, 23(3):1480–1509, 2013.
• [4] Thomas Blumensath. Compressed sensing with nonlinear observations and related nonlinear optimization problems. Information Theory, IEEE Transactions on, pages 1–19, 2013.
• [5] Thomas Blumensath and Mike E. Davies. Sampling theorems for signals from the union of finite-dimensional linear subspaces. IEEE Transactions on Information Theory, 55(4):1872–1882, 2009.
• [6] Petros T Boufounos and Richard G Baraniuk. 1-Bit Compressive Sensing. In Information Sciences and Systems, pages 16–21, 2008.
• [7] Petros T. Boufounos, Shantanu Rane, and Hassan Mansour. Representation and Coding of Signal Geometry. Information and Inference: a Journal of the IMA, 6(4):349–388, 2017.
• [8] Anthony Bourrier, Mike E. Davies, Tomer Peleg, and Rémi Gribonval. Fundamental performance limits for ideal decoders in high-dimensional linear inverse problems. IEEE Transactions on Information Theory, 60(12):7928–7946, 2014.
• [9] Emmanuel J Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9-10):1–4, 2008.
• [10] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):480–509, 2006.
• [11] Emmanuel J. Candès and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
• [12] Albert Cohen, Wolfgang Dahmen, and Ronald A Devore. Compressed sensing and best k-term approximation. Journal of the American Mathematical Society, 22(1):211–231, 2009.
• [13] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
• [14] Heinz W Engl and Philipp Kügler. Nonlinear inverse problems: theoretical aspects and some industrial applications. In Multidisciplinary Methods for Analysis Optimization and Control of Complex Systems, volume 06, pages 3–47. 2005.
• [15] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Springer New York, NY, 2013.
• [16] Anna C. Gilbert, Yi Zhang, Kibok Lee, Yuting Zhang, and Honglak Lee. Towards understanding the invertibility of convolutional neural networks. IJCAI International Joint Conference on Artificial Intelligence, pages 1703–1710, 2017.
• [17] Raja Giryes, Guillermo Sapiro, and Alex M Bronstein. Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? IEEE Transactions on Signal Processing, 64(13):3444–3457, 2016.
• [18] Rémi Gribonval, Gilles Blanchard, Nicolas Keriven, and Yann Traonmilin. Compressive Statistical Learning with Random Feature Moments. arXiv:1706.07180, 2017.
• [19] Laurent Jacques. Small width, low distortions: quasi-isometric embeddings with quantized sub-Gaussian random projections. pages 1–26, 2015.
• [20] Nicolas Keriven, Anthony Bourrier, Rémi Gribonval, and Patrick Pérez. Sketching for Large-Scale Learning of Mixture Models. Information and Inference: A Journal of the IMA, pages 1–62, 2017.
• [21] Tomer Peleg, Rémi Gribonval, and Mike E. Davies. Compressed Sensing and Best Approximation from Unions of Subspaces: Beyond Dictionaries. European Signal Processing Conference (EUSIPCO), 2013.
• [22] Gilles Puy, Mike E. Davies, and Rémi Gribonval. Linear embeddings of low-dimensional subsets of a Hilbert space to $\mathbb{R}^m$. In European Signal Processing Conference (EUSIPCO), pages 469–473, 2015.
• [23] Ali Rahimi and Benjamin Recht. Random Features for Large Scale Kernel Machines. Advances in Neural Information Processing Systems (NIPS), 2007.
• [24] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Advances in Neural Information Processing Systems (NIPS), 2009.
• [25] Alessandro Rudi and Lorenzo Rosasco. Generalization Properties of Learning with Random Features. In Advances in Neural Information Processing Systems (NIPS), 2017.
• [26] Yann Traonmilin and Rémi Gribonval. Stable recovery of low-dimensional cones in Hilbert spaces: One RIP to rule them all. Applied and Computational Harmonic Analysis, 2016.

## Appendix A Proof of Theorem 1

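1. Consider $x, x' \in S$. By the triangle inequality we have

$$d_E(x, x') \leq d_E\big(x, \Delta(\Psi, \Psi(x'))\big) + d_E\big(\Delta(\Psi, \Psi(x')), x'\big).$$

Then, by applying the Instance Optimality Property to the signal $x$ with noise $e = \Psi(x') - \Psi(x)$, and using that $x$ belongs to the model, we get $d_E\big(x, \Delta(\Psi, \Psi(x'))\big) \leq B\|\Psi(x) - \Psi(x')\|_F + \eta$, where $B$ is the constant multiplying the noise term in the IOP. By applying again the Instance Optimality Property, this time to the signal $x'$ with zero noise, it holds that $d_E\big(\Delta(\Psi, \Psi(x')), x'\big) \leq \eta$, hence the result.

2. Consider any signal $x^\star \in E$ and noise $e \in F$, denote $y = \Psi(x^\star) + e$ and $\tilde{x} = \Delta(\Psi, y)$. Let $x_S \in S$ be any element of the model. We have:

$$\begin{aligned}
d_E(x^\star, \tilde{x}) &\leq d_E(x^\star, x_S) + d_E(x_S, \tilde{x}) \\
&\overset{\text{LRIP}}{\leq} d_E(x^\star, x_S) + \alpha\|\Psi(x_S) - \Psi(\tilde{x})\|_F + \eta \\
&\leq d_E(x^\star, x_S) + \alpha\|\Psi(x_S) - y\|_F + \alpha\|y - \Psi(\tilde{x})\|_F + \eta.
\end{aligned}$$

By definition of the decoder (4) we have $\|\Psi(\tilde{x}) - y\|_F \leq \|\Psi(x_S) - y\|_F$ and therefore

$$\begin{aligned}
d_E(x^\star, \tilde{x}) &\leq d_E(x^\star, x_S) + 2\alpha\|\Psi(x_S) - y\|_F + \eta \\
&\leq d_E(x^\star, x_S) + 2\alpha\|\Psi(x_S) - \Psi(x^\star)\|_F + 2\alpha\|\Psi(x^\star) - y\|_F + \eta \\
&\leq d'_E(x^\star, x_S) + 2\alpha\|e\|_F + \eta
\end{aligned}$$

where $d'_E(x, x') = d_E(x, x') + 2\alpha\|\Psi(x) - \Psi(x')\|_F$. Since the result is valid for all $x_S \in S$, we can take the infimum of $d'_E(x^\star, x_S)$ with respect to $x_S$.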

## Appendix B Proof of Theorem 2

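Let $x^\star \in E$ be a fixed signal and denote by $x_S = P(x^\star)$ its projection on the model. We write $1 - \rho_1$ and $1 - \rho_2$ for the probabilities with which the non-uniform LRIP and the Boundedness Property hold, respectively.

Applying the non-uniform LRIP, with probability at least $1 - \rho_1$ on the draw of the operator $\Psi$ we have

$$\forall x \in S,\quad d_E(x_S, x) \leq \alpha\|\Psi(x_S) - \Psi(x)\|_F + \eta. \tag{B.1}$$

In the same fashion, applying the non-uniform Boundedness Property, with probability at least $1 - \rho_2$ on the draw of the operator $\Psi$ we have

$$\|\Psi(x^\star) - \Psi(x_S)\|_F \leq \beta\, d_G(x^\star, x_S). \tag{B.2}$$

Therefore, by a union bound, with probability at least $1 - \rho_1 - \rho_2$, both (B.1) and (B.2) are satisfied. If this is the case, for any noise vector $e \in F$, denoting $y = \Psi(x^\star) + e$ and $\tilde{x} = \Delta(\Psi, y)$, it holds that:

$$\begin{aligned}
d_E(x^\star, \tilde{x}) &\leq d_E(x^\star, x_S) + d_E(x_S, \tilde{x}) \\
&\overset{\text{(B.1)}}{\leq} d_E(x^\star, x_S) + \alpha\|\Psi(x_S) - \Psi(\tilde{x})\|_F + \eta \\
&\leq d_E(x^\star, x_S) + \alpha\|\Psi(x_S) - y\|_F + \alpha\|y - \Psi(\tilde{x})\|_F + \eta.
\end{aligned}$$

Once again by definition of the decoder (4) we have $\|\Psi(\tilde{x}) - y\|_F \leq \|\Psi(x_S) - y\|_F$ and therefore

$$\begin{aligned}
d_E(x^\star, \tilde{x}) &\leq d_E(x^\star, x_S) + 2\alpha\|\Psi(x_S) - y\|_F + \eta \\
&\leq d_E(x^\star, x_S) + 2\alpha\|\Psi(x_S) - \Psi(x^\star)\|_F + 2\alpha\|\Psi(x^\star) - y\|_F + \eta \\
&\leq d'_E(x^\star, x_S) + 2\alpha\|e\|_F + \eta
\end{aligned}$$

with $d'_E = d_E + 2\alpha\beta\, d_G$, by applying (B.2), which is the desired result.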

## Appendix C Proof of Proposition 1

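From the definition of the normalized secant set $\mathcal{S}$, our goal is to prove that with high probability on the draw of $\Psi$, for all $s \in \mathcal{S}$ we have $\|\Psi s\|_F \geq 1 - t$.

Let $\delta > 0$ be a small constant whose value we shall define later and $N = N(\mathcal{S}, d_E, \delta)$ be the covering number of the normalized secant set. Let $s_1, \ldots, s_N$ be a $\delta$-covering of the normalized secant set $\mathcal{S}$.

By the concentration property and a union bound, it holds that, with probability at least $1 - N e^{-c(t/2)}$, for all $i = 1, \ldots, N$ we have

$$\|\Psi s_i\|_F \geq 1 - t/2. \tag{C.1}$$

Now, given any element $s$ of the normalized secant set $\mathcal{S}$, one can find an element $s_i$ of the covering such that $d_E(s, s_i) \leq \delta$. Assuming (C.1) holds, we have

$$\|\Psi s\|_F \geq \|\Psi s_i\|_F - \|\Psi(s - s_i)\|_F \geq 1 - \frac{t}{2} - C\delta$$

and therefore by choosing $\delta = t/(2C)$ we obtain the desired result.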

## Appendix D Proof of Proposition 2

Fix any . Let be small constants whose values we shall define later, such that . Define the model with a ball around removed, and , . Let and be, respectively, a -covering of and a -covering of , with defined such that and .

By the concentration property, it holds that, with probability at least , for all or for all indices , we have

$$\|\Psi(x_S) - \Psi(\tilde{x})\|_F \geq (1 - t/2) \cdot d_E(x_S, \tilde{x}). \tag{D.1}$$

Our goal is to extend this property to any element in the model .

Let be any element of the model. We distinguish two cases. If , , we consider an element of the covering of such that . We have:

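$$\begin{aligned}
\left\|\frac{\Psi(x_S) - \Psi(x)}{d_E(x_S, x)} - \frac{\Psi(x_S) - \Psi(x_i)}{d_E(x_S, x_i)}\right\|_F
&\leq \frac{\|\Psi(x) - \Psi(x_i)\|_F}{d_E(x_S, x)} + \|\Psi(x_S) - \Psi(x_i)\|_F \left|\frac{1}{d_E(x_S, x)} - \frac{1}{d_E(x_S, x_i)}\right| \\
&\overset{(i),(ii)}{\leq} \frac{C_1 \delta}{\varepsilon} + C_1 M_S \frac{d_E(x, x_i)}{d_E(x_S, x)\, d_E(x_S, x_i)} \\
&\leq \frac{C_1 \delta}{\varepsilon}\left(1 + \frac{M_S}{\varepsilon}\right)
\end{aligned}$$

and therefore

$$\left\|\frac{\Psi(x_S) - \Psi(x)}{d_E(x_S, x)}\right\|_F \geq 1 - t/2 - \frac{C_1 \delta}{\varepsilon}\left(1 + \frac{M_S}{\varepsilon}\right) \tag{D.2}$$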

Now, when , we define and note that . We approximate it by an element of the covering of the normalized secant set (meaning that ) that verifies . Then, we have

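$$\left\|\frac{\Psi(x_S) - \Psi(x)}{d_E(x_S, x)} - \frac{\Psi(x_S) - \Psi(x'_i)}{d_E(x_S, x'_i)}\right\|_F \overset{(iv)}{\leq} C_2\, d_E(x_S, x) + C_3 \delta' + C_2\, d_E(x_S, x'_i) \leq 2 C_2 \varepsilon + C_3 \delta'$$

and therefore

$$\left\|\frac{\Psi(x_S) - \Psi(x)}{d_E(x_S, x)}\right\|_F \geq 1 - t/2 - (2 C_2 \varepsilon + C_3 \delta') \tag{D.3}$$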

To conclude, we set , , to obtain the desired result.

## Appendix E Proof of Proposition 3

#### Concentration property.

The concentration result is based on the fact that by definition of the random features. Using simple function studies and a Bernstein concentration inequality with a control on moments of all orders, it is possible to show (see [18], eq. (160) then Prop. 6.11) that the concentration result (8) is valid with (we do not reproduce the detailed proof here for brevity).

Since this Bernstein inequality is in fact valid for all vectors (not necessarily in the model), the Boundedness Property is also satisfied with constant , metric and probability .

We now check hypotheses in Proposition 2. We are going to repeatedly use the fact that for , since , we have

$$\ell \|x - x'\|_2 \leq d_E(x, x') \leq L \|x - x'\|_2 \tag{E.1}$$

where and .

#### (i)

Using a first-order Taylor expansion we have hence by (E.1) hypothesis is valid with .

#### (ii)

It is immediate that . Then, using the well-known fact that for each we have , by a union bound and (E.1) we have .

#### (iii)

Similar to the model, for all the normalized secant set is included in a union of subspaces , with norm controlled by by (E.1), and where (sum of subspaces). Hence, since , by a union bound and (E.1) we have .

#### (iv)

Finally, by a Taylor expansion we have

 φω(x)−φω(x′)=iω⊤f(ω)(x−x′)eiω⊤x−12⋅(ω⊤(x−x′))2f(ω)eiω⊤x′′

where . Hence hypothesis is satisfied with and , which concludes the proof.
