Lipschitz Learning for Signal Recovery

Hong Jiang, Jong-Hoon Ahn and Xiaoyang Wang
Nokia Bell Labs
Murray Hill, NJ 07974
{hong.jiang,jong_hoon.ahn,xiaoyang.wang}@nokia-bell-labs.com
Abstract

We consider the recovery of signals from their observations, which are samples of a transform of the signals rather than the signals themselves, by using machine learning (ML). We will develop a theoretical framework to characterize the signals that can be robustly recovered from their observations by an ML algorithm, and establish a Lipschitz condition on signals and observations that is both necessary and sufficient for the existence of a robust recovery. We will compare the Lipschitz condition with the well-known restricted isometry property of the sparse recovery of compressive sensing, and show the former is more general and less restrictive. For linear observations, our work also suggests an ML method in which the output space is reduced to the lowest possible dimension.

1 Introduction

In many applications, a signal is available only through observations that are samples of a transform of the signal, rather than samples of the signal itself. Examples are compressive sensing Candès (2006); Candès and Tao (2005), in which a signal undergoes a dimension-reducing linear transform; wireless communications, in which a signal undergoes a linear or nonlinear channel transform Kim and Konstantinou (2001); Jiang et al. (2015); and computational imaging, in which the acquired data result from a light field passing through a transform due to optical devices Duarte et al. (2008); Sun et al. (2013). Recently, machine learning (ML) algorithms have demonstrated superior performance in recovering signals from observations Goodfellow et al. (2014); Mousavi and Baraniuk (2017); Kulkarni et al. (2016). To recover a signal from its observation, an ML model, such as a convolutional neural network, is trained so that the recovered signal is the output of the model when its observation is used as the input. Despite the great success of ML in recovering signals from observations and the development of ML theory in general Valiant (1984); Vapnik (2000); Blumer et al. (1989); Kearns and Vazirani (1994); Abu-Mostafa et al. (2012), many aspects of ML recovery still lack theoretical understanding.

This work addresses the question of under what conditions a signal can be recovered from its observation by an ML algorithm. We develop a theoretical framework to characterize the signals that can be robustly recovered from observations. We establish a Lipschitz condition on signals and observations and show that it is both necessary and sufficient for the existence of a robust ML algorithm to recover the signals. We compare the Lipschitz condition with the restricted isometry property (RIP) Candès (2006); Candès and Tao (2005) in the sparse signal recovery of compressive sensing, and show that the former is more general and less restrictive.

The set of signals satisfying the Lipschitz condition is not unique. Since there is no restriction on the transform of the observations, there is no expectation that a given set of signals can be recovered from their observations. Instead, what is expected is that all signals with certain structure should be recovered. In our framework, the structure of the recoverable signals is precisely defined by the Lipschitz condition: each Lipschitz set is a set of structured signals that are robustly recoverable and a different set defines a different structure. A finite number of training signals can always be used to define a Lipschitz set of signals which can be robustly recovered by a trained, robust ML model.

The significance of this work is that it not only answers the theoretical question of what signals can be robustly recovered, but also suggests a practical recovery method by using singular value decomposition (SVD) for linear observations (see Theorem 3), in which the dimension of the output space of the target function is reduced to the minimum possible.

All proofs are given in the Appendix.

2 Related Work

The term “Lipschitz learning” was previously used for classification on graphs Kyng et al. (2015), in which the target functions in graph-based semi-supervised learning are Lipschitz. In this paper, the same term is used in a different context and in a broader sense: it refers to the framework of recovering signals that satisfy the Lipschitz condition. Since the recovery in this work can be achieved by a Lipschitz hypothesis, our use of the term is consistent with its previous use.

In addition to Kyng et al. (2015), existing work in v. Luxburg and Bousquet (2004); Koltchinskii (2011); Lopez-Paz et al. (2015) also studies the use of Lipschitz functions as decision functions or target functions for the classification problem. Specifically, v. Luxburg and Bousquet (2004) finds that Lipschitz functions generalize decision functions to metric spaces, and shows that several well-known algorithms are special cases of the Lipschitz classifier. Lopez-Paz et al. (2015) poses the cause-effect inference problem as a classification problem, and uses the properties of Lipschitz functions to derive a bound on the excess risk. In addition, Lipschitz functions are used in Koltchinskii (2011) for the theoretical analysis of empirical risk minimization. Our work differs from the existing work v. Luxburg and Bousquet (2004); Koltchinskii (2011); Kyng et al. (2015); Lopez-Paz et al. (2015) in the following two aspects: 1) We utilize the Lipschitz condition for the problem of general signal recovery, whereas v. Luxburg and Bousquet (2004); Koltchinskii (2011); Kyng et al. (2015); Lopez-Paz et al. (2015) utilize Lipschitz functions for the problem of classification. 2) To the best of our knowledge, no existing work utilizes the property of a Lipschitz set, which is essential in our theory of signal recovery with Lipschitz learning.

Our framework shows that the Lipschitz condition on a set of signals is equivalent to the existence of a hypothesis for the recovery of these signals. It differs from probably approximately correct (PAC) learning Valiant (1984) and statistical learning theory Vapnik (2000), which analyze the probability of successfully finding a hypothesis with low generalization error. Our work is currently concerned with the existence of a Lipschitz hypothesis; future work will address the complexity of Lipschitz learning, such as reducing the bound on the total number of training samples required, for example by using a probabilistic model in Lipschitz learning.

3 Lipschitz Learning

Problem Definition. Let $x \in \mathbb{R}^n$ be a signal and $\Phi: \mathbb{R}^n \to \mathbb{R}^m$ be an operator with $m \le n$. The observation of signal $x$ under transform $\Phi$ is $y = \Phi \circ x$, where the symbol "$\circ$" means "operates on". The operator $\Phi$ may be linear or nonlinear, and it may not be an injection even when $m = n$. The objective here is, for a given $\Phi$, to recover the signal $x$ from its observation $y$ by a machine learning algorithm. In an ML algorithm, a hypothesis is a computable function $h: \mathbb{R}^m \to \mathbb{R}^n$. A recovered signal from the observation $y$ by the hypothesis $h$ is $\hat{x} = h(y)$, with $y = \Phi \circ x$.

Since $\Phi$ may not be injective, there is no expectation that a signal $x$ can be uniquely recovered from a given observation $y$. Instead, we attempt to characterize a set of signals that can be robustly recovered from their observations by an ML algorithm. Such a characterization is tantamount to imposing a structure on signals to ensure the success of recovery. For example, in compressive sensing Candès (2006); Candès and Tao (2005), observations are the results of a singular linear transform, but it is possible to uniquely recover a set of sparse signals under certain conditions.

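Let $S \subseteq \mathbb{R}^n$ be a set of signals. For all signals in $S$ to be recovered from their observations, a necessary condition is that distinct signals have distinct observations under the operator $\Phi$:

$$\Phi \circ x_1 \neq \Phi \circ x_2 \quad \text{for all } x_1, x_2 \in S \text{ with } x_1 \neq x_2. \tag{1}$$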

Furthermore, for a recovery to be robust and resilient to noise, it is required that

(2)

Motivated by Eqs. (1) and (2), we make the following definition.

Definition 1. Given , a set is said to be if

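$$\|x_1 - x_2\| \le L \, \|\Phi \circ x_1 - \Phi \circ x_2\| \quad \text{for all signals } x_1, x_2 \text{ in the set}, \tag{3}$$

where $\Phi$ is the operator and $L > 0$ is the given constant.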

A set is said to be a Lipschitz set if there is an such that it is -Lipschitz. We denote an set by , and call the signals in a Lipschitz set the Lipschitz signals.

Note in Definition 1, is simply a notation; it doesn’t mean exists. However, when restricted on ,   does have an inverse, and its inverse is Lipschitz. An ML algorithm is to find a hypothesis to approximate it.

The Lipschitz condition in Eq. (3) is a joint condition on signals and their observations (or the operator ). It may be framed in the following two ways.

1) For a given set of signals $S$, Eq. (3) is a condition on the operator $\Phi$. It is equivalent to saying that the inverse of $\Phi$ restricted to $S$ must exist and be $L$-Lipschitz. Traditional signal recovery algorithms, such as $\ell_1$ minimization in compressive sensing, are within this framework, i.e., they attempt to recover all signals in a given structure under the assumption that the operator $\Phi$ meets certain conditions.

2) For a given operator , Eq. (3) is a condition on a set of signals to be recovered. For any operator , there is always a set satisfying Eq. (3): any singleton set. An ML algorithm may be designed to recover those signals of interest that are recoverable for the given , by properly selecting training signals to define a set of signals of interest to satisfy Eq. (3). In this context, the Lipschitz signals are the structured signals.

Example. Let operator be the continuous function defined by

(4)

The set is not Lipschitz, but , or , is an -Lipschitz set. For any , the set is not Lipschitz although is injective on it; clearly, signals in cannot be recovered reliably under noise because a small noise in the observation may cause the recovered signal to be or or . On the other hand, is a Lipschitz set for any . For example, for any , is -Lipschitz, and so is .

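Property 1. If $S$ is a finite set, i.e., $|S| < \infty$, and the operator $\Phi$ is an injection on $S$, then $S$ is $L$-Lipschitz where

$$L = \max_{x_1, x_2 \in S,\; x_1 \neq x_2} \frac{\|x_1 - x_2\|}{\|\Phi \circ x_1 - \Phi \circ x_2\|}. \tag{5}$$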

Property 2. Let be a linear operator, and   be an -Lipschitz set. Then any scaled and shifted set from   is also an -Lipschitz set. More precisely, for any , and ,

(6)

is an -Lipschitz set.

Property 1 shows that any finite set on which is injective is Lipschitz, and therefore, it can be used as a starting point to build a Lipschitz set of signals of interest. For example, a finite set of training signals may be used to define a maximal set of Lipschitz signals that includes the training signals.
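Property 1 can be checked numerically on a small example. The following is a minimal sketch, assuming the Lipschitz constant of a finite set is the largest ratio of signal distance to observation distance over distinct pairs (our reading of Property 1). The operator `sum_observation`, the set `training`, and the function names are hypothetical and chosen only for illustration.

```python
import itertools
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def lipschitz_constant(signals, observe):
    """Largest ratio ||x1 - x2|| / ||observe(x1) - observe(x2)|| over all
    distinct pairs of a finite set. Returns math.inf when two distinct
    signals share an observation, i.e. observe is not injective on the set."""
    worst = 0.0
    for x1, x2 in itertools.combinations(signals, 2):
        dx = euclidean(x1, x2)
        dy = euclidean(observe(x1), observe(x2))
        if dx == 0.0:
            continue  # identical signals impose no constraint
        if dy == 0.0:
            return math.inf
        worst = max(worst, dx / dy)
    return worst


def sum_observation(x):
    """Hypothetical observation operator from R^2 to R^1: sum of the entries."""
    return (x[0] + x[1],)


# Hypothetical finite set of signals on which sum_observation is injective.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
L = lipschitz_constant(training, sum_observation)

# Adding (0, 1) breaks injectivity: it has the same observation as (1, 0).
L_broken = lipschitz_constant(training + [(0.0, 1.0)], sum_observation)
```

In this example the operator is not injective on all of the plane, yet the three chosen signals form a Lipschitz set; adding a fourth signal with a duplicate observation makes the constant infinite.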

Definition 2. A machine learning hypothesis is said to be -Lipschitz, for , if

(7)

A hypothesis is said to be Lipschitz, or robust, if there is an such that is -Lipschitz.

Definition 3. A set is said to be labeled if every and its observation are known.

4 Characterization of Signal Recovery

In this section, we show that the Lipschitz condition on a set of signals is equivalent to the existence of a robust ML hypothesis for recovery of the signals. More precisely, we show that the Lipschitz condition Eq. (3) is both necessary and sufficient for the existence of Lipschitz hypotheses in ML signal recovery.

In the rest of this paper, we assume the observations are bounded, which is generally the case in practice. Without loss of generality, we may assume they are bounded by the unit hypercube, i.e.,

(8)

Lemma 1. Let be a finite set and labeled, and be an injection. Then there exists an -Lipschitz hypothesis such that

(9)

Lemma 1 is an application of the McShane-Whitney extension theorem McShane (1934); Whitney (1934). It provides an explicit and constructive Lipschitz hypothesis on a finite labeled set (see the proof in the Appendix). Furthermore, Eq. (9) shows that the finite set is a training set for the Lipschitz hypothesis. The training set can then be expanded to a Lipschitz set in which all signals can be recovered robustly, as will be seen in the next theorem, which shows that the Lipschitz set is sufficient for the existence of a Lipschitz hypothesis.
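The McShane-Whitney construction can be sketched in a few lines. The code below is a minimal illustration, not necessarily identical to Eq. (17): it applies the classical McShane formula $h_i(y) = \min_k \left( x_k[i] + L \|y - y_k\| \right)$ to each coordinate of the signal, assuming the labeled pairs $(y_k, x_k)$ satisfy $\|x_j - x_k\| \le L \|y_j - y_k\|$. Each coordinate of the result is $L$-Lipschitz, so the vector-valued map is at most $\sqrt{n} L$-Lipschitz in the Euclidean norm. The data and the name `mcshane_hypothesis` are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mcshane_hypothesis(observations, signals, lipschitz):
    """Return a function h with h(y_k) = x_k on the labeled pairs, built
    coordinate by coordinate as h_i(y) = min_k (x_k[i] + lipschitz * ||y - y_k||)."""
    dim = len(signals[0])

    def h(y):
        return tuple(
            min(x[i] + lipschitz * euclidean(y, yk)
                for yk, x in zip(observations, signals))
            for i in range(dim)
        )

    return h


# Hypothetical labeled training set: signals in R^2, observations in R^1.
signals = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
observations = [(0.0,), (1.0,), (3.0,)]
L = 2.0  # any constant at least sqrt(10)/2, the pairwise bound for this set

h = mcshane_hypothesis(observations, signals, L)
recovered = [h(y) for y in observations]  # reproduces the training signals
between = h((2.0,))                       # value at an unseen observation
```

The hypothesis interpolates the training pairs exactly and is defined for every observation, which is the role Lemma 1 plays in the theory.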

Theorem 1. Let be an -Lipschitz set. Then for any , there exists a finite set . If is labeled, then there exists an -Lipschitz hypothesis , such that

(i)  , for all ; (Training)

(ii) , for all . (Recovery of all -Lipschitz signals)

The factor in -Lipschitz is not necessary and can be removed; it is there only to simplify the proof, which is given in Appendix.

Theorem 1 means that if a set of signals is Lipschitz, then for any given precision, there exists a finite set of training signals so that a Lipschitz hypothesis can be trained on the finite set to recover all signals within the given precision. It guarantees that a set of Lipschitz signals can be recovered by a robust ML algorithm, to an arbitrary precision. We note that although the training set in Theorem 1 is finite in theory, it may be too large for practical purposes.
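The proof of Theorem 1 obtains the finite training set by covering the bounded observation space with small hypercubes and keeping one signal per occupied hypercube. A minimal sketch of that selection step follows, assuming observations lie in the unit hypercube and treating the cell width as a free parameter (the exact width used in the proof is not reproduced here). The names `select_training_set` and `first_coordinate` and the sample signals are hypothetical.

```python
import math


def select_training_set(signals, observe, cell_width):
    """Keep one labeled pair (signal, observation) per hypercube cell.

    Observations are assumed to lie in the unit hypercube [0, 1]^m. Cells are
    axis-aligned cubes of side cell_width; the first signal whose observation
    lands in a cell represents that cell."""
    representatives = {}
    for x in signals:
        y = observe(x)
        cell = tuple(int(math.floor(c / cell_width)) for c in y)
        representatives.setdefault(cell, (x, y))
    return list(representatives.values())


def first_coordinate(x):
    """Hypothetical observation operator from R^2 to R^1."""
    return (x[0],)


# Hypothetical signals on the diagonal segment {(t, t) : 0 <= t <= 1}.
signals = [(k / 100.0, k / 100.0) for k in range(101)]
training = select_training_set(signals, first_coordinate, cell_width=0.25)
```

Every signal then has a training signal whose observation lies in the same cell, so the observation distance to the nearest training point is below the cell diagonal; combined with the Lipschitz condition, this bounds the recovery error. The number of cells grows as the cell width shrinks, which reflects the remark that the training set may be too large in practice.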

A Lipschitz hypothesis is stronger than a continuous target function. It could be argued that a continuous target function is sufficient to provide robustness of recovery, which raises the question of whether the Lipschitz condition (3) is too strong. However, since a set of signals may be discrete, rather than continuous or connected, it is not possible to define "continuity" on such a set in the classical sense so as to guarantee robust recovery, as the Lipschitz condition (3) does. The Lipschitz set in Definition 1 is a sensible condition on a (possibly discrete) set of signals for robust recovery. More discussion of the Lipschitz condition is given in Section 5.

Next, we show that the Lipschitz set is necessary for the existence of a Lipschitz hypothesis.

Theorem 2. Let be a set. If there exists such that for any there is an -Lipschitz hypothesis , such that

(10)

then is an -Lipschitz set.

Theorem 2 says that if there are -Lipschitz hypotheses to recover a set of signals to an arbitrary precision, then the set itself must be -Lipschitz. A weaker version is given below.

Corollary 1. Let be a set. If there exist an and an -Lipschitz hypothesis such that for all , then satisfies

(11)

This result says that if a set of signals can be recovered to a certain precision by a Lipschitz hypothesis, then the set of signals is approximately Lipschitz, up to the precision of the recovery.

Theorems 1 and 2 completely characterize robust ML signal recovery: a set of signals can be robustly recovered by ML algorithms if and only if the set satisfies the Lipschitz condition (3).

For linear operators, we have a stronger version of Theorem 1 as follows.

Theorem 3. Let be linear, and be an -Lipschitz set. Then there exist matrices and . Furthermore, for any , there exists a finite set . If is labeled, then there exists an -Lipschitz hypothesis , such that the mapping defined by satisfies

(i)   , for all ; (Training)

(ii)  , for all ; (Recovery of all -Lipschitz signals)

(iii) , for all . (Recovered signals match the observations)

The significance of Theorem 3 as compared to Theorem 1 is twofold. First, the output space of the hypothesis in Theorem 3 has lower dimension than that of Theorem 1: $n - m$ in Theorem 3 vs. $n$ in Theorem 1, where $n$ is the signal dimension and $m$ is the number of observations. Consequently, the bound on the total number of required training signals is lower in Theorem 3 than in Theorem 1. Second, the recovered signals have the same observations as the original signals, as in (iii). In other words, even if the recovered signal $\hat{x}$ may not equal the original signal $x$, the observations are the same: $\Phi \hat{x} = \Phi x$, i.e., the recovered signal is indistinguishable from the original signal in the observation space.
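The dimension reduction can be made explicit with a short derivation. The notation below is an assumption made for illustration and may differ from the Appendix: $\Phi \in \mathbb{R}^{m \times n}$ has full row rank with SVD $\Phi = U [\Sigma \;\; 0] V^{T}$, and $V$ is split column-wise into $V_1$ and $V_2$.

```latex
\begin{aligned}
\Phi &= U \begin{bmatrix} \Sigma & 0 \end{bmatrix} V^{T}, \qquad
V = \begin{bmatrix} V_1 & V_2 \end{bmatrix}, \quad
V_1 \in \mathbb{R}^{n \times m}, \quad V_2 \in \mathbb{R}^{n \times (n-m)}, \\
y &= \Phi x = U \Sigma V_1^{T} x
  \;\Longrightarrow\; V_1^{T} x = \Sigma^{-1} U^{T} y, \\
x &= V_1 V_1^{T} x + V_2 V_2^{T} x
   = V_1 \Sigma^{-1} U^{T} y + V_2 z, \qquad z = V_2^{T} x \in \mathbb{R}^{n-m}, \\
\Phi \hat{x} &= \Phi \left( V_1 \Sigma^{-1} U^{T} y + V_2 \hat{z} \right) = y
  \quad \text{for every } \hat{z} \in \mathbb{R}^{n-m}.
\end{aligned}
```

The component $V_1 \Sigma^{-1} U^{T} y$ is determined by the observation alone, so only the $(n-m)$-dimensional component $z$ has to be learned, and any estimate $\hat{z}$ yields a recovered signal that matches the observation, because $\Phi V_1 = U \Sigma$ and $\Phi V_2 = 0$.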

5 Comparison with Sparse Recovery in Compressive Sensing

In this section, we assume the operator is linear, i.e., .

Sparse recovery Candès (2006)

Let , , and be the submatrix obtained by extracting the columns of corresponding to the indices in . is said to satisfy -restricted isometry property (RIP) if there exists such that

(12)

for all subsets with and coefficient sequences . is said to be -restricted isometry constant. It is shown in Candès and Tao (2005) that if satisfies the RIP with

(13)

then an -sparse signal can be recovered from its observation by -minimization Candès (2006).

Sparse signals with RIP conditions (12) and (13) are Lipschitz

Let the set of -sparse signals be . Conditions (12) and (13) in fact imply that is -Lipschitz. Indeed, let . Then is -sparse. According to Eq. (12),

(14)

where according to (13). Condition (14) leads to

(15)

Therefore, the set of -sparse signals for which RIP with (13) is satisfied is an -Lipschitz set, and consequently, according to Theorem 1, there exists a robust ML recovery algorithm for the -sparse signals if RIP is satisfied with condition (13).
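The implication from RIP to the Lipschitz condition can be checked numerically in the smallest case. The sketch below assumes sparsity 1, so that differences of signals are 2-sparse, and uses only the lower RIP bound $(1-\delta)\|c\|^2 \le \|\Phi_T c\|^2$ with $\delta < 1$, which gives $L = 1/\sqrt{1-\delta}$. The matrix, the amplitudes, and all function names are hypothetical; the restricted isometry constant over column pairs is obtained from the closed-form eigenvalues of each $2 \times 2$ Gram matrix.

```python
import itertools
import math


def pairwise_isometry_constant(columns):
    """Smallest delta with (1 - delta)||c||^2 <= ||A_T c||^2 <= (1 + delta)||c||^2
    for every pair T of columns, using the eigenvalues of each 2x2 Gram matrix."""
    delta = 0.0
    for a, b in itertools.combinations(columns, 2):
        gaa = sum(u * u for u in a)
        gbb = sum(v * v for v in b)
        gab = sum(u * v for u, v in zip(a, b))
        mean = 0.5 * (gaa + gbb)
        radius = math.sqrt((0.5 * (gaa - gbb)) ** 2 + gab ** 2)
        # Eigenvalues of [[gaa, gab], [gab, gbb]] are mean +/- radius.
        delta = max(delta, abs(mean + radius - 1.0), abs(mean - radius - 1.0))
    return delta


def apply(columns, x):
    """Matrix-vector product for a matrix stored as a list of columns."""
    rows = len(columns[0])
    return [sum(col[r] * xi for col, xi in zip(columns, x)) for r in range(rows)]


def norm(v):
    """Euclidean norm of a sequence."""
    return math.sqrt(sum(c * c for c in v))


# Hypothetical 2x3 matrix with unit-norm columns at 0, 60 and 120 degrees.
columns = [(math.cos(a), math.sin(a)) for a in (0.0, math.pi / 3, 2 * math.pi / 3)]
delta = pairwise_isometry_constant(columns)
L = 1.0 / math.sqrt(1.0 - delta)

# All 1-sparse signals in R^3 with a few hypothetical amplitudes.
signals = [tuple(a if i == k else 0.0 for i in range(3))
           for k in range(3) for a in (-1.0, 0.5, 2.0)]

# Check ||x1 - x2|| <= L * ||Phi x1 - Phi x2|| on every pair (small tolerance
# because the bound is attained with equality for some pairs).
holds = all(
    norm([p - q for p, q in zip(x1, x2)])
    <= L * norm([p - q for p, q in zip(apply(columns, x1), apply(columns, x2))]) + 1e-9
    for x1, x2 in itertools.combinations(signals, 2)
)
```

Here the 1-sparse signals form a Lipschitz set even though the matrix maps three dimensions to two, which illustrates why the Lipschitz condition only needs the lower isometry bound on differences.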

This shows that the Lipschitz condition (3) is more general and less restrictive than the RIP conditions (12) and (13). Of course, it must also be pointed out that the stronger RIP conditions (12) and (13) lead to a strong and constructive result that -sparse signals can be recovered by -minimization.

6 Conclusion

We have developed a framework to characterize the robust ML signal recovery. The theory in the framework makes the terminology "structured signals" in traditional signal recovery algorithms more precise. Here, the structured signals are the Lipschitz signals. For any given transformation , it is always possible to define a set of Lipschitz signals, i.e., structured signals, so that they can be robustly recovered by a trained ML model.

Although we have provided a complete characterization of ML signal recovery in theory, more work is needed to make this theoretical framework practical in general. For example, the bound on the total number of training signals required to guarantee robust recovery in this framework is too high to be used in practice. However, this theoretical work does provide insights that can guide the design of practical ML signal recovery algorithms. For linear systems, Theorem 3 suggests a practical method of using SVD to reduce the dimension of the output space of an ML model from $n$, the signal dimension, to $n - m$, where $m$ is the number of observations; this is the minimum possible dimension on which a recovery algorithm must learn.

Appendix

Proof of Property 1. If is a finite set and is injective on , then is well-defined in (5), and furthermore,

(16)

which shows , i.e., is an -Lipschitz set. Q.E.D.

Proof of Property 2. Let . There exist , with , so

which shows that is an -Lipschitz set. Q.E.D.

Proof of Lemma 1. Since is finite and is injective on it, it follows from Property 1 that is -Lipschitz for some . Following McShane-Whitney extension theorem McShane (1934); Whitney (1934), we define by

(17)

We show that is -Lipschitz. Indeed, since is finite, for any , there exists such that . Furthermore, from definition (17), , and therefore,

(18)

Reversing the roles of and in (18), we also have . This, together with (18), shows

(19)

Define by

(20)

Let . From (19), we have

(21)

which shows is -Lipschitz. From (17), it’s easy to verify that . Q.E.D.  It is important to point out that Eq.(17) provides an explicit and constructive Lipschitz hypothesis.

Proof of Theorem 1. By assumption, . Now define

(22)

Each in (22) is a hypercube in of length in each dimension, and therefore, we have

(23)

We now define

(24)

It is clear that  is -Lipschitz and finite with

(25)

It follows from Lemma 1 that there exists an -Lipschitz hypothesis and for all , which proves (i).

We now show (ii). For any , because , according to (8) and (22), there exists a such that . From (24), and .

(26)

Proof of Theorem 2. To show is -Lipschitz, let . For any , we have

(27)

Since is arbitrary while other variables are fixed, (27) implies (3), i.e., is Lipschitz. Q.E.D.

Proof of Corollary 1. Eq. (11) of Corollary 1 follows immediately from (27). Q.E.D.

Proof of Theorem 3. We start by following the same process as in the proof of Theorem 1, but change the factor in (22) to to obtain hypercubes and a finite set . Instead of defined in (22) and the bounds derived in (25) and (23), we now have

(28)

Without loss of generality, we assume is full rank (if not, can be reduced until it is). Performing the singular value decomposition (SVD) of , we have

(29)

are unitary matrices, and 0 is the matrix with all entries being 0. for all . We further split as where and . It is easy to show the following

(30)

Define by

(31)

We will show that is -Lipschitz. Indeed, for

(32)

Since is finite and -Lipschitz according to (32), Lemma 1 says that there is an -Lipschitz hypothesis such that for all , and consequently from (31),

(33)

Note that the output space of the hypothesis in (33) has dimension , instead of in Theorem 1. Now define by

(34)

The following shows (i):

(35)

The following shows (iii):

(36)

To show (ii), we note that similar to Proof of Theorem 1, for , there is an , such that .

(37)

Finally, for ,

(38)

References

  • [1] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H. Lin (2012) Learning from data. AMLBook. ISBN 1600490069, 9781600490064.
  • [2] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth (1989) Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36 (4), pp. 929–965.
  • [3] E. J. Candès (2006) Compressive sampling. In Proceedings of the International Congress of Mathematicians.
  • [4] E. J. Candès and T. Tao (2005) Decoding by linear programming. IEEE Trans. Inform. Theory 51, pp. 4203–4215.
  • [5] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk (2008) Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine 25 (2), pp. 83–91.
  • [6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in Neural Information Processing Systems 27 (NIPS 2014).
  • [7] H. Jiang, G. Huang, P. Wilford, and L. Yu (2015) Constrained and preconditioned stochastic gradient method. IEEE Transactions on Signal Processing 63 (10), pp. 2678–2691.
  • [8] M. J. Kearns and U. V. Vazirani (1994) An introduction to computational learning theory. MIT Press.
  • [9] J. Kim and K. Konstantinou (2001) Digital predistortion of wideband signals based on power amplifier model with memory. Electronics Letters 37 (23), pp. 1417–1418.
  • [10] V. Koltchinskii (2011) Oracle inequalities in empirical risk minimization and sparse recovery problems. Springer, Berlin, Heidelberg.
  • [11] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok (2016) ReconNet: non-iterative reconstruction of images from compressively sensed measurements. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [12] R. Kyng, A. Rao, S. Sachdeva, and D. A. Spielman (2015) Algorithms for Lipschitz learning on graphs. Proceedings of The 28th Conference on Learning Theory, pp. 1190–1223.
  • [13] D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin (2015) Towards a learning theory of cause-effect inference. International Conference on Machine Learning (ICML; JMLR W&CP) 37, pp. 1452–1461.
  • [14] E. J. McShane (1934) Extension of range of functions. Bull. Amer. Math. Soc. 40 (12), pp. 837–842.
  • [15] A. Mousavi and R. G. Baraniuk (2017) Learning to invert: signal recovery via deep convolutional networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • [16] B. Sun, M. P. Edgar, R. Bowman, L. E. Vittert, S. Welsh, A. Bowman, and M. J. Padgett (2013) 3D computational imaging with single-pixel detectors. Science 340 (6134), pp. 844–847.
  • [17] U. v. Luxburg and O. Bousquet (2004) Distance-based classification with Lipschitz functions. Journal of Machine Learning Research 5 (Jun), pp. 669–695.
  • [18] L. Valiant (1984) A theory of the learnable. Communications of the ACM 27.
  • [19] V. Vapnik (2000) The nature of statistical learning theory. Springer.
  • [20] H. Whitney (1934) Analytic extensions of differentiable functions defined in closed sets. Trans. Amer. Math. Soc. 36 (1), pp. 63–89.