Sparse Fourier Transform in Any Constant Dimension with Nearly-Optimal Sample Complexity in Sublinear Time
We consider the problem of computing a -sparse approximation to the Fourier transform of a length signal. Our main result is a randomized algorithm for computing such an approximation (i.e. achieving the sparse recovery guarantees using Fourier measurements) using samples of the signal in time domain that runs in time , where is the dimensionality of the Fourier transform. The sample complexity matches the lower bound of for non-adaptive algorithms due to [DIPW10] for any for a constant up to an factor. Prior to our work a result with comparable sample complexity and sublinear runtime was known for the Fourier transform on the line [IKP14], but for any dimension previously known techniques either suffered from a factor loss in sample complexity or required runtime.
- 1 Introduction
- 2 Preliminaries
- 3 The algorithm and proof overview
- 4 Organization
- 5 Analysis of LocateSignal: main definitions and basic claims
- 6 Analysis of LocateSignal: bounding norm of undiscovered elements
- 7 Analysis of ReduceL1Norm and SparseFFT
- 8 guarantees and constant SNR case
- 9 Utilities
- 10 Semi-equispaced Fourier Transform
- 11 Acknowledgements
- A Omitted proofs
The Discrete Fourier Transform (DFT) is a fundamental mathematical concept that allows to represent a discrete signal of length as a linear combination of pure harmonics, or frequencies. The development of a fast algorithm for Discrete Fourier Transform, known as FFT (Fast Fourier Transform) in 1965 revolutionized digital signal processing, earning FFT a place in the top 10 most important algorithms of the twentieth century [Cip00]. Fast Fourier Transform (FFT) computes the DFT of a length signal in time , and finding a faster algorithm for DFT is a major open problem in theoretical computer science. While FFT applies to general signals, many of the applications of FFT (e.g. image and video compression schemes such as JPEG and MPEG) rely on the fact that the Fourier spectrum of signals that arise in practice can often be approximated very well by only a few of the top Fourier coefficients, i.e. practical signals are often (approximately) sparse in the Fourier basis.
Besides applications in signal processing, the Fourier sparsity property of real world signal plays and important role in medical imaging, where the cost of measuring a signal, i.e. sample complexity, is often a major bottleneck. For example, an MRI machine effectively measures the Fourier transform of a signal representing the object being scanned, and the reconstruction problem is exactly the problem of inverting the Fourier transform of approximately given a set of measurements. Minimizing the sample complexity of acquiring a signal using Fourier measurements thus translates directly to reduction in the time the patient spends in the MRI machine [LDSP08] while a scan is being taken. In applications to Computed Tomography (CT) reduction in measurement cost leads to reduction in the radiation dose that a patient receives [Sid11]. Because of this strong practical motivation, the problem of computing a good approximation to the FFT of a Fourier sparse signal fast and using few measurements in time domain has been the subject of much attention several communities. In the area of compressive sensing [Don06, CT06], where one studies the task of recovering (approximately) sparse signals from linear measurements, Fourier measurements have been one of the key settings of interest. In particular, the seminal work of [CT06, RV08] has shown that length signals with at most nonzero Fourier coefficients can be recovered using only samples in time domain. The recovery algorithms are based on linear programming and run in time polynomial in . A different line of research on the Sparse Fourier Transform (Sparse FFT), initiated in the fields of computational complexity and learning theory, has been focused on developing algorithms whose sample complexity and running time scale with the sparsity as opposed to the length of the input signal. Many such algorithms have been proposed in the literature, including [GL89, KM91, Man92, GGI02, AGS03, GMS05, Iwe10, Aka10, HIKP12b, HIKP12a, BCG12, HAKI12, PR13, HKPV13, IKP14]. These works show that, for a wide range of signals, both the time complexity and the number of signal samples taken can be significantly sub-linear in , often of the form .
In this paper we consider the problem of computing a sparse approximation to a signal given access to its Fourier transform .
The algorithms are randomized
|[CT06, RV08, CGV12]|
|[Bou14, HR16]||linear program||yes|
Higher dimensional Fourier transform. While significant attention in the sublinear Sparse FFT literature has been devoted to the basic case of Fourier transform on the line (i.e. one-dimensional signals), the sparsest signals often occur in applications involving higher-dimensional DFTs. Although a reduction from DFT on a two-dimensional grid with relatively prime side lengths to a one-dimensional DFT of length is possible [GMS05, Iwe12]), the reduction does not apply to the most common case when the side lengths of the grid are equal to the same powers of two. It turns out that most sublinear Sparse FFT techniques developed for the one-dimensional DFT do not extend well to the higher dimensional setting, suffering from at least a polylogaritmic loss in sample complexity. Specifically, the only prior sublinear time algorithm that applies to general grids is due to to [GMS05], has sample and time complexity for a rather large value of . If is a power of , a two-dimensional adaptation of the [HIKP12a] algorithm (outlined in [GHI13]) has roughly time and sample complexity, and an adaptation of [IKP14] has sample complexity. In general dimension both of these algorithms have sample complexity .
Thus, none of the results obtained so far was able to guarantee sparse recovery from high dimensional (any ) Fourier measurements without suffering at least a polylogarithmic loss in sample complexity, while at the same time achieving sublinear runtime.
Our results. In this paper we give an algorithm that achieves the sparse recovery guarantees (1) with -dimensional Fourier measurements that uses samples of the signal and has the running time of . This is the first sublinear time algorithm that comes within a factor of the sample complexity lower bound of due to [DIPW10] for any dimension higher than one.
Sparse Fourier Transform overview. The overall outline of our algorithm follows the framework of [GMS05, HIKP12a, IKP14, IK14], which adapt the methods of [CCFC02, GLPS10] from arbitrary linear measurements to Fourier measurements. The idea is to take, multiple times, a set of linear measurements of the form
for random hash functions and random sign changes with . This corresponds to hashing to buckets. With such ideal linear measurements, hashes suffice for sparse recovery, giving an sample complexity.
The sparse Fourier transform algorithms approximate using linear combinations of Fourier samples. Specifically, the coefficients of are first permuted via a random affine permutation of the input space. Then the coefficients are partitioned into buckets. This step uses the“filtering” process that approximately partitions the range of into intervals (or, in higher dimension, squares, or balls) with coefficients each, where each interval corresponds to one bucket. Overall, this ensures that most of the large coefficients are “isolated”, i.e., are hashed to unique buckets, as well as that the contributions from the “tail” of the signal to those buckets is not much greater than the average (the tail of the signal defined as ). This allows one to mimic the iterative recovery algorithm described for linear measurements above. However, there are several difficulties in making this work using an optimal number of samples.
This enables the algorithm to identify the locations of the dominant coefficients and estimate their values, producing a sparse estimate of . To improve this estimate, we repeat the process on by subtracting the influence of during hashing, thereby refining the approximation of constructed. After a few iterations of this refinement process the algorithm obtains a good sparse approximation of .
A major hurdle in implementing this strategy is that any filter that has been constructed in the literature so far is imprecise in that coefficients contribute (“leak”’) to buckets other than the one they are technically mapped into. This contribution, however, is limited and can be controlled by the quality of the filter. The details of filter choice have played a crucial role in recent developments in Sparse FFT algorithms. For example, the best known runtime for one-dimensional Sparse FFT, due to [HIKP12b], was obtained by constructing filters that (almost) precisely mimic the ideal hash process, allowing for a very fast implementation of the process in dimension one. The price to pay for the precision of the filter, however, is that each hashing becomes a factor more costly in terms of sample complexity and runtime than in the idealized case. At the other extreme, the algorithm of [GMS05] uses much less precise filters, which only lead to a loss of sample complexity in higher dimensions , for a constant . Unfortunately, because of the imprecision of the filters the iterative improvement process becomes quite noisy, requiring iterations of the refinement process above. As [GMS05] use fresh randomness for each such iteration, this results in an factor loss in sample complexity. The result of [IKP14] uses a hybrid strategy, effectively interpolating between [HIKP12b] and [GMS05]. This gives the near optimal sample complexity in dimension one (i.e. Fourier transform on the line), but still suffers from a loss in dimension .
Techniques of [IK14]. The first algorithm to achieve optimal sample complexity was recently introduced in [IK14]. The algorithms uses an approach inspired by [GMS05] (and hence uses ‘crude’ filters that do not lose much in sample complexity), but introduces a key innovation enabling optimal sample complexity: the algorithm does not use fresh hash functions in every repetition of the refinement process. Instead, hash functions are chosen at the beginning of the process, such that each large coefficient is isolated by most of those functions with high probability. The same hash functions are then used throughout the execution of the algorithm. As every hash function required a separate set of samples to construct the buckets, reusing the hash functions makes sample complexity independent of the number of iterations, leading to the optimal bound.
While a natural idea, reusing hash functions creates a major difficulty: if the algorithm identified a non-existent large coefficient (i.e. a false positive) by mistake and added it to , this coefficient would be present in the difference vector (i.e. residual signal) and would need to be corrected later. As the spurious coefficient depends on the measurements, the ‘isolation’ properties required for recovery need not hold for it as its position is determined by the hash functions themselves, and the algorithm might not be able to correct the mistake. This hurdle was overcome in [IK14] by ensuring that no large coefficients are created spuriously throughout the execution process. This is a nontrivial property to achieve, as the hashing process is quite noisy due to use of the ‘crude’ filters to reduce the number of samples (because the filters are quite simple, the bucketing process suffers from substantial leakage). The solution was to recover the large coefficients in decreasing order of their magnitude. Specifically, in each step, the algorithm recovered coefficients with magnitude that exceeded a specific threshold (that decreases at an exponential rate). With this approach the norm of the residual signal decreases by a constant factor in every round, resulting in the even stronger sparse recovery guarantees in the end. The price to pay for this strong guarantee was the need for a very strong primitive for locating dominant elements in the residual signal: a primitive was needed that would make mistakes with at most inverse polynomial probability. This was achieved by essentially brute-force decoding over all potential elements in : the algorithm loops over all elements and for each tests, using the measurements taken, whether is a dominant element in the residual signal. This resulted in runtime.
Our techniques. In this paper we show how to make the aforementioned algorithm run in sub-linear time, at the price of a slightly increased sampling complexity of . To achieve a sub-linear runtime, we need to replace the loop over all coefficients by a location primitive (similar to that in prior works) that identifies the position of any large coefficient that is isolated in a bucket in time per bucket, i.e. without resorting to brute force enumeration over the domain of size . Unfortunately, the identification step alone increases the sampling complexity by per hash function, so unlike [IK14], here we cannot repeat this process using hash functions to ensure that each large coefficient is isolated by one of those functions. Instead, we can only afford hash functions overall, which means that fraction of large coefficients will not be isolated in most hashings. This immediately precludes the possibility of using the initial samples to achieve norm reduction as in [IK14]. Another problem, however, is that the weaker location primitive that we use may generate spurious coefficients at every step of the recovery process. These spurious coefficients, together with the fraction of non-isolated elements, contaminate the recovery process and essentially render the original samples useless after a small number of refinement steps. To overcome these hurdles, instead of the reduction process of [IK14] we use a weaker invariant on the reduction of mass in the ‘heavy’ elements of the signal throughout our iterative process. Specifically, instead of reduction of norm of the residual as in [IK14] we give a procedure for reducing the norm of the ‘head’ of the signal. To overcome the contamination coming from non-isolated as well as spuriously created coefficients, we achieve norm reduction by alternating two procedures. The first procedure uses the hash functions to reduce the norm of ‘well-hashed’ elements in the signal, and the second uses a simple sparse recovery primitive to reduce the norm of offending coefficients when the first procedure gets stuck. This can be viewed as a signal-to-noise ratio (SNR) reduction step similar in spirit the one achieved in [IKP14]. The SNR reduction phase is insufficient for achieving the sparse recovery guarantee, and hence we need to run a cleanup phase at the end, when the signal to noise ratio is constant. It has been observed before (in [IKP14]) that if the signal to noise ratio is constant, then recovery can be done using standard techniques with optimal sample complexity. The crucial difference between [IKP14] and our setting is, however, that we only have bounds on -SNR as opposed to -SNR In [IKP14]. It turns out, however, that this is not a problem – we give a stronger analysis of the corresponding primitive from [IKP14], showing that -SNR bound is sufficient.
For a positive even integer we will use the notation . We will consider signals of length , where is a power of and is the dimension. We use the notation for the root of unity of order . The -dimensional forward and inverse Fourier transforms are given by
respectively, where . We will denote the forward Fourier transform by and Note that we use the orthonormal version of the Fourier transform. We assume that the input signal has entries of polynomial precision and range. Thus, we have for all (Parseval’s identity). Given access to samples of , we recover a signal such that
We will use pseudorandom spectrum permutations, which we now define. We write for the set of matrices over with odd determinant. For and let . Since , this is a permutation. Our algorithm will use to hash heavy hitters into buckets, where we will choose . We will often omit the subscript and simply write when is fixed or clear from context. For we let be the “offset” of relative to (note that this definition is different from the one in [IK14]). We will always have , where is a power of .
Suppose that exists . For we define the permutation by .
The proof is given in [IK14] and we do not repeat it here. Define
In this paper, we assume knowledge of (a constant factor upper bound on suffices). We also assume that the signal to noise ration is bounded by a polynomial, namely that . We use the notation to denote the ball of radius around : , where , and is the circular distance on . We will also use the notation to denote . For a real number we write to denote the positive part of , i.e. if and otherwise.
We will use the filter constructed in [IK14]. The filter is defined by a parameter that governs its decay properties. The filter satisfies and
Lemma 2.3 (Lemma 3.1 in [Ik14]).
One has (1) for all such that and (2) for all as long as and (3) for all as long as is even.
Property (3) was not stated explicitly in Lemma 3.1 of [IK14], but follows directly from their construction.
The properties above imply that most of the mass of the filter is concentrated in a square of side , approximating the “ideal” filter (whose value would be equal to for entries within the square and equal to outside of it). Note that for each one has . We refer to the parameter as the sharpness of the filter. Our hash functions are not pairwise independent, but possess a property that still makes hashing using our filters efficient:
Lemma 2.5 (Lemma 3.2 in [Ik14]).
Let . Let be uniformly random with odd determinant. Then for all one has .
Pseudorandom spectrum permutations combined with a filter give us the ability to ‘hash’ the elements of the input signal into a number of buckets (denoted by ). We formalize this using the notion of a hashing. A hashing is a tuple consisting of a pseudorandom spectrum permutation , target number of buckets and a sharpness parameter of our filter, denoted by . Formally, is a function that maps a signal to signals, each corresponding to a hash bucket, allowing us to solve the -sparse recovery problem on input by reducing it to -sparse recovery problems on the bucketed signals. We give the formal definition below.
Definition 2.6 (Hashing ).
For a permutation , parameters , and , a hashing is a function mapping a signal to signals , where for each , such that for each
where is a filter with buckets and sharpness constructed in Lemma 2.3.
For a hashing we sometimes write to denote . We will consider hashings of the input signal , as well as the residual signal , where
Definition 2.7 (Measurement ).
For a signal , a hashing and a parameter , a measurement is the -dimensional complex valued vector of evaluations of a hashing at , i.e. length , indexed by and given by evaluating the hashing at , i.e. for
where is a filter with buckets and sharpness constructed in Lemma 2.3.
For any and any hashing define the vector by letting for every
We access the signal in Fourier domain via the function , which evaluates the hashing of residual signal at point , i.e. computes the measurement (the computation is done with polynomial precision). One can view this function as “hashing” into bins by convolving it with the filter constructed above and subsampling appropriately. The pseudocode for this function is given in section 9.2. In what follows we will use the following properties of HashToBins:
There exists a constant such that for any dimension , any integer , any , if are selected uniformly at random, the following conditions hold.
Let , , where is the filter with buckets and sharpness constructed in Lemma 2.3, and let . Then if , for any
For any one has . Furthermore, ;
for any hashing , if is chosen uniformly at random from , one has
Here is an absolute constant that can be chosen arbitrarily large at the expense of a factor of in runtime.
The proof of Lemma 2.9 is given in Appendix A. We will need several definitions and lemmas from [IK14], which we state here. We sometimes need slight modifications of the corresponding statements from [IK14], in which case we provide proofs in Appendix A. Throughout this paper the main object of our analysis is a properly defined set that contains the ’large’ coefficients of the input vector . Below we state our definitions and auxiliary lemmas without specifying the identity of this set, and then use specific instantiations of to analyze outer primitives such as ReduceL1Norm, ReduceInfNorm and RecoverAtConstSNR. This is convenient because the analysis of all of these primitives can then use the same basic claims about estimation and location primitives. The definition of given in (4) above is the one we use for analyzing ReduceL1Norm and the SNR reduction loop. Analysis of ReduceInfNorm (section 8.1) and RecoverAtConstantSNR (section 8.2) use different instantiations of , but these are local to the corresponding sections, and hence the definition in (4) is the best one to have in mind for the rest of this section.
First, we need the definition of an element being isolated under a hashing . Intuitively, an element is isolated under hashing with respect to set if not too many other elements are hashed too close to . Formally, we have
Definition 2.10 (Isolated element).
Let , where , . We say that an element is isolated under hashing at scale if
We say that is simply isolated under hashing if it is isolated under at all scales .
The following lemma shows that any element is likely to be isolated under a random permutation :
For any integer and any , if for smaller than an absolute constant, , and a hashing is chosen randomly (i.e. are chosen uniformly at random, and ), then each is isolated under permutation with probability at least .
The proof of the lemma is very similar to Lemma 5.4 in [IK14] (the only difference is that the ball is centered at the point that hashes to in Lemma 2.11, whereas it was centered at in Lemma 5.4 of [IK14]) and is given in Appendix A for completeness.
As every element is likely to be isolated under one random hashing, it is very likely to be isolated under a large fraction of hashings :
For any integer , and any , if for smaller than an absolute constant, , , a sequence of random hashings, then every is isolated with respect to under at least hashings with probability at least .
Follows by an application of Chernoff bounds and Lemma 2.11. ∎
It is convenient for our location primitive (LocateSignal, see Algorithm 1) to sample the signal at pairs of locations chosen randomly (but in a correlated fashion). The two points are then combined into one in a linear fashion. We now define notation for this common operation on pairs of numbers in . Note that we are viewing pairs in as vectors in dimension , and the operation below is just the dot product over this two dimensional space. However, since our input space is already endowed with a dot product (for we denote their dot product by ), having special notation here will help avoid confusion.
Operations on vectors in .
For a pair of vectors we let denote the vector such that
Note that for any one has , where addition for elements of is componentwise. We write for the all ones vector in dimension , and for the zero vector. For a set and a vector we denote
Definition 2.13 (Balanced set of points).
For an integer we say that a (multi)set is -balanced in coordinate if for every at least fraction of elements in the set belong to the left halfplane in the complex plane, where is the -th root of unity.
Note that if divides , then for any fixed value of the point is uniformly distributed over the -th roots of unity for some between and for every when is uniformly random in . Thus for we expect at least half the points to lie in the halfplane . A set is balanced if it does not deviate from expected behavior too much. The following claim is immediate via standard concentration bounds:
There exists a constant such that for any a power of two, , and a power of the following holds if . If elements of a (multi)set of size are chosen uniformly at random with replacement from , then with probability at least one has that for every the set is -balanced in coordinate .
Since we only use one value of in the paper (see line 8 in Algorithm 1), we will usually say that a set is simply ‘balanced’ to denote the -balanced property for this value of .
3 The algorithm and proof overview
In this section we state our algorithm and give an outline of the analysis. The formal proofs are then presented in the rest of the paper (the organization of the rest of the paper is presented in section 4). Our algorithm (Algorithm 2), at a high level, proceeds as follows.
Measuring . The algorithms starts by taking measurements of the signal in lines 5-16. Note that the algorithm selects hashings , where are selected uniformly at random, and for each selects a set of size that determines locations to access in frequency domain. The signal is accessed via the function HashToBins (see Lemma 2.9 above for its properties. The function HashToBins accesses filtered versions of shifted by elements of a randomly selected set (the number of shifts is ). These shifts are useful for locating ‘heavy’ elements from the output of HashToBins. Note that since each hashing takes samples, the total sample complexity of the measurement step is . This is the dominant contribution to sample complexity, but it is not the only one. The other contribution of comes from invocations of EstimateValues from our -SNR reduction loop (see below). The loop goes over iterations, and in each iteration EstimateValues uses fresh hash functions to keep the number of false positives and estimation error small.
The location algorithm is Algorithm 1. Our main tool for bounding performance of LocateSignal is Theorem 3.1, stated below. Theorem 3.1 applies to the following setting. Fix a set and a set of hashings that encode signal measurement patterns, and let denote the set of elements of that are not isolated with respect to most of these hashings. Theorem 3.1 shows that for any signal and partially recovered signal , if denotes the output list of an invocation of LocateSignal on the pair with measurements given by and a set of random shifts, then the norm of elements of the residual that are not discovered by LocateSignal can be bounded by a function of the amount of mass of the residual that fell outside of the ‘good’ set , plus the ‘noise level’ times .
If we think of applying Theorem 3.1 iteratively, we intuitively get that the fixed set of measurements given by hashings allows us to always reduce the norm of the residual on the ‘good’ set to about the amount of mass that is located outside of this good set(this is exactly how we use LocateSignal in our signal to noise ratio reduction loop below). In section 6 we prove
For any constant there exist absolute constants such that for any , , any integer and any such that , where , the following conditions hold if .
Let denote permutations, and let , , where for smaller than a constant. Let denote the set of elements that are not isolated with respect to at least a fraction of hashings . Then if additionally for every the sets are balanced in coordinate (as per Definition 2.13) for all , and , then
Reducing signal to noise ratio.
Once the samples have been taken, the algorithm proceeds to the signal to noise (SNR) reduction loop (lines 17-23). The objective of this loop is to reduce the mass of the top (about ) elements in the residual signal to roughly the noise level (once this is done, we run a ‘cleanup’ primitive, referred to as RecoverAtConstantSNR, to complete the recovery process – see below). Specifically, we define the set of ‘head elements’ in the original signal as
where is the average tail noise level. Note that we have . Indeed, if , more than elements of belong to the tail, amounting to more than tail mass. Ideally, we would like this loop to construct and approximation to supported only on such that , i.e. the -SNR of the residual signal on the set of heavy elements is reduced to a constant. As some false positives will unfortunately occur throughout the execution of our algorithm due to the weaker sublinear time location and estimation primitives that we use, our SNR reduction loop is to construct an approximation to with the somewhat weaker properties that
Thus, we reduce the -SNR on the set of ‘head’ elements to a constant, and at the same time not introduce too many spurious coefficients (i.e. false positives) outside , and these coefficients do not contribute much mass. The SNR reduction loop itself consists of repeated alternating invocations of two primitives, namely ReduceL1Norm and ReduceInfNorm. Of these two the former can be viewed as performing most of the reduction, and ReduceInfNorm is naturally viewed as performing a ‘cleanup’ phase to fix inefficiencies of ReduceL1Norm that are due to the small number of hash functions (only as opposed to in [IK14]) that we are allowed to use, as well as some mistakes that our sublinear runtime location and estimation primitives used in ReduceL1Norm might make.
ReduceL1Norm is presented as Algorithm 3 below. The algorithm performs rounds of the following process: first, run LocateSignal on the current residual signal, then estimate values of the elements that belong to the list output by LocateSignal, and only keep those that are above a certain threshold (see threshold in the call the EstimateValues in line 9 of Algorithm 3). This thresholding operation is crucial, and allows us to control the number of false positives. In fact, this is very similar to the approach of [IK14] of recovering elements starting from the largest. The only difference is that (a) our ‘reliability threshold’ is dictated by the norm of the residual rather than the norm, as in [IK14], and (b) some false positives can still occur due to our weaker estimation primitives. Our main tool for formally stating the effect of ReduceL1Norm is Lemma 3.2 below. Intuitively, the lemma shows that ReduceL1Norm reduces the norm of the head elements of the input signal by a polylogarthmic factor, and does not introduce too many new spurious elements (false positives) in the process. The introduced spurious elements, if any, do not contribute much mass to the head of the signal. Formally, we show in section 7.1
For any , any integer , for smaller than an absolute constant and the following conditions hold for the set , where . Suppose that .
For any sequence of hashings , , if denotes the set of elements of that are not isolated with respect to at least a fraction of the hashings , then for any , , if is a parameter such that
the following conditions hold.