Compressed Sensing Beyond the I.I.D. and Static Domains:
Theory, Algorithms and Applications
Department of Electrical and Computer Engineering
Advisers: Professor Min Wu and Professor Behtash Babadi
Readers: Shaul Druckmann, Radu Balan, Prakash Narayan, Johnathan Fritz
ABSTRACT
Title of dissertation:  COMPRESSED SENSING BEYOND THE IID 
AND STATIC DOMAINS:  
THEORY, ALGORITHMS AND APPLICATIONS  
Abbas Kazemipour  
Doctor of Philosophy, 2017  
Dissertation directed by:  Professors Min Wu and Behtash Babadi 
Department of Electrical and Computer Engineering 
Sparsity is a ubiquitous feature of many real-world signals such as natural images and neural spiking activities. Conventional compressed sensing utilizes sparsity to recover low-dimensional signal structures in high ambient dimensions from few measurements, provided that i.i.d. measurements are available. However, real-world scenarios typically exhibit non-i.i.d. and dynamic structures and are confined by physical constraints, preventing applicability of the theoretical guarantees of compressed sensing and limiting its applications. In this thesis we develop new theory, algorithms and applications for non-i.i.d. and dynamic compressed sensing by considering such constraints.
In the first part of this thesis we derive new optimal sampling-complexity tradeoffs for two commonly used models of dependent temporal structures: autoregressive processes and self-exciting generalized linear models. Our theoretical results successfully recover the temporal dependencies in neural activities, financial data and traffic data.
Next, we develop a new framework for studying temporal dynamics by introducing compressible state-space models, which simultaneously utilize spatial and temporal sparsity. We develop a fast algorithm for optimal inference on such models and prove its optimal recovery guarantees. Our algorithm shows significant improvement in detecting sparse events in biological applications such as spindle detection and calcium deconvolution.
Finally, we develop a sparse Poisson image reconstruction technique and the first compressive two-photon microscope, which uses lines of excitation across the sample at multiple angles. We recovered diffraction-limited images from relatively few incoherently multiplexed measurements, at a rate of 1.5 billion voxels per second.
Preface
Sparsity, coherence and dynamics are among the ubiquitous features of many real-world signals and systems. Examples include natural images, sounds and neural spiking activities. These signals exhibit sparsity; that is, there exists a basis in which the effective dimension of the signal is much smaller than its ambient dimension. Moreover, natural signals entail rich temporal dynamics. The theory of compressed sensing [71, 45, 49, 50, 145, 40] is concerned with the reconstruction of such signals from very few measurements by utilizing their sparse structure. Compressed sensing provides sharp tradeoffs between the number of measurements, sparsity, and estimation accuracy when random i.i.d. measurements are available. However, in most practical applications of interest, the measured signals and the corresponding covariates are highly interdependent and follow specific temporal dynamics. Although the theory of compressed sensing has not considered non-i.i.d. and dynamic domains, the recovery algorithms it suggests show remarkable performance once the structure of the underlying signal is taken into account.
On a high level, research in compressive sensing is conducted in three main branches: theory, algorithms and applications. In this thesis we revisit all three branches for nonlinear models with interdependent covariates as well as dynamic compressive sensing, with applications in neural signal processing, traffic modeling, financial data and media forensics. The three main proposed problems, our ongoing research and future work are detailed in the following chapters. The rest of this thesis is organized as follows: In Chapter 2 we introduce the problem of stable estimation of high-order AR processes, generalizing results from i.i.d. compressive sensing to the general class of stable AR processes with sub-Gaussian innovations. In doing so, we show that the spectral properties of stationary processes determine the interdependence of the covariates and, in general, the sampling-complexity tradeoffs of AR processes. In Chapter 3 we introduce the problem of robust estimation of generalized linear models. We prove theoretical guarantees of compressive sensing for such models, characterizing the sampling-complexity tradeoffs between the interdependence of the covariates and the number of measurements. We further corroborate our theoretical guarantees with simulated data as well as an application to retinal ganglion cell spiking activities. In Chapter 4 we provide theoretical guarantees of stable state estimation where the underlying dynamics follow a compressible state-space model. Moreover, we show applications of our results in denoising and spike deconvolution from calcium imaging recordings of neural spiking activities, and show promising recovery results from compressed imaging data. We propose an application of this method in designing compressive calcium imaging devices.
Finally, in Chapter 6 we use ideas from projection microscopy to develop a two-photon imaging technique that scans lines of excitation across the sample at multiple angles, recovering high-resolution images from relatively few incoherently multiplexed measurements. By using a static image of the sample as a prior, we recover neural activity from megapixel fields of view at rates exceeding 1 kHz.
Acknowledgment
I feel very fortunate to have had the chance to learn from and collaborate with incredibly talented individuals over the past five years. First and foremost, I would like to thank my academic advisor, Professor Min Wu, who supported and believed in me throughout my PhD. I feel extremely indebted to her for her patience, her critical thinking, and for giving me the opportunity to follow my research interests. Second, I would like to thank my co-advisor, Professor Behtash Babadi. His unique blend of creativity, generosity and energy makes him a role model for my future career. I would also like to thank my mentors at Janelia Research Campus, Professor Shaul Druckmann and Dr. Kaspar Podgorski, for their incredible support of my work and great contributions to my thesis.
I would like to thank my committee members and group leaders at Janelia for their encouragements and my collaborators at University of Maryland and Janelia who provided critical feedback on my work. Many of the interesting research problems and ideas came from fruitful discussions with them. I would also like to thank Professor Ali Olfat at University of Tehran for his incredible support and believing in me so early in my career.
I would like to thank my friends at UMD and Janelia, who made the past five years a pleasant experience for me.
Last but not least, I would like to thank my family, especially my parents who always chose my happiness over their own convenience. Thank you for filling my life with love.
Notations

Throughout this thesis we use bold lower- and upper-case letters to denote vectors and matrices, respectively. Parameter vectors are denoted by boldface Greek letters. For example, $\boldsymbol{\theta} = [\theta_1, \theta_2, \cdots, \theta_p]^\top$ denotes a $p$-dimensional parameter vector, with $\top$ denoting the transpose operator. For a vector $\boldsymbol{\theta}$, we define its decomposition into positive and negative parts given by
$$\boldsymbol{\theta} = \boldsymbol{\theta}^+ - \boldsymbol{\theta}^-,$$
where $\boldsymbol{\theta}^+ := \max(\boldsymbol{\theta}, \mathbf{0})$ and $\boldsymbol{\theta}^- := \max(-\boldsymbol{\theta}, \mathbf{0})$ element-wise. It can be shown that $\boldsymbol{\theta}^+$ and $\boldsymbol{\theta}^-$ are convex in $\boldsymbol{\theta}$. We use the notation $\boldsymbol{\theta}_{i:j}$ to denote the subvector $[\theta_i, \theta_{i+1}, \cdots, \theta_j]^\top$ for any $i \le j$. We denote estimated values by a hat (e.g., $\widehat{\boldsymbol{\theta}}$) and biased estimates with a superscript. Throughout the proofs, $c$'s express absolute constants which may change from line to line where there is no ambiguity. By $c_\eta$ we mean an absolute constant which only depends on a positive constant $\eta$. We denote the support of a vector $\boldsymbol{\theta}$ by $\mathrm{supp}(\boldsymbol{\theta})$ and its $i$th element by $\theta_i$.

Given a sparsity level $s$ and a vector $\boldsymbol{\theta}$, we denote the set of its $s$ largest-magnitude entries by $S$, the corresponding best $s$-term approximation by $\boldsymbol{\theta}_s$, and the best $s$-term approximation error by $\sigma_s(\boldsymbol{\theta}) := \|\boldsymbol{\theta} - \boldsymbol{\theta}_s\|_1$. When $\sigma_s(\boldsymbol{\theta})$ is suitably small for some $s \ll p$, we refer to $\boldsymbol{\theta}$ as compressible.

For simplicity of notation, we define $\boldsymbol{\theta}_0 := \mathbf{0}$ to be the all-zero vector. For a matrix $\mathbf{A}$, we denote the restriction of $\mathbf{A}$ to its first $k$ rows by $\mathbf{A}_k$. We use the convention that $\mathbf{a}_j$ represents the $j$th column of $\mathbf{A}$. The symbols $\odot$ and $\oslash$ denote element-wise multiplication and division, respectively. Throughout the chapter we will use the terms innovations and spikes interchangeably. Unless otherwise stated, a scalar function acts on a vector element-wise. For a matrix $\mathbf{A}$, its mixed norm is denoted by $\|\mathbf{A}\|_{1,1}$, the sum of the absolute values of its entries.
Chapter 1 Introduction
Data compression and its fundamental limits are among the main questions of communication theory [64]. Classical communication theory treats data acquisition and data compression as two separate problems and has studied each of these modules extensively; for example, it is well known in information theory that the fundamental limit of data compression is given by the entropy of the source. With the emergence of big data applications and the prohibitive cost of sensing mechanisms for high-resolution data, smarter sensing mechanisms seem inevitable.
The theory of compressive sensing [81] takes its name from the premise that in many applications data acquisition and compression can be performed simultaneously. This is done by utilizing the sparse structure of many natural signals such as images and sounds. For example, natural images are known to be sparse in Fourier bases. By taking advantage of this sparse structure one can go beyond the fundamental limits imposed by physical constraints and uncertainty principles.
Research in compressive sensing can be categorized in three major branches: mathematical theory, algorithm design and applications. In this thesis we focus on all three aspects. From a mathematical perspective, we use tools from empirical process theory and statistical signal processing to study the sampling-complexity tradeoffs of dynamic and non-i.i.d. compressive sensing. This is motivated by the fact that interdependence and dynamic structures are ubiquitous features of natural signals. From the algorithmic perspective, we provide fast algorithms for signal recovery under these conditions, and finally we apply our theory to several applications of interest.
In order to facilitate the reading of this thesis, we provide a very brief introduction to compressive sensing and a few key results which will be used recurrently throughout. A more detailed treatment can be found in [81].
1.1 Sparse Solutions of Underdetermined Systems
A vector $\boldsymbol{\theta} \in \mathbb{R}^p$ is $s$-sparse if it has at most $s$ nonzero entries, i.e. if $\|\boldsymbol{\theta}\|_0 \le s$. Let $\boldsymbol{\theta}_s$ denote the best $s$-term approximation to $\boldsymbol{\theta}$, obtained by keeping its $s$ largest-magnitude entries. We define
$$\varsigma_s := \|\boldsymbol{\theta} - \boldsymbol{\theta}_s\|_1 \qquad (1.1)$$
and
$$\epsilon_s := \|\boldsymbol{\theta} - \boldsymbol{\theta}_s\|_2, \qquad (1.2)$$
which are scalar functions of $s$ and $\boldsymbol{\theta}$, and capture the compressibility of the parameter vector in the $\ell_1$ and $\ell_2$ sense, respectively. Note that by definition $\epsilon_s \le \varsigma_s$. For a fixed $s$, we say that $\boldsymbol{\theta}$ is $\ell_1$-compressible if $\varsigma_s$ is negligible [145]. Note that when $\varsigma_s = 0$, the parameter vector is exactly $s$-sparse.
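As a concrete numerical illustration of these quantities, the best $s$-term approximation and the errors in (1.1)–(1.2) can be computed in a few lines. This is a minimal sketch (not from the thesis), with arbitrary example values:

```python
import numpy as np

def best_s_term(theta, s):
    """Best s-term approximation: keep the s largest-magnitude entries."""
    theta_s = np.zeros_like(theta)
    idx = np.argsort(np.abs(theta))[-s:]   # indices of the s largest entries
    theta_s[idx] = theta[idx]
    return theta_s

theta = np.array([5.0, -3.0, 0.5, 0.1, 0.0])
theta_2 = best_s_term(theta, 2)
err_l1 = np.linalg.norm(theta - theta_2, 1)   # l1 approximation error (1.1)
err_l2 = np.linalg.norm(theta - theta_2, 2)   # l2 approximation error (1.2)
```

Note that `err_l2 <= err_l1` always holds, consistent with the relation between the two error measures above.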
Let
$$\mathbf{y} = \mathbf{A}\boldsymbol{\theta} + \mathbf{n}, \qquad (1.3)$$
where $\mathbf{A} \in \mathbb{R}^{m \times p}$ is a known measurement matrix, $\mathbf{y} \in \mathbb{R}^m$ consists of $m$ linear measurements of $\boldsymbol{\theta}$, and $\mathbf{n}$ is the bounded measurement noise satisfying $\|\mathbf{n}\|_2 \le \varepsilon$.

The compressive sensing problem is concerned with solving (1.3) in the underdetermined setting, i.e. when $m < p$. In this setup, the key assumption of $\boldsymbol{\theta}$ being (close to) an $s$-sparse vector can be used to recover it. In the absence of observation noise, $\boldsymbol{\theta}$ can be exactly recovered from $\mathbf{y}$ if every $2s$ columns of $\mathbf{A}$ are linearly independent, which requires at least $m = 2s$ measurements. Special designs of measurement matrices such as Vandermonde matrices or Fourier matrices have resulted in elegant reconstruction algorithms such as Prony's method [157].
In the presence of noise, a higher number of measurements is needed. Ideally one would like to find the sparsest $\boldsymbol{\theta}$ which satisfies the bounded-noise constraint, i.e.
$$\widehat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \|\boldsymbol{\theta}\|_0 \quad \text{s.t.} \quad \|\mathbf{y} - \mathbf{A}\boldsymbol{\theta}\|_2 \le \varepsilon. \qquad (1.4)$$
Unfortunately, (1.4) is an NP-hard problem due to the combinatorial nature of the $\ell_0$ norm. The convex relaxation of (1.4) is therefore used most often, which is obtained by replacing the $\ell_0$ norm with an $\ell_1$ norm, i.e.
$$\widehat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \|\boldsymbol{\theta}\|_1 \quad \text{s.t.} \quad \|\mathbf{y} - \mathbf{A}\boldsymbol{\theta}\|_2 \le \varepsilon. \qquad (1.5)$$
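In the noiseless case ($\varepsilon = 0$), the relaxation (1.5) reduces to basis pursuit, which can be cast as a linear program by splitting $\boldsymbol{\theta}$ into its positive and negative parts. The sketch below is an illustration under assumed problem sizes (not code from the thesis), solved with SciPy's `linprog`:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, p = 30, 50                               # measurements, ambient dimension
A = rng.standard_normal((m, p)) / np.sqrt(m)
theta = np.zeros(p)
theta[[3, 17, 40]] = [1.0, -2.0, 0.5]       # 3-sparse ground truth
y = A @ theta                               # noiseless measurements

# Basis pursuit as an LP: write theta = u - v with u, v >= 0 and
# minimize sum(u) + sum(v) = ||theta||_1 subject to A(u - v) = y.
res = linprog(c=np.ones(2 * p),
              A_eq=np.hstack([A, -A]), b_eq=y,
              bounds=[(0, None)] * (2 * p), method="highs")
theta_hat = res.x[:p] - res.x[p:]
```

With $m$ on the order of $s\log(p/s)$ random Gaussian measurements, the $\ell_1$ solution typically coincides with the true sparse vector, which is the phenomenon the theory below makes precise.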
1.2 Restricted Isometry Property
For the convex optimization problem (1.5), the following property of the measurement matrix is sufficient for stable recovery of $\boldsymbol{\theta}$:

Definition 1 (Restricted Isometry Property [50]). A matrix $\mathbf{A}$ satisfies the RIP of order $s$ with constant $\delta_s \in (0,1)$ if
$$(1 - \delta_s)\|\boldsymbol{\theta}\|_2^2 \le \|\mathbf{A}\boldsymbol{\theta}\|_2^2 \le (1 + \delta_s)\|\boldsymbol{\theta}\|_2^2$$
for all $s$-sparse vectors $\boldsymbol{\theta}$.

Intuitively speaking, the RIP requires that the linear measurement operator act as a near-isometry on sparse vectors, resulting in invertibility of the underdetermined system of equations (1.3) over such vectors. The following theorem formalizes this idea:
Theorem 1 (Implications of the RIP [47]). Suppose that the restricted isometry constant of $\mathbf{A}$ satisfies $\delta_{2s} < \sqrt{2} - 1$. Then every solution $\widehat{\boldsymbol{\theta}}$ of (1.5) satisfies
$$\|\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|_2 \le C_0 \frac{\varsigma_s}{\sqrt{s}} + C_1 \varepsilon,$$
where $C_0$ and $C_1$ are constants depending only on $\delta_{2s}$.
For an arbitrary value of $\delta \in (0,1)$, one can prove the existence of a measurement matrix satisfying the RIP as long as $m$ scales linearly in $s$ and logarithmically in $p$, namely:

Theorem 2 (RIP for Random Matrices [81]). Let $\mathbf{A}$ be an $m \times p$ sub-Gaussian random matrix. Then there exists a constant $C > 0$ depending only on the sub-Gaussian parameters, such that the restricted isometry constant of $\frac{1}{\sqrt{m}}\mathbf{A}$ satisfies $\delta_s \le \delta$ with probability at least $1 - 2\exp(-c\delta^2 m)$, provided
$$m \ge C \delta^{-2}\, s \log(ep/s).$$

Setting $\delta$ to a fixed small constant yields the condition
$$m \ge C'\, s \log(ep/s),$$
which guarantees $\delta_s \le \delta$ with probability at least $1 - 2\exp(-c'm)$.
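The concentration behind Theorem 2 is easy to probe empirically. The sketch below (an informal experiment with arbitrarily chosen sizes, not a proof of the RIP, since it samples random supports rather than checking all of them) draws a Gaussian matrix scaled by $1/\sqrt{m}$ and verifies that $\|\mathbf{A}\boldsymbol{\theta}\|_2 / \|\boldsymbol{\theta}\|_2$ concentrates around 1 over random sparse vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
m, N, s = 200, 1000, 5
A = rng.standard_normal((m, N)) / np.sqrt(m)   # scaling gives E||A theta||^2 = ||theta||^2

ratios = []
for _ in range(200):
    theta = np.zeros(N)
    support = rng.choice(N, size=s, replace=False)  # random s-sparse vector
    theta[support] = rng.standard_normal(s)
    ratios.append(np.linalg.norm(A @ theta) / np.linalg.norm(theta))
ratios = np.array(ratios)
# For m on the order of s*log(N/s), the ratios concentrate tightly around 1.
```

The spread of `ratios` shrinks as $m$ grows, mirroring the $\delta^{-2}$ dependence in the sampling bound.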
We will next discuss the commonly used reconstruction algorithms for sparse recovery.
1.3 Sparse Recovery Algorithms
Popular sparse recovery algorithms in compressive sensing can be divided into three main categories: optimization methods, greedy methods, and thresholding-based methods [81]. In this thesis we focus on developing algorithms in the first two categories.
1.3.1 Optimization Problems
Three of the most popular optimization problems for sparse recovery are the quadratically constrained basis pursuit, the lasso [203] and basis pursuit denoising. The quadratically constrained basis pursuit algorithm solves the optimization problem
$$\widehat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \|\boldsymbol{\theta}\|_1 \quad \text{s.t.} \quad \|\mathbf{y} - \mathbf{A}\boldsymbol{\theta}\|_2 \le \varepsilon. \qquad (1.9)$$
Alternatively, one can find the solution which is closest to the measurements while maintaining a controlled sparsity level. Such a formulation is known as the lasso, given by
$$\widehat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \|\mathbf{y} - \mathbf{A}\boldsymbol{\theta}\|_2^2 \quad \text{s.t.} \quad \|\boldsymbol{\theta}\|_1 \le \tau. \qquad (1.10)$$
A closely related optimization problem, obtained via the Lagrangian formulation of both problems, is known as basis pursuit denoising, given by
$$\widehat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \tfrac{1}{2}\|\mathbf{y} - \mathbf{A}\boldsymbol{\theta}\|_2^2 + \lambda\|\boldsymbol{\theta}\|_1. \qquad (1.11)$$
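The Lagrangian form (1.11) is convenient computationally: its proximal gradient iteration is the classical iterative soft-thresholding algorithm (ISTA). The following is a minimal sketch; the problem sizes and the value of $\lambda$ are arbitrary choices for illustration, not values prescribed by the thesis:

```python
import numpy as np

def ista(A, y, lam, n_iter=1000):
    """Iterative soft-thresholding for min 0.5*||y - A@theta||_2^2 + lam*||theta||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1/L, L = Lipschitz const of the gradient
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = theta - step * (A.T @ (A @ theta - y))  # gradient step on the quadratic term
        theta = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # prox of lam*||.||_1
    return theta

rng = np.random.default_rng(2)
n, p = 100, 200
A = rng.standard_normal((n, p)) / np.sqrt(n)
theta_true = np.zeros(p)
theta_true[[10, 50, 120, 180]] = [1.0, -1.0, 0.8, -0.6]
y = A @ theta_true + 0.01 * rng.standard_normal(n)  # lightly noisy measurements
theta_hat = ista(A, y, lam=0.05)
```

The soft-threshold step is what produces exactly sparse iterates; larger $\lambda$ gives sparser but more biased estimates.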
1.3.2 Greedy Algorithms
Although there exist fast solvers for convex problems of the type given by Eq. (1.9), these algorithms are polynomial time in the problem dimensions and may not scale well with high-dimensional data. This motivates us to consider greedy solutions for the estimation of sparse parameters. In particular, in this thesis we consider generalizations of Orthogonal Matching Pursuit (OMP) [158] to general convex cost functions. The main idea behind OMP lies in its greedy selection stage, where the absolute value of the gradient of the cost function at the current solution is used as the selection metric. The OMP algorithm adds one index to the estimated support of $\boldsymbol{\theta}$ at each step. A flowchart of the algorithm is given in Table 1.1.

The choice of the maximum number of iterations will be discussed in detail later. One can repeat the process until a stopping criterion is met. The advantage of the OMP algorithm is that it breaks the high-dimensional optimization problem into several low-dimensional problems which can usually be solved much faster.
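For the squared-error cost, the greedy selection metric described above is $|\mathbf{A}^\top(\mathbf{y} - \mathbf{A}\boldsymbol{\theta})|$, and each iteration solves a small least-squares problem on the current support. Below is a minimal sketch of standard OMP (not the generalized variant developed later in the thesis; the sizes are illustrative assumptions):

```python
import numpy as np

def omp(A, y, n_iter):
    """Orthogonal Matching Pursuit for min ||y - A@theta||_2^2 with sparse theta."""
    support, residual = [], y.copy()
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # Greedy selection: index with the largest absolute gradient of the cost.
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # Low-dimensional least squares restricted to the current support.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        theta = np.zeros(A.shape[1])
        theta[support] = coef
        residual = y - A @ theta
    return theta, support

rng = np.random.default_rng(4)
m, p, s = 40, 100, 3
A = rng.standard_normal((m, p)) / np.sqrt(m)
theta_true = np.zeros(p)
theta_true[[5, 33, 70]] = [2.0, -1.0, 0.7]
y = A @ theta_true                       # noiseless measurements
theta_hat, support = omp(A, y, n_iter=s)
```

Each iteration touches only a support of size at most the iteration count, which is what makes OMP attractive when the ambient dimension is large.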
1.4 Theoretical Guarantees for Convex Cost Functions
In many applications of interest the measurements are not linear, or the noise is not additive or Gaussian. In these scenarios the objective function is not in the form of squared errors. We will now introduce the theoretical requirements which generalize RIP to general convex cost functions, and the corresponding theoretical guarantees.
1.4.1 Restricted Strong Convexity
General convex cost functions usually take the form of a loss function or a negative log-likelihood. Estimation problems for such cost functions are known as M-estimation problems. The general form of an M-estimation problem is given by
$$\widehat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta} \in \Theta} \mathcal{L}(\boldsymbol{\theta}), \qquad (1.12)$$
and its regularized counterpart is given by
$$\widehat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta} \in \Theta} \mathcal{L}(\boldsymbol{\theta}) + \lambda\|\boldsymbol{\theta}\|_1, \qquad (1.13)$$
where $\mathcal{L}(\cdot)$ is a cost function which is convex in $\boldsymbol{\theta}$, $\lambda$ is a regularization parameter and $\Theta$ is the convex set of admissible solutions.
The main sufficient condition for theoretical guarantees for general convex cost functions is the notion of Restricted Strong Convexity (RSC) [148]. By the convexity of the cost function, it is clear that a small change in the parameter vector results in a small change in the cost function. However, the converse is not necessarily true. Intuitively speaking, the RSC condition guarantees that the converse holds: a small change in the cost implies a small change in the parameter vector, i.e., the cost function is not too flat around the true parameter vector. A depiction of the RSC condition, adopted from [148], is given in Figure 1.1. In Figure 1.1(a), the RSC does not hold since a change along certain directions does not change the log-likelihood, whereas the log-likelihood in Figure 1.1(b) satisfies the RSC.
More formally, if the negative log-likelihood $\mathcal{L}$ is twice differentiable at the true parameter $\boldsymbol{\theta}^\star$, the RSC is equivalent to the existence of a lower quadratic bound on the negative log-likelihood:
$$\mathcal{L}(\boldsymbol{\theta}) - \mathcal{L}(\boldsymbol{\theta}^\star) - \langle \nabla\mathcal{L}(\boldsymbol{\theta}^\star), \boldsymbol{\theta} - \boldsymbol{\theta}^\star \rangle \ge \kappa \|\boldsymbol{\theta} - \boldsymbol{\theta}^\star\|_2^2 \qquad (1.14)$$
for a positive constant $\kappa$ and all $\boldsymbol{\theta}$ in a carefully-chosen neighborhood of $\boldsymbol{\theta}^\star$ depending on $s$ and $\lambda$. Based on the results of [148], when the RSC is satisfied, sufficient conditions akin to those in Theorem 1 can be obtained by estimating the Euclidean extent of the solution set around the true parameter vector. Here we restate the main result of [148] concerning the RSC and its implications in controlling the estimation error for general convex cost functions:
Proposition 1 (Implications of RSC (Theorem 1 of [148])). For a negative log-likelihood which satisfies the RSC with parameter $\kappa$, every solution $\widehat{\boldsymbol{\theta}}$ to the convex optimization problem (1.13) satisfies
$$\|\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}^\star\|_2^2 \le C_1 \frac{\lambda^2}{\kappa^2}\, s + C_2 \frac{\lambda}{\kappa}\, \varsigma_s \qquad (1.15)$$
with a choice of the regularization parameter
$$\lambda \ge 2\|\nabla\mathcal{L}(\boldsymbol{\theta}^\star)\|_\infty. \qquad (1.16)$$
The first term in the bound (1.15) is increasing in $s$ and corresponds to the estimation error of the $s$ largest components of $\boldsymbol{\theta}^\star$ in magnitude, whereas the second term is decreasing in $s$ and represents the cost of replacing $\boldsymbol{\theta}^\star$ with its best $s$-sparse approximation.
Similarly, the counterpart of OMP for general cost functions was introduced in [235] and is summarized in Table 1.2.
The main theoretical result regarding the generalized OMP is given by the following proposition, stating that the greedy procedure succeeds in obtaining a reasonable sparse approximation if the cost function satisfies the RSC:

Proposition 2 (Guarantees of OMP (Theorem 2.1 of [235])). Suppose that the cost function satisfies the RSC with a constant $\kappa$, and let the number of greedy iterations be chosen to satisfy (1.17). Then the output of the generalized OMP achieves a sparse approximation whose estimation error is controlled as in (1.18).
1.5 Roadmap of the Thesis
As noted earlier, most theoretical guarantees of compressive sensing provide sharp tradeoffs between the number of measurements, sparsity, and estimation accuracy when random i.i.d. measurements are available. However, in most practical applications of interest, the measured signals and the corresponding covariates are highly interdependent and follow specific dynamics. In Chapters 2 and 3 we generalize these theoretical guarantees to two large classes of stationary processes, namely autoregressive (AR) processes and generalized linear models (GLMs), where the covariates are highly nonlinear and non-i.i.d. Our theoretical results provide insight into the tradeoffs between the sampling requirements and a measure of dependence in these models.
In Chapters 4 and 6 we provide fast iterative algorithms for compressible state-space models as well as general convex optimization problems with positivity constraints. We use ideas from projection microscopy to develop a two-photon imaging technique that scans lines of excitation across the sample at multiple angles, recovering high-resolution images from relatively few incoherently multiplexed measurements. By using a static image of the sample as a prior, we recover neural activity from megapixel fields of view at rates exceeding 1 kHz.
Chapter 2 Sampling Requirements for Stable Autoregressive Estimation
Autoregressive (AR) models are among the most fundamental tools in analyzing time series. They have been useful for modeling signals in many applications including financial time series analysis [182] and traffic modeling [78, 9, 8, 25, 60, 178]. Due to their well-known approximation property, these models are commonly used to represent stationary processes in a parametric fashion and thereby preserve the underlying structure of these processes [11]. In general, the ubiquitous long-range dependencies in real-world time series, such as financial data, result in AR model fits with large orders [182].
In this chapter, we close a gap in the theory of compressed sensing for non-i.i.d. data by providing theoretical guarantees on the stable estimation of autoregressive processes, where the history of the process takes the role of the interdependent covariates. In doing so, we relax the assumptions of i.i.d. covariates and exact sparsity common in CS. Our results indicate that utilizing sparsity recovers important information about the intrinsic frequencies of such processes.
We consider the problem of estimating the parameters of a linear univariate autoregressive model with sub-Gaussian innovations from a limited sequence of consecutive observations. Assuming that the parameters are compressible, we analyze the performance of the regularized least squares estimator as well as a greedy estimator of the parameters, and characterize the sampling tradeoffs required for stable recovery in the non-asymptotic regime. In particular, we show that for a fixed sparsity level, stable recovery of AR parameters is possible when the number of samples scales sublinearly with the AR order. Our results improve over existing sampling complexity requirements in AR estimation using the LASSO, when the sparsity level scales faster than the square root of the model order. We further derive sufficient conditions on the sparsity level that guarantee the minimax optimality of the regularized least squares estimate. Applying these techniques to simulated data as well as real-world datasets from crude oil prices and traffic speed data confirms our predicted theoretical performance gains in terms of estimation accuracy and model selection.
2.1 Introduction
Autoregressive (AR) models are among the most fundamental tools in analyzing time series. Applications include financial time series analysis [182] and traffic modeling [78, 9, 8, 25, 60, 178]. Due to their well-known approximation property, these models are commonly used to represent stationary processes in a parametric fashion and thereby preserve the underlying structure of these processes [11]. In order to leverage the approximation property of AR models, parameter sets of very large order are often required [169]. For instance, any autoregressive moving average (ARMA) process can be represented by an AR process of infinite order. Statistical inference using these models is usually performed by fitting a long-order AR model to the data, which can be viewed as a truncation of the infinite-order representation [188, 86, 85, 105]. In general, the ubiquitous long-range dependencies in real-world time series, such as financial data, result in AR model fits with large orders [182].
In various applications of interest, the AR parameters fit to the data exhibit sparsity; that is, only a small number of the parameters are nonzero. Examples include autoregressive communication channel models, quasi-oscillatory data tuned around specific frequencies, and financial time series [21, 136, 178]. The nonzero AR parameters in these models correspond to the significant time lags at which the underlying dynamics operate. Traditional AR order selection criteria such as the Final Prediction Error (FPE) [13], the Akaike Information Criterion (AIC) [12] and the Bayesian Information Criterion (BIC) [185] are based on asymptotic lower bounds on the mean squared prediction error. Although there exist several improvements over these traditional results aiming at exploiting sparsity [105, 188, 221], the resulting criteria pertain to the asymptotic regime and their finite sample behavior is not well understood [89]. Non-asymptotic results for AR estimation, such as [89, 144], do not fully exploit the sparsity of the underlying parameters in favor of reducing the sample complexity; in particular, for an AR process of order $p$, the sufficient sampling requirements established in [89] and [144] scale with $p$ rather than with the sparsity level.
A relatively recent line of research employs the theory of compressed sensing (CS) for studying non-asymptotic sampling-complexity tradeoffs for regularized M-estimators. In recent years, the CS theory has become the standard framework for measuring and estimating sparse statistical models [71, 45, 50]. The theoretical guarantees of CS imply that when the number of incoherent measurements is roughly proportional to the sparsity level, stable recovery of the model parameters is possible. A key underlying assumption in many existing theoretical analyses of linear models is the independent and identically distributed (i.i.d.) structure of the covariates. The matrix of covariates is either formed by fully i.i.d. elements [180, 22], is based on row-i.i.d. correlated designs [237, 173], is Toeplitz-i.i.d. [99], or is circulant-i.i.d. [175], where the design is extrinsic, fixed in advance and independent of the underlying sparse signal. The matrix of covariates formed from the observations of an AR process does not fit into any of these categories, as the intrinsic history of the process plays the role of the covariates. Hence the underlying interdependence in the model hinders a straightforward application of existing CS results to AR estimation. Recent non-asymptotic results on the estimation of multivariate AR (MVAR) processes have been relatively successful in utilizing sparsity for such dependent structures. For Gaussian and low-rank MVAR models, respectively, sublinear sampling requirements have been established in [130, 96] and [147], using regularized LS estimators, under bounded operator norm assumptions on the transition matrix. These assumptions are shown to be restrictive for MVAR processes with lags larger than 1 [224]. These boundedness assumptions were relaxed for Gaussian, sub-Gaussian and heavy-tailed MVAR processes in [28] and [224, 227], respectively, with the latter requiring a number of samples scaling quadratically in the sparsity level.

However, this quadratic scaling requirement in the sparsity level for the case of sub-Gaussian and heavy-tailed innovations incurs a significant gap with respect to the optimal guarantees of CS (with linear scaling in sparsity), particularly when the sparsity level is allowed to scale with the model order.
In this chapter, we consider two of the widely-used estimators in CS, namely the regularized Least Squares (LS) estimator, or the LASSO, and the Orthogonal Matching Pursuit (OMP) estimator, and extend the non-asymptotic recovery guarantees of the CS theory to the estimation of univariate AR processes with compressible parameters using these estimators. In particular, we narrow the aforementioned gap between non-asymptotic sampling requirements for AR estimation and those promised by compressed sensing by providing sharper sampling-complexity tradeoffs, which improve over existing results when the sparsity grows faster than the square root of the model order. Our focus on the analysis of univariate AR processes is motivated by the application areas of interest in this chapter, which correspond to one-dimensional time series. Existing results in the literature [130, 96, 147, 28, 224, 227], however, consider the MVAR case and are thus broader in scope. We will therefore compare our results to the univariate specialization of the aforementioned results. Our main contributions can be summarized as follows:
First, we establish that for a univariate AR process with sub-Gaussian innovations, when the number of measurements scales sublinearly with the product of the ambient dimension $p$ and the sparsity level $s$, stable recovery of the underlying AR parameters is possible using the LASSO and OMP estimators, even though the covariates are highly interdependent and based solely on the history of the process. In particular, when the sparsity level grows faster than $\sqrt{p}$ and the LASSO is used, our results improve upon those of [224, 227] when specialized to the univariate AR case. For the special case of Gaussian AR processes, stronger results are available [28]. Moreover, our results provide a theory-driven choice of the number of iterations for stable estimation using the OMP algorithm, which has a significantly lower computational complexity than the LASSO.
Second, in the course of our analysis, we establish the Restricted Eigenvalue (RE) condition [31] for design matrices formed from a realization of an AR process in a Toeplitz fashion, when the number of samples scales sublinearly with the product of the sparsity level and the ambient dimension. To this end, we invoke appropriate concentration inequalities for sums of dependent random variables in order to capture and control the high interdependence of the design matrix. In the special case of a white sub-Gaussian process, i.e., a sub-Gaussian i.i.d. Toeplitz measurement matrix, we show that our result can be further strengthened, improving over the results of [99].
Third, we establish sufficient conditions on the sparsity level which result in the minimax optimality of the regularized LS estimator. Finally, we provide simulation results as well as applications to oil price and traffic data which reveal that the sparse estimates significantly outperform traditional techniques such as Yule-Walker-based estimators [196]. We employ statistical tests in the time and frequency domains to compare the performance of these estimators.
The rest of the chapter is organized as follows. In Section 2.2, we introduce the notation and problem formulation. In Section 2.3, we describe several methods for the estimation of the parameters of an AR process, present the main theoretical results of this chapter on robust estimation of AR parameters, and establish the minimax optimality of the regularized LS estimator. Section 2.4 includes our results on simulated data as well as real-world financial and traffic data, followed by concluding remarks in Section 2.5.
2.2 Problem Formulation
Consider a univariate AR($p$) process defined by
$$x_k = \sum_{i=1}^{p} \theta_i x_{k-i} + w_k, \qquad (2.1)$$
where $\{w_k\}$ is an i.i.d. sub-Gaussian innovation sequence with zero mean and variance $\sigma^2$. This process can be considered as the output of an LTI system with transfer function
$$H(z) = \frac{1}{1 - \sum_{i=1}^{p} \theta_i z^{-i}}. \qquad (2.2)$$
Throughout the chapter we will assume $\|\boldsymbol{\theta}\|_1 < 1$ to enforce the stability of the filter. We refer to this assumption as the sufficient stability assumption, since an AR process with poles within the unit circle does not necessarily satisfy $\|\boldsymbol{\theta}\|_1 < 1$. However, beyond second-order AR processes, it is not straightforward to state the stability of the process in terms of its parameters in closed algebraic form, which in turn makes both the analysis and optimization procedures intractable. As we will show later, the only major requirement of our results is the boundedness of the spectral spread (also referred to as the condition number) of the AR process. Although the sufficient stability condition is more restrictive, it significantly simplifies the spectral constants appearing in the analysis and clarifies the various tradeoffs in the sampling bounds (see, for example, Corollary 3).
The AR($p$) process given by (2.1) is stationary in the strict sense. Also, by (2.2) the power spectral density of the process equals
$$S(\omega) = \frac{\sigma^2}{\left|1 - \sum_{i=1}^{p} \theta_i e^{-j\omega i}\right|^2}. \qquad (2.3)$$
The sufficient stability assumption implies the boundedness of the spectral spread of the process, defined as
$$\rho := \frac{\max_\omega S(\omega)}{\min_\omega S(\omega)}.$$
We discuss how this assumption can be further relaxed in Appendix A.1.2. The spectral spread of a stationary process is, in general, a measure of how quickly the process reaches its ergodic state [89]. An important property that we will use later in this chapter is that the spectral spread is an upper bound on the eigenvalue spread of the covariance matrix of the process of arbitrary size [101].
We will also assume that the parameter vector $\boldsymbol{\theta}$ is compressible (to be defined more precisely later), and can be well approximated by an $s$-sparse vector where $s \ll p$. We observe $n$ consecutive snapshots of length $p$ (a total of $n + p$ samples) from this process and aim to estimate $\boldsymbol{\theta}$ by exploiting its sparsity; to this end, we aim at addressing the following questions in the non-asymptotic regime:

1. Are the conventional LASSO-type and greedy techniques suitable for estimating $\theta$?
2. What are the sufficient conditions on $n$, in terms of $p$ and $s$, to guarantee stable recovery?
3. Given these sufficient conditions, how do these estimators perform compared to conventional AR estimation techniques?
Traditionally, the Yule-Walker (YW) equations or least squares formulations are used to fit AR models. Since these methods do not utilize the sparse structure of the parameters, they usually require $n \gg p$ samples in order to achieve satisfactory performance. The YW equations can be expressed as
$$\mathbf{R}\,\theta = \mathbf{r}, \qquad (2.4)$$
where $\mathbf{R}$ is the $p \times p$ covariance matrix of the process and $\mathbf{r} = [r_1, r_2, \dots, r_p]^\top$, with $r_k$ denoting the autocorrelation of the process at lag $k$. The covariance matrix and autocorrelation vector are typically replaced by their sample counterparts. Estimation of the AR($p$) parameters from the YW equations can be efficiently carried out using Burg's method [43]. Other estimation techniques include LS regression and maximum likelihood (ML) estimation. In this chapter, we will consider Burg's method and the LS solution as comparison benchmarks. When $n$ is comparable to $p$, these two methods are known to exhibit substantial performance differences [137].
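A minimal sketch of Yule-Walker fitting with sample autocovariances (a simplified stand-in for Burg's method, which the chapter uses as a benchmark):

```python
import numpy as np

def yule_walker(x, p):
    """Estimate AR(p) parameters by solving R theta = r with sample autocovariances."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    # biased sample autocovariances r_0, ..., r_p
    r = np.array([x[: n - k] @ x[k:] / n for k in range(p + 1)])
    # Toeplitz covariance matrix built from the autocovariance sequence
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

# sanity check on a simulated AR(1) with theta = 0.5
rng = np.random.default_rng(0)
x = np.zeros(5000)
for k in range(1, 5000):
    x[k] = 0.5 * x[k - 1] + rng.standard_normal()
print(yule_walker(x, 1))
```

In the long-data regime ($n \gg p$) this recovers the coefficient accurately; the point of the chapter is that it degrades badly when $n$ is comparable to or smaller than $p$.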
When fitted to real-world data, the parameter vector $\theta$ usually exhibits a degree of sparsity. That is, only certain lags in the history have a significant contribution in determining the statistics of the process. These lags can be thought of as the intrinsic delays in the underlying dynamics. To be more precise, for a sparsity level $s$, we denote by $S$ the support of the $s$ largest elements of $\theta$ in absolute value, and by $\theta_s$ the best $s$-term approximation to $\theta$. We also define
$$\sigma_s(\theta) := \|\theta - \theta_s\|_1, \qquad \varepsilon_s(\theta) := \|\theta - \theta_s\|_2, \qquad (2.5)$$
which capture the compressibility of the parameter vector in the $\ell_1$ and $\ell_2$ sense, respectively. Note that by definition $\varepsilon_s(\theta) \le \sigma_s(\theta)$. For a fixed $s$, we say that $\theta$ is $\ell_1$-compressible if $\sigma_s(\theta)$ is negligibly small [145], and $\ell_2$-compressible if $\varepsilon_s(\theta)$ is negligibly small. Note that $\ell_2$-compressibility is a weaker condition than $\ell_1$-compressibility, and when $\sigma_s(\theta) = 0$, the parameter vector is exactly $s$-sparse.
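The best $s$-term approximation and its $\ell_1$/$\ell_2$ tails can be computed directly; a short sketch (function names are illustrative):

```python
import numpy as np

def best_s_term(theta, s):
    """Return the best s-term approximation of theta and the l1/l2 tail errors."""
    theta = np.asarray(theta, dtype=float)
    keep = np.argsort(np.abs(theta))[::-1][:s]   # indices of the s largest entries
    theta_s = np.zeros_like(theta)
    theta_s[keep] = theta[keep]
    tail = theta - theta_s
    return theta_s, np.linalg.norm(tail, 1), np.linalg.norm(tail, 2)

theta = np.array([0.5, 0.0, -0.2, 0.05, 0.0])
theta_2, sig, eps = best_s_term(theta, 2)
print(sig, eps)   # l1 tail 0.05, l2 tail 0.05
```

Since $\|v\|_2 \le \|v\|_1$ for any vector $v$, the $\ell_2$ tail never exceeds the $\ell_1$ tail, as asserted in the text.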
Finally, in this chapter, we are concerned with the compressed sensing regime where $n \ll p$, i.e., the observed data has a much smaller length than the ambient dimension of the parameter vector. The main estimation problem of this chapter can be summarized as follows: given $n + p$ observations from an AR($p$) process with sub-Gaussian innovations and bounded spectral spread, the goal is to estimate the unknown $p$-dimensional compressible AR parameter vector $\theta$ in a stable fashion (where the estimation error is controlled) when $n \ll p$.
2.3 Theoretical Results
In this section, we will describe the estimation procedures and present the main theoretical results of this chapter.
2.3.1 $\ell_1$-regularized least squares estimation
Given the sequence of observations $x_{-p+1}, \dots, x_n$ and an estimate $\theta$, the normalized estimation error can be expressed as:
$$\mathcal{L}(\theta) := \frac{1}{n}\,\|\mathbf{x} - \mathbf{X}\theta\|_2^2, \qquad (2.6)$$
where
$$\mathbf{x} := [x_1, x_2, \dots, x_n]^\top, \qquad \mathbf{X} := \begin{bmatrix} x_{0} & x_{-1} & \cdots & x_{-p+1} \\ x_{1} & x_{0} & \cdots & x_{-p+2} \\ \vdots & \vdots & & \vdots \\ x_{n-1} & x_{n-2} & \cdots & x_{n-p} \end{bmatrix}. \qquad (2.7)$$
Note that the matrix of covariates $\mathbf{X}$ is Toeplitz with highly interdependent elements. The LS solution is thus given by:
$$\widehat{\theta}_{\mathrm{LS}} := \underset{\theta \in \Theta}{\arg\min}\; \mathcal{L}(\theta), \qquad (2.8)$$
where
$$\Theta := \{\theta \in \mathbb{R}^p : \|\theta\|_1 < 1\}$$
is the convex feasible region for which the stability of the process is guaranteed. Note that the sufficient constraint $\|\theta\|_1 < 1$ is by no means necessary for stability. However, the set of all $\theta$ resulting in stability is in general not convex. We have thus chosen to cast the LS estimator of Eq. (2.8) (as well as its $\ell_1$-regularized version that follows) over a convex subset $\Theta$, for which fast solvers exist. In addition, as we will show later, this assumption significantly clarifies the various constants appearing in our theoretical analysis. In practice, the Yule-Walker estimate is obtained without this constraint, and is guaranteed to result in a stable AR process. Similarly, for the LS estimate, this condition is relaxed by obtaining the unconstrained LS estimate and checking post hoc for stability [160].
Consistency of the LS estimator given by (2.8) was shown in [182] for Gaussian innovations. In this case, the LS estimate corresponds to conditional ML estimation and is asymptotically unbiased under mild conditions; with $p$ fixed, the solution converges to the true parameter vector as $n \to \infty$, and the estimation error is of the order $O(\sqrt{p/n})$ in general [99]. However, when $p$ is allowed to scale with $n$, the convergence rate of the estimation error is not known in general.
In the regime of interest in this thesis, where $n \ll p$, the LS estimator is ill-posed and is typically regularized with a smooth $\ell_2$ norm. In order to capture the compressibility of the parameters, we instead consider the $\ell_1$-regularized LS estimator:
$$\widehat{\theta}_{\ell_1} := \underset{\theta \in \Theta}{\arg\min}\; \mathcal{L}(\theta) + \gamma_n \|\theta\|_1, \qquad (2.9)$$
where $\gamma_n > 0$ is a regularization parameter. This estimator, known as the Lagrangian form of the LASSO [203], has been comprehensively studied in the sparse recovery literature [220, 118, 139] as well as in AR estimation [118, 237, 144, 221]. A general asymptotic consistency result for LASSO-type estimators was established in [118]. Asymptotic consistency of LASSO-type estimators for AR estimation was shown in [237, 221]. For sparse models, non-asymptotic analysis of the LASSO with covariate matrices from row-i.i.d. correlated designs has been established in [237, 220].
In many applications of interest, the data correlations are exponentially decaying and negligible beyond a certain lag, and hence for large enough $p$, autoregressive models fit the data very well in the prediction error sense. An important question is thus: how many measurements are required for estimation stability? In the overdetermined regime of $n \gg p$, the non-asymptotic properties of the LASSO for model selection of AR processes have been studied in [144], where a sufficient sampling requirement is established. Recovery guarantees for LASSO-type estimators of multivariate AR (MVAR) parameters in the compressive regime of $n \ll p$ are studied in [130, 96, 147, 224, 28, 227]. In particular, sublinear scaling of $n$ with respect to the ambient dimension is established in [130, 96] for Gaussian MVAR processes and in [147] for low-rank MVAR processes, respectively, under the assumption of bounded operator norm of the transition matrix. In [28] and [224, 227], the latter assumption is relaxed for Gaussian, sub-Gaussian, and heavy-tailed MVAR processes, respectively. These results have significant practical implications as they reveal sufficient conditions on $n$ with respect to $p$ and $s$, as well as a criterion to choose $\gamma_n$, which result in stable estimation of $\theta$ from a considerably short sequence of observations. The latter is indeed the setting that we consider in this chapter, where the ambient dimension $p$ is fixed and the goal is to derive sufficient conditions on $n$ resulting in stable estimation.
It is easy to verify that the objective function and constraints in Eq. (2.9) are convex in $\theta$, and hence $\widehat{\theta}_{\ell_1}$ can be obtained using standard numerical solvers. Note that the solution to (2.9) might not be unique. However, we will provide error bounds that hold for all possible solutions of (2.9), with high probability.
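As one concrete (hypothetical) solver, the unconstrained Lagrangian problem in (2.9) can be handled by proximal gradient descent (ISTA), with the feasibility constraint replaced by a post hoc stability check as discussed above; the lag values and tuning below are illustrative:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, gamma, n_iter=500):
    """Minimize (1/2n)||y - X theta||_2^2 + gamma*||theta||_1 via proximal gradient."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the gradient
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - step * grad, step * gamma)
    return theta

# simulate a sparse AR(20): nonzero coefficients at lags 1 and 10
p, n = 20, 2000
theta_true = np.zeros(p)
theta_true[0], theta_true[9] = 0.4, 0.3           # ||theta||_1 = 0.7 < 1
rng = np.random.default_rng(1)
x = np.zeros(n + p)
for k in range(p, n + p):
    x[k] = x[k - p:k][::-1] @ theta_true + rng.standard_normal()
X = np.array([x[k - 1::-1][:p] for k in range(p, n + p)])   # Toeplitz covariates
theta_hat = lasso_ista(X, x[p:], gamma=0.02)
print(np.flatnonzero(np.abs(theta_hat) > 0.05))
```

The soft-thresholding step drives most coordinates exactly to zero, which is what makes the recovered support interpretable.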
Recall that the Yule-Walker solution is given by
$$\widehat{\theta}_{\mathrm{YW}} := \widehat{\mathbf{R}}^{-1}\,\widehat{\mathbf{r}}, \qquad (2.10)$$
where $\widehat{\mathbf{R}}$ and $\widehat{\mathbf{r}}$ denote the sample counterparts of $\mathbf{R}$ and $\mathbf{r}$. We further consider two other sparse estimators for $\theta$ obtained by penalizing the Yule-Walker equations. The $\ell_1$-regularized Yule-Walker estimator is defined as:
$$\widehat{\theta}_{\mathrm{YW},\ell_1} := \underset{\theta \in \Theta}{\arg\min}\; \bigl\|\widehat{\mathbf{r}} - \widehat{\mathbf{R}}\theta\bigr\|_2^2 + \gamma_n \|\theta\|_1, \qquad (2.11)$$
where $\gamma_n > 0$ is a regularization parameter. Similarly, using robust $\ell_1$ statistics instead of the Gaussian $\ell_2$ statistics, the estimation error can be redefined as:
$$\mathcal{L}_{\mathrm{YW},1}(\theta) := \bigl\|\widehat{\mathbf{r}} - \widehat{\mathbf{R}}\theta\bigr\|_1,$$
and we define the corresponding $\ell_1$-regularized estimates as
$$\widehat{\theta}_{\mathrm{YW},\mathrm{rob}} := \underset{\theta \in \Theta}{\arg\min}\; \mathcal{L}_{\mathrm{YW},1}(\theta) + \gamma_n \|\theta\|_1. \qquad (2.12)$$
2.3.2 Greedy estimation
Although there exist fast solvers for the convex problems of the type given by (2.9), (2.11) and (2.12), these algorithms are polynomial time in $n$ and $p$, and may not scale well with the dimension of the data. This motivates us to consider greedy solutions for the estimation of $\theta$. In particular, we will consider and study the performance of a generalized Orthogonal Matching Pursuit (OMP) algorithm [158, 235]. A flowchart of this algorithm is given in Table 2.1 for completeness. At each iteration, a new component of $\theta$ for which the gradient of the error metric is the largest in absolute value is chosen and added to the current support. The algorithm proceeds for a total of $s^\star$ steps, resulting in an estimate with $s^\star$ nonzero components. When the $\ell_2$ error metric $\mathcal{L}(\theta)$ of (2.6) is chosen, the generalized OMP corresponds to the original OMP algorithm. For the choice of the YW error metric, we denote the resulting greedy algorithm by ywOMP.
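A minimal sketch of the standard ($\ell_2$-metric) OMP variant described above, where each step adds the coordinate with the largest gradient magnitude and refits by least squares on the current support. The noiseless i.i.d. design here is only a sanity check, not the AR setting of the theorems:

```python
import numpy as np

def omp(X, y, n_steps):
    """Greedy OMP: select the coordinate with the largest gradient magnitude,
    then refit by least squares on the current support."""
    n, p = X.shape
    support, resid = [], y.astype(float).copy()
    coef = np.zeros(0)
    for _ in range(n_steps):
        score = np.abs(X.T @ resid)
        score[support] = -1.0                        # do not reselect
        support.append(int(np.argmax(score)))
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        resid = y - X[:, support] @ coef
    theta = np.zeros(p)
    theta[support] = coef
    return theta

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 30))
theta_true = np.zeros(30)
theta_true[[3, 11, 25]] = [1.0, -0.8, 0.6]
y = X @ theta_true                                   # noiseless measurements
theta_hat = omp(X, y, 3)
```

Because the refit is an orthogonal projection onto the selected columns, the residual is orthogonal to all previously chosen coordinates, so each step explores a new direction.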
2.3.3 Estimation performance guarantees
The main theoretical result regarding the estimation performance of the $\ell_1$-regularized LS estimator is given by the following theorem:
Theorem 3.
If $\theta \in \Theta$ is compressible, there exist positive constants $c_1$, $c_2$ and $c_3$ such that for $n \gtrsim c_1\, s \log p$ and a choice of regularization parameter $\gamma_n = c_2 \sqrt{\log p / n}$, any solution $\widehat{\theta}_{\ell_1}$ to (2.9) satisfies the bound
$$\bigl\|\widehat{\theta}_{\ell_1} - \theta\bigr\|_2 \le c_3 \left( \sqrt{\frac{s \log p}{n}} + \sqrt{\sigma_s(\theta)} \left(\frac{\log p}{n}\right)^{1/4} \right) \qquad (2.13)$$
with high probability. The constants, as well as the explicit probability of success, depend on the spectral spread of the process and are explicitly given in the proof.
Similarly, the following theorem characterizes the estimation performance bounds for the OMP algorithm:
Theorem 4.
If $\theta$ is $\ell_1$-compressible with a sufficiently small $\sigma_s(\theta)$, there exist positive constants $c_4$ and $c_5$ such that for $n \gtrsim c_4\, s \log p$, the OMP estimate satisfies the bound
$$\bigl\|\widehat{\theta}_{\mathrm{OMP}} - \theta\bigr\|_2 \le c_5 \left( \sqrt{\frac{s \log p}{n}} + \sigma_s(\theta) \right) \qquad (2.14)$$
after $s^\star$ iterations (with $s^\star$ denoting the number of greedy steps), with high probability. The constants, as well as the explicit probability of success, depend on the spectral spread of the process and are explicitly given in the proof.
The results of Theorems 3 and 4 suggest that under suitable compressibility assumptions on the AR parameters, one can estimate the parameters reliably using the $\ell_1$-regularized LS and OMP estimators with far fewer measurements compared to those required by the Yule-Walker/LS based methods. To illustrate the significance of these results further, several remarks are in order:
Remark 1. The sufficient stability assumption $\|\theta\|_1 < 1$ is restrictive compared to the class of all stable AR models. In general, the set of parameters which admit a stable AR process is not necessarily convex. This condition ensures that the resulting estimates of (2.9)-(2.12) pertain to stable AR processes and at the same time can be obtained by convex optimization techniques, for which fast solvers exist. A common practice in AR estimation, however, is to solve the unconstrained problem and check for the stability of the resulting AR process post hoc. In our numerical studies in Section 2.4, this procedure resulted in a stable AR process in all cases. Nevertheless, the guarantees of Theorems 3 and 4 hold for the larger class of stable AR processes, even though the corresponding estimates may not necessarily be obtained using convex optimization techniques. We further discuss this generalization in Appendix A.1.2.
Remark 2. When $\theta = \mathbf{0}$, i.e., the process is a sub-Gaussian white noise, the matrix $\mathbf{X}$ is a Toeplitz matrix with i.i.d. sub-Gaussian entries, the spectral-spread-dependent constants in Theorems 3 and 4 vanish, and the measurement requirements sharpen accordingly. Comparing this sufficient condition with that of [99] reveals an improvement in the sampling requirement by our results.
Remark 3. When $\theta \neq \mathbf{0}$, the dominant measurement requirements of Theorems 3 and 4 exceed those of [71, 45, 50, 220] for linear models with i.i.d. measurement matrices, or with row-i.i.d. correlated designs [237, 173]: a polynomial loss in the sparsity $s$ is incurred, although all these conditions allow $n \ll p$. However, the loss seems to be natural, as it stems from a major difference of our setting as compared to traditional CS: each row of the measurement matrix $\mathbf{X}$ depends highly on the entire observation sequence, whereas in traditional CS, each row of the measurement matrix is only related to the corresponding measurement. Hence, the aforementioned loss can be viewed as the price of self-averaging of the process, accounting for the low-dimensional nature of the covariate sample space and the high interdependence of the covariates with the observation sequence. Recent results on M-estimation of sparse MVAR processes with sub-Gaussian and heavy-tailed innovations [227, 224] require a sampling rate quadratic in the sparsity when specialized to the univariate case, which, compared to our results, trades the aforementioned loss for a quadratic requirement in $s$. However, in the overdetermined regime where $n$ grows polynomially in $p$, our results provide a saving over those of [227, 224].
Remark 4. It can be shown that the estimation error for the LS method in general scales as $O(\sqrt{p/n})$ [99], which is not desirable when $p$ is comparable to $n$. Our result, however, guarantees a much smaller error rate of the order $O(\sqrt{s \log p / n})$. Also, the sufficiency conditions of Theorem 4 require high compressibility of the parameter vector, whereas Theorem 3 does not impose any extra restrictions on $\theta$. Intuitively speaking, these two comparisons reveal the trade-off between computational complexity and measurement/compressibility requirements for convex optimization vs. greedy techniques, which is well known for linear models [41].
Remark 5. The compressibility condition in Theorem 3 is not restrictive for the processes of interest in this chapter. This is due to the fact that the boundedness assumption on the spectral spread implies an exponential decay of the parameters (see Lemma 1 of [89]). Finally, the constants in Theorems 3 and 4 are increasing with respect to the spectral spread $\rho$ of the process. Intuitively speaking, the closer the roots of the filter given by (2.2) get to the unit circle (corresponding to a larger spectral spread), the slower the convergence of the process will be to its ergodic state, and hence more measurements are required. A similar dependence on the spectral spread has appeared in the results of [89] for $\ell_2$-regularized least squares estimation of AR processes.
Remark 6. The main ingredient in the proofs of Theorems 3 and 4 is to establish the restricted eigenvalue (RE) condition introduced in [31] for the covariates matrix $\mathbf{X}$. Establishing the RE condition for this matrix is a nontrivial problem due to the high interdependence of its entries. We will indeed show that if the sufficient stability assumption holds, then with high probability the sample covariance matrix is sharply concentrated around the true covariance matrix, and hence the RE condition can be guaranteed. All constants appearing in Theorems 3 and 4, together with typical numerical values, are explicitly given in Appendix A.1.2, where the full proofs are also presented.
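The covariance-concentration ingredient builds on the classical fact quoted earlier in this chapter: the eigenvalues of the Toeplitz covariance matrix of a stationary process, of any size, lie between the extrema of its power spectral density. A numerical check for an AR(1) process, using the textbook autocovariance $r_k = \sigma^2 a^{|k|}/(1-a^2)$:

```python
import numpy as np

a, sigma2, m = 0.5, 1.0, 40
r = sigma2 * a ** np.arange(m) / (1 - a ** 2)        # AR(1) autocovariances
Sigma = np.array([[r[abs(i - j)] for j in range(m)] for i in range(m)])
eig = np.linalg.eigvalsh(Sigma)

S_min = sigma2 / (1 + a) ** 2                        # min of the PSD (at w = pi)
S_max = sigma2 / (1 - a) ** 2                        # max of the PSD (at w = 0)
print(eig.min(), S_min, eig.max(), S_max)
```

All forty eigenvalues fall inside $[\min_\omega S(\omega), \max_\omega S(\omega)]$, so the eigenvalue spread of the covariance matrix is bounded by the spectral spread, the property invoked in the RE analysis.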
2.3.4 Minimax optimality
In this section, we establish the minimax optimality of the $\ell_1$-regularized LS estimator for AR processes with sparse parameters. To this end, we will focus on the class of stationary processes which admit an AR($p$) representation with an $s$-sparse parameter vector $\theta$ satisfying $\|\theta\|_1 < 1$. The theoretical results of this section are inspired by the results of [89] on non-asymptotic order selection via $\ell_2$-regularized LS estimation in the absence of sparsity, and extend them by studying the $\ell_1$-regularized LS estimator of (2.9).
We define the maximal estimation risk of an estimator $\widehat{\theta}$ over this class to be
$$\mathcal{R}\bigl(\widehat{\theta}\bigr) := \sup_{\theta}\; \mathbb{E}\bigl[\|\widehat{\theta} - \theta\|_2^2\bigr]. \qquad (2.15)$$
The minimax estimator is the one minimizing the maximal estimation risk, i.e.,
$$\widehat{\theta}_{\mathrm{minimax}} := \underset{\widehat{\theta}}{\arg\min}\; \mathcal{R}\bigl(\widehat{\theta}\bigr). \qquad (2.16)$$
The minimax estimator, in general, cannot be constructed explicitly [89], and the common practice in nonparametric estimation is to construct an estimator $\widehat{\theta}$ which is order optimal as compared to the minimax estimator:
$$\mathcal{R}\bigl(\widehat{\theta}\bigr) \le C\, \mathcal{R}\bigl(\widehat{\theta}_{\mathrm{minimax}}\bigr), \qquad (2.17)$$
with $C$ being a constant. One can also define the minimax prediction risk by the maximal prediction error over all possible realizations of the process:
$$\mathcal{R}_{\mathrm{pred}}\bigl(\widehat{\theta}\bigr) := \sup_{\theta}\; \mathbb{E}\bigl[(x_{n+1} - \widehat{x}_{n+1})^2\bigr], \qquad (2.18)$$
where $\widehat{x}_{n+1}$ denotes the one-step prediction based on $\widehat{\theta}$.
In [89], it is shown that an $\ell_2$-regularized LS estimator with an order $p = O(\log n)$ is minimax optimal. This order pertains to the denoising regime where $n \gg p$. Hence, in order to capture long-order lags of the process, one requires a sample size exponentially large in $p$, which may make the estimation problem computationally infeasible. For instance, consider a sparse parameter vector with only $\theta_1$ and $\theta_p$ being nonzero. Then, in order to achieve minimax optimality, a number of measurements exponential in $p$ is required. In contrast, in the compressive regime where $n \ll p$, the goal, instead of selecting the order $p$, is to find conditions on the sparsity level $s$, so that for a given $p$ and large enough $n$, the $\ell_1$-regularized estimator is minimax optimal without explicit knowledge of the value of $s$ (see, for example, [46]).
In the following proposition, we establish the minimax optimality of the $\ell_1$-regularized estimator over the class of sparse AR processes with a sufficiently small sparsity level $s$:
Proposition 3.
Let $x_{-p+1}, \dots, x_n$ be samples of an AR process with $s$-sparse parameters satisfying the assumptions of Theorem 3. Then, we have:
$$\mathcal{R}\bigl(\widehat{\theta}_{\ell_1}\bigr) \le C\, \mathcal{R}\bigl(\widehat{\theta}_{\mathrm{minimax}}\bigr),$$
where $C$ is a constant which is only a function of the sparsity level and the spectral spread of the process, and is explicitly given in the proof.
Remark 7. Proposition 3 implies that $\ell_1$-regularized LS is minimax optimal in estimating the sparse parameter vector $\theta$, for small enough $s$. The proof of Proposition 3 is given in Appendix A.1.4. This result can be extended to compressible $\theta$ in a natural way with a bit more work, but we only present the proof for the case of sparse $\theta$ for brevity. We also state the following proposition on the prediction performance of the $\ell_1$-regularized LS estimator:
Proposition 4.
Let $x_{-p+1}, \dots, x_n$ be samples of an AR process with $s$-sparse parameters and Gaussian innovations. Then, there exists a positive constant $C$ such that for large enough $n$ and $s$ satisfying the assumptions of Proposition 3, we have:
$$\mathcal{R}_{\mathrm{pred}}\bigl(\widehat{\theta}_{\ell_1}\bigr) \le \sigma^2 \left(1 + C\,\frac{s \log p}{n}\right). \qquad (2.19)$$
It can be readily observed that for $n \gg s \log p$, the prediction error variance is very close to the variance of the innovations. The proof is similar to that of Theorem 3 of [89] and is skipped in this chapter for brevity.
2.4 Application to Simulated and Real Data
In this section, we study and compare the performance of Yule-Walker based estimation methods with those of the $\ell_1$-regularized and greedy estimators given in Section 2.3. These methods are applied to simulated data as well as real data from crude oil prices and traffic speed.
2.4.1 Simulation studies
In order to simulate an AR process, we filtered a Gaussian white noise process using an IIR filter with sparse parameters. Figure 2.1 shows a typical sample path of the simulated AR process used in our analysis. We fixed the length $p$ of the parameter vector $\theta$ and its sparsity level $s$, and employed $n$ generated samples of the corresponding process for estimation. The regularization parameter $\gamma_n$ is slightly tuned around the theoretical estimate given by Theorem 3. The order of the process is assumed to be known. We compare the performance of seven estimators: 1) the LS estimate, 2) the Yule-Walker estimate, 3) the $\ell_1$-regularized LS estimate of (2.9), 4) the OMP estimate, 5) the $\ell_1$-regularized Yule-Walker estimate of Eq. (2.11), 6) the robust estimate of Eq. (2.12), and 7) the ywOMP estimate using the YW cost function in the generalized OMP. Note that for the LS and Yule-Walker estimates, we have relaxed the condition $\|\theta\|_1 < 1$, to be consistent with the common usage of these methods. The Yule-Walker estimate is guaranteed to result in a stable AR process, whereas the LS estimate is not [160]. Figure 2.2 shows the estimated parameter vectors using these algorithms. It can be visually observed that the $\ell_1$-regularized and greedy estimators (shown in purple) significantly outperform the Yule-Walker based estimates (shown in orange).
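A small-scale, hypothetical version of this experiment (with illustrative values $p = 100$, $n = 80$ and three nonzero lags) comparing a ridge-stabilized Yule-Walker solve against a greedy OMP fit in the compressive regime:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, s = 100, 80, 3
theta = np.zeros(p)
theta[[1, 20, 59]] = [0.35, 0.25, -0.2]            # sparse, ||theta||_1 = 0.8 < 1

x = np.zeros(n + p)
for k in range(p, n + p):                           # simulate the sparse AR process
    x[k] = x[k - p:k][::-1] @ theta + rng.standard_normal()
X = np.array([x[k - 1::-1][:p] for k in range(p, n + p)])
y = x[p:]

# Yule-Walker with sample autocovariances (ill-conditioned for n < p; tiny ridge added)
xc = x - x.mean()
N = len(xc)
r = np.array([xc[: N - k] @ xc[k:] / N for k in range(p + 1)])
R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
theta_yw = np.linalg.solve(R + 1e-6 * np.eye(p), r[1:])

# OMP with s greedy steps
support, resid = [], y.copy()
for _ in range(s):
    score = np.abs(X.T @ resid)
    score[support] = -1.0
    support.append(int(np.argmax(score)))
    coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
    resid = y - X[:, support] @ coef
theta_omp = np.zeros(p)
theta_omp[support] = coef

mse_yw = np.sum((theta_yw - theta) ** 2) / np.sum(theta ** 2)
mse_omp = np.sum((theta_omp - theta) ** 2) / np.sum(theta ** 2)
print(mse_yw, mse_omp)
```

In this $n < p$ regime the sample covariance matrix is nearly singular, so the Yule-Walker solve amplifies noise, while the sparse greedy fit remains controlled; this mirrors the systematic gain reported in Figure 2.3.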
In order to quantify the latter observation precisely, we repeated the same experiment for a range of sample sizes $n$. A comparison of the normalized MSE of the estimators vs. $n$ is shown in Figure 2.3. As can be inferred from Figure 2.3, in the region where $n$ is comparable to or less than $p$ (shaded in light purple), the sparse estimators have a systematic performance gain over the Yule-Walker based estimates, with the $\ell_1$-regularized LS and ywOMP estimates outperforming the rest.
The MSE comparison in Figure 2.3 requires knowledge of the true parameters. In practice, the true parameters are not available for comparison purposes. In order to quantify the performance gain of these methods, we instead use statistical tests to assess the goodness-of-fit of the estimates. The common chi-square type statistical tests, such as the F-test, are useful when the hypothesized distribution to be tested against is discrete or categorical. For our problem setup with sub-Gaussian innovations, we will use a number of statistical tests appropriate for AR processes, namely, the Kolmogorov-Smirnov (KS) test, the Cramér-von Mises (CvM) criterion, the spectral Cramér-von Mises (SCvM) test and the Anderson-Darling (AD) test [65, 107, 15]. A summary of these tests is given in Appendix B.1. Table 2.2 summarizes the test statistics for the different estimation methods. Cells colored in orange (darker shade in grayscale) correspond to traditional AR estimation methods and those colored in blue (lighter shade in grayscale) correspond to the sparse estimator with the best performance among those considered in this work. These tests are based on the known results on limiting distributions of error residuals. As noted from Table 2.2, our simulations suggest that the OMP estimate achieves the best test statistics for the CvM, AD and KS tests, whereas the $\ell_1$-regularized estimate achieves the best SCvM statistic.
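As an illustration of the residual-based tests, the KS statistic compares the empirical CDF of the fitted innovations against the hypothesized innovation CDF (here standard normal; this is a sketch, not the exact test battery of Appendix B.1):

```python
import math
import numpy as np

def ks_statistic_normal(w):
    """KS distance between the empirical CDF of w and the N(0,1) CDF."""
    w = np.sort(np.asarray(w, dtype=float))
    n = len(w)
    Phi = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in w])
    ecdf_hi = np.arange(1, n + 1) / n
    ecdf_lo = np.arange(0, n) / n
    return max(np.max(ecdf_hi - Phi), np.max(Phi - ecdf_lo))

rng = np.random.default_rng(4)
good = rng.standard_normal(2000)          # residuals from a well-fit model
bad = rng.standard_normal(2000) + 0.3     # biased residuals from a misfit model
print(ks_statistic_normal(good), ks_statistic_normal(bad))
```

A well-fit model yields a statistic near the $1.36/\sqrt{n}$ critical value or below, while a systematic bias in the residuals inflates it, which is how the goodness-of-fit rankings in Tables 2.2-2.4 are read.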
Estimate \ Test  CvM  AD  KS  SCvM

0.31  1.54  0.031  0.009  
0.68  5.12  0.037  0.017  
0.65  4.87  0.034  0.025  
0.34  1.72  0.030  0.009  
0.29  1.45  0.028  0.009  
0.35  1.80  0.032  0.009  
0.42  2.33  0.040  0.008  
0.29  1.46  0.030  0.009 
2.4.2 Application to the analysis of crude oil prices
In this and the following subsection, we consider applications with real-world data. As the first application, we apply the sparse AR estimation techniques to analyze the crude oil price from the Cushing, OK WTI Spot Price FOB dataset [1]. This dataset consists of 7429 daily values of oil prices in dollars per barrel. In order to avoid outliers, the dataset is usually filtered with a high-order moving average filter. We have skipped this procedure by visually inspecting the data and selecting a segment of samples free of outliers. Such financial data sets are known for their non-stationarity and long-order history dependence. In order to remove the deterministic trends in the data, one-step or two-step time differencing is typically used. We refer to [178] for a full discussion of this detrending method. We have used first-order time differencing, which resulted in a sufficient detrending of the data. Figure 2.4 shows the data used in our analysis. We have chosen the order $p$ by inspection. The histogram of the first-order differences as well as the estimates are shown in Figure 2.5.
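First-order time differencing removes linear deterministic trends; a toy sketch on a synthetic price series (the trend and oscillation below are illustrative, not the oil data):

```python
import numpy as np

t = np.arange(200, dtype=float)
prices = 30.0 + 0.05 * t + np.sin(0.1 * t)   # toy series with a linear trend
diffed = np.diff(prices)                      # first-order time differencing
print(prices.std(), diffed.std())
```

After differencing, the dominant linear trend collapses to a roughly constant offset, leaving a far lower-variance series suitable for stationary AR modeling.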
A visual inspection of the estimates in Figure 2.5 shows that the $\ell_1$-regularized LS and OMP estimates consistently select specific time lags in the AR parameters, whereas the Yule-Walker and LS estimates seemingly overfit the data by populating the entire parameter space. In order to perform goodness-of-fit tests, we use an even/odd two-fold cross-validation. Table 2.3 shows the corresponding test statistics, which reveal that the $\ell_1$-regularized and OMP estimates indeed outperform the traditional estimation techniques.
Estimate \ Test  CvM  AD  KS  SCvM

0.88  5.55  0.055  0.046  
0.58  3.60  0.043  0.037  
0.27  1.33  0.031  0.020  
0.22  1.18  0.025  0.022  
0.28  1.40  0.027  0.021  
0.24  1.26  0.027  0.022  
0.23  1.18  0.026  0.022 
2.4.3 Application to the analysis of traffic data
Our second real data application concerns traffic speed data. The data used in our simulations is the INRIX® speed data for the I-495 Maryland inner loop freeway (clockwise) between US-1/Baltimore Ave/Exit 25 and Greenbelt Metro Dr/Exit 24, from July 1, 2015 to October 31, 2015 [2, 3]. A reference speed of 65 mph is reported. Our aim is to analyze the long-term, large-scale periodicities manifested in these data by fitting high-order sparse AR models. Given the large length of the data and its high variability, the following preprocessing steps were applied to the original data:

1. The data was downsampled and averaged by the hour in order to reduce its daily variability, so that each lag corresponds to one hour.
2. The logarithm of the speed was used for analysis, and the mean was subtracted. This reduces the high variability of speed due to rush hours and lower traffic during weekends and holidays.
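The two preprocessing steps can be sketched as follows (synthetic per-minute speeds; the two-week span and the uniform speed model are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
speed = 55 + 8 * rng.random(24 * 14 * 60)     # two weeks of per-minute speeds (mph)

# step 1: average by the hour, so one lag = one hour
hourly = speed[: len(speed) // 60 * 60].reshape(-1, 60).mean(axis=1)

# step 2: log-transform and subtract the mean
log_speed = np.log(hourly)
series = log_speed - log_speed.mean()
print(len(series), series.mean())
```

The resulting zero-mean hourly log-speed series is what a high-order sparse AR model would then be fitted to.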
Figure 2.6 shows a typical average weekly speed and travel time in this dataset and the corresponding 25th-75th percentiles. As can be seen, the data shows high variability around the morning and evening rush hours. In our analysis, we used the first half of the data for fitting, from which the AR parameters and the distribution and variance of the innovations were estimated. The statistical tests were designed based on the estimated distributions, and the statistics were computed accordingly using the second half of the data. We selected the order $p$ by inspection, noting that the data seems to have a periodicity of about one week, i.e., of the order of $24 \times 7 = 168$ samples.
Figure 2.7 shows part of the data used in our analysis as well as the estimated parameters. The $\ell_1$-regularized LS and OMP estimates are consistent in selecting the same components of $\theta$. These estimators pick up two major lags around which $\theta$ has its largest components. The first lag corresponds to about 24 hours, which is mainly due to the rush hour periodicity on a daily basis. The second lag is around 168 hours, which corresponds to weekly changes in the speed due to lower traffic over the weekend. In contrast, the Yule-Walker and LS estimates do not recover these significant time lags.
Estimate \ Test  CvM  AD  KS  SCvM

0.012  0.066  0.220  0.05  
1.4  2.1  6.7  0.25  
0.017  0.082  0.220  1.49  
0.025  0.122  0.270  0.14 
Statistical tests for a selected subset of the estimators are shown in Table 2.4. Interestingly, the $\ell_1$-regularized LS estimator significantly outperforms the other estimators in three of the tests. The Yule-Walker estimator, however, achieves the best SCvM test statistic.
2.5 Concluding Remarks
In this chapter, we investigated sufficient sampling requirements for stable estimation of AR models in the non-asymptotic regime using the $\ell_1$-regularized LS and greedy (OMP) estimation techniques. We further established the minimax optimality of the $\ell_1$-regularized LS estimator. Compared to the existing literature, our results provide several major contributions. First, our results suggest an improvement in the sampling requirements for the estimation of univariate AR models with sub-Gaussian innovations using the LASSO, over those of [224] and [227], which require a number of measurements quadratic in the sparsity for stable AR estimation. When specialized to a sub-Gaussian white noise process, i.e., establishing the RE condition for i.i.d. Toeplitz matrices, our results provide an improvement over those of [99]. Second, although OMP is widely used in practice, the choice of the number of greedy iterations is often ad hoc. In contrast, our theoretical results prescribe an analytical choice of the number of iterations required for stable estimation, thereby promoting the usage of OMP as a low-complexity algorithm for AR estimation. Third, we established the minimax optimality of the $\ell_1$-regularized LS estimator for the estimation of sparse AR parameters.
We further verified the validity of our theoretical results through simulation studies as well as applications to real financial and traffic data. These results show that the sparse estimation methods significantly outperform the widely used Yule-Walker based estimators in fitting AR models to the data. Although we did not theoretically analyze the performance of sparse Yule-Walker based estimators, they seem to perform on par with the $\ell_1$-regularized LS and OMP estimators based on our numerical studies. Moreover, as we will see in the next chapter, our results provide a striking connection to the estimation of sparse self-exciting discrete point process models. These models regress an observed binary spike train with respect to its history via Bernoulli or Poisson statistics, and are often used in describing the spontaneous activity of sensory neurons. Our results have shown that in order to estimate a sparse history-dependence parameter vector of length $p$ and sparsity $s$ in a stable fashion, a spike train whose length scales sublinearly in $p$ (modulo logarithmic factors) is required. This leads us to conjecture that these sublinear sampling requirements are sufficient for a larger class of autoregressive processes, beyond those characterized by linear models. Finally, our minimax optimality result requires the sparsity level to grow sufficiently slowly with the ambient dimension $p$. We consider further relaxation of this condition, as well as the generalization of our results to sparse MVAR processes, as future work.
Chapter 3 Robust Estimation of SelfExciting Generalized Linear Models
In this chapter, we close a gap in the theory of compressed sensing for non-i.i.d. data by providing theoretical guarantees on the stable estimation of self-exciting generalized linear models (GLMs). We consider the problem of estimating self-exciting GLMs from limited binary observations, where the history of the process serves as the covariate. In doing so, we relax the assumptions of i.i.d. covariates and exact sparsity common in CS. Our results indicate that utilizing sparsity recovers important information about the intrinsic frequencies of such processes. We analyze the performance of two classes of estimators, namely the $\ell_1$-regularized maximum likelihood and greedy estimators, for a canonical self-exciting process, and characterize the sampling trade-offs required for stable recovery in the non-asymptotic regime. Our results extend those of compressed sensing for linear and generalized linear models with i.i.d. covariates to those with highly interdependent covariates. We further provide simulation studies as well as applications to real spiking data from the mouse's lateral geniculate nucleus and the ferret's retinal ganglion cells under different nonlinear forward models, which agree with our theoretical predictions.
3.1 Introduction
The theory of compressed sensing (CS) has provided a novel framework for measuring and estimating statistical models governed by sparse underlying parameters [71, 45, 49, 50, 145, 40]. In particular, for linear models with random covariates and sparse parameters, CS theory provides sharp trade-offs between the number of measurements, the sparsity, and the estimation accuracy. Typical theoretical guarantees imply that when the number of random measurements is roughly proportional to the sparsity, stable recovery of these sparse models is possible.
Beyond those described by linear models, observations from binary phenomena form a large class of data in the natural and social sciences. Their ubiquity in disciplines such as neuroscience, physiology, seismology, criminology, and finance has urged researchers to develop formal frameworks to model and analyze these data. In particular, the theory of point processes provides a statistical machinery for the modeling and prediction of such phenomena. Traditionally, these models have been employed to predict the likelihood of self-exciting processes such as earthquake occurrences [153, 217], but they have recently found applications in several other areas. For instance, these models have been used to characterize heartbeat dynamics [23, 210] and violence among gangs [75]. Self-exciting point process models have also found significant applications in the analysis of neuronal data [38, 191, 37, 155, 207, 156, 163].
In particular, point process models provide a principled way to regress binary spiking data with respect to extrinsic stimuli and neural covariates, thereby forming predictive statistical models for neural spiking activity. Examples include place cells in the hippocampus [38], spectrotemporally tuned cells in the primary auditory cortex [44], and spontaneous retinal or thalamic neurons spiking under tuned intrinsic frequencies [33, 129]. Self-exciting point processes have also been utilized in assessing the functional connectivity of neuronal ensembles [115, 59]. When fitted to neuronal data, these models exhibit three main features: first, the underlying parameters are nearly sparse or compressible [206, 59]; second, the covariates are often highly structured and correlated; and third, the input-output relation is highly nonlinear. Therefore, the theoretical guarantees of compressed sensing do not readily translate to prescriptions for point process estimation.
Estimation of these models is typically carried out by maximum likelihood (ML) or regularized ML estimation in discrete time, where the process is viewed as a generalized linear model (GLM). In order to adjust the regularization level, empirical methods such as cross-validation are typically employed [59]. In the signal processing and information theory literature, sparse signal recovery under Poisson statistics has been considered in [142], with application to the analysis of ranking data. In [172], a similar setting has been studied, with motivation from imaging by photon-counting devices. Finally, in theoretical statistics, high-dimensional estimators with decomposable regularizers, such as the $\ell_1$ norm, have been studied for GLMs [148].
A key underlying assumption in the existing theoretical analysis of estimating GLMs is the independence and identical distribution (i.i.d.) of covariates. This assumption does not hold for self-exciting processes, since the history of the process takes on the role of the covariates. Nevertheless, regularized ML estimators show remarkable performance in fitting GLMs to neuronal data with history dependence and highly non-i.i.d. covariates. In this chapter, we close this gap by presenting new results on robust estimation of compressible GLMs, relaxing the assumptions of i.i.d. covariates and exact sparsity common in CS.
In particular, we will consider a canonical GLM and will analyze two classes of estimators for its underlying parameters: regularized maximum likelihood and greedy estimators. We will present theoretical guarantees that extend those of CS theory and characterize fundamental trade-offs between the number of measurements, model compressibility, and estimation error of GLMs in the non-asymptotic regime. Our results reveal that when the number of measurements scales sub-linearly with the product of the ambient dimension and a generalized measure of sparsity (modulo logarithmic factors), stable recovery of the underlying models is possible, even though the covariates depend solely on the history of the process. We will further discuss extensions of these results to more general classes of GLMs. Finally, we will present applications to simulated as well as real data from two classes of neurons exhibiting spontaneous activity, namely neurons in the mouse's lateral geniculate nucleus and the ferret's retinal ganglion cells, which agree with our theoretical predictions. Aside from their theoretical significance, our results are particularly important in light of technological advances in neural prostheses, which require robust neuronal system identification based on compressed data acquisition.
The rest of the chapter is organized as follows: In Section 3.2, we present our notational conventions, preliminaries and problem formulation. In Section 3.3, we discuss the estimation procedures and state the main theoretical results of this chapter. Section 3.4 provides numerical simulations as well as application to real data. In Section 3.5, we discuss the implications of our results and outline future research directions. Finally, we present the proofs of the main theoretical results and give a brief background on relevant statistical tests in Appendices A.2–B.2.
3.2 Preliminaries and Problem Formulation
We first give a brief introduction to self-exciting GLMs (see [66] for a detailed treatment).
We consider a sequence of observations in the form of binary spike trains, obtained by discretizing continuous-time observations (e.g., electrophysiology recordings) using bins of length $\Delta$. We assume that no more than one event falls into any given bin; in practice, this can always be achieved by choosing $\Delta$ small enough. The binary observation at bin $k$ is denoted by $n_k$. The observation sequence can be modeled as the outcome of conditionally independent Poisson or Bernoulli trials, with a spiking probability given by $\lambda_k \Delta$, where $\lambda_k \Delta$ is the probability of spiking at bin $k$ given the history of the process up to bin $k$.
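The binning step above can be sketched in NumPy. The helper name `bin_spike_train` and the example event times are hypothetical; collapsing multiple events in a bin to a single spike mirrors the assumption that $\Delta$ is chosen small enough for at most one event per bin:

```python
import numpy as np

def bin_spike_train(event_times, t_max, delta):
    """Discretize continuous-time event times on [0, t_max) into binary
    observations over bins of length delta.  Multiple events in one bin are
    collapsed to a single spike, mirroring the small-delta assumption."""
    n_bins = int(round(t_max / delta))
    counts, _ = np.histogram(event_times, bins=n_bins, range=(0.0, t_max))
    return (counts > 0).astype(int)  # the binary sequence n_1, ..., n_n

# Example: three events in a 10 ms window, binned at 1 ms resolution
spikes = bin_spike_train(np.array([0.0012, 0.0045, 0.0081]), t_max=0.01, delta=0.001)
```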
These models are widely used in neural data analysis and are motivated by continuous-time point processes with history-dependent conditional intensity functions [66]. For instance, given the history $H_t$ of a continuous-time point process up to time $t$, a conditional intensity of $\lambda_{t|H_t} = \lambda$ corresponds to the homogeneous Poisson process. As another example, a conditional intensity of $\lambda_{t|H_t} = \mu + \int_{-\infty}^{t} h(t-\tau)\, \mathrm{d}N_\tau$ corresponds to a process known as the Hawkes process [100], with baseline rate $\mu$ and history-dependence kernel $h(\cdot)$. Under the assumption of orderliness of a continuous-time point process, a discretized approximation to these processes can be obtained by binning the process with bins of length $\Delta$ and defining the spiking probability by $\lambda_k \Delta := \lambda_{k\Delta|H_{k\Delta}} \Delta$. In this chapter, we consider discrete random processes characterized by the spiking probability $\lambda_k \Delta$, which are either inherently discrete or employed as approximations to continuous-time point process models.
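A minimal simulation of such a discretized self-exciting process (a Hawkes-like model in which the spiking probability is a linear function of the recent history, clipped so it remains a valid probability) might look as follows. The function name and parameter values are illustrative assumptions, and the coefficients are taken to already absorb the bin length $\Delta$:

```python
import numpy as np

def simulate_discrete_hawkes(mu, theta, n_bins, rng):
    """Simulate binary observations from a discretized Hawkes-like process:
    the spiking probability at bin k is mu + sum_i theta[i] * n_{k-1-i},
    clipped to [0, 1] so that it stays a valid Bernoulli probability.
    mu and theta are assumed to already absorb the bin length Delta."""
    p = len(theta)
    n = np.zeros(n_bins, dtype=int)
    for k in range(n_bins):
        recent = n[max(0, k - p):k][::-1]      # n_{k-1}, n_{k-2}, ...
        prob = np.clip(mu + np.dot(theta[:len(recent)], recent), 0.0, 1.0)
        n[k] = int(rng.random() < prob)
    return n

rng = np.random.default_rng(0)
spikes = simulate_discrete_hawkes(mu=0.05, theta=np.array([0.3, 0.1]),
                                  n_bins=5000, rng=rng)
```

For these subcritical parameter values the empirical spiking rate settles near $\mu / (1 - \sum_i \theta_i)$, the stationary rate of the linear self-exciting model.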
Throughout the rest of the chapter, we drop the dependence of the conditional intensity $\lambda_{k|H_k}$ on the history $H_k$ to simplify notation, denote it by $\lambda_k$, and refer to $\lambda_k \Delta$ as the spiking probability. Given the sequence of binary observed data $\{n_k\}_{k=1}^{n}$, the negative log-likelihood function under Bernoulli statistics can be expressed as:
\[
\mathcal{L} = -\sum_{k=1}^{n} \left[ n_k \log(\lambda_k \Delta) + (1 - n_k) \log(1 - \lambda_k \Delta) \right] \tag{3.1}
\]
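As a sketch, the Bernoulli negative log-likelihood of Eq. (3.1) can be evaluated directly for a given vector of spiking probabilities; the helper name is hypothetical:

```python
import numpy as np

def bernoulli_negloglik(n, spike_prob):
    """Negative log-likelihood of Eq. (3.1) for binary observations n and
    per-bin spiking probabilities spike_prob (i.e., lambda_k * Delta)."""
    n = np.asarray(n, dtype=float)
    p = np.asarray(spike_prob, dtype=float)
    return -np.sum(n * np.log(p) + (1.0 - n) * np.log(1.0 - p))

# Example: four bins with constant spiking probability 0.2
nll = bernoulli_negloglik([1, 0, 0, 1], [0.2, 0.2, 0.2, 0.2])
```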
Another common likelihood model used in the analysis of neuronal spiking data corresponds to Poisson statistics [206], for which the negative log-likelihood takes the following form:
\[
\mathcal{L} = -\sum_{k=1}^{n} \left[ n_k \log(\lambda_k \Delta) - \lambda_k \Delta \right] \tag{3.2}
\]
Throughout the chapter, we will focus on binary observations governed by Bernoulli statistics, whose negative log-likelihood is given in Eq. (3.1). In applications such as electrophysiology, in which neuronal spiking activity is recorded at a high sampling rate, the binning size $\Delta$ is very small and the Bernoulli and Poisson statistics coincide.
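This coincidence can be checked numerically: for spiking probabilities $\lambda_k \Delta \ll 1$, $\log(1 - \lambda_k \Delta) \approx -\lambda_k \Delta$, so the Bernoulli and Poisson negative log-likelihoods of Eqs. (3.1) and (3.2) differ by $O\big((\lambda_k \Delta)^2\big)$ per bin. A quick check with illustrative values:

```python
import numpy as np

def bernoulli_nll(n, p):
    return -np.sum(n * np.log(p) + (1 - n) * np.log(1 - p))

def poisson_nll(n, p):
    # Poisson-statistics form of Eq. (3.2), for binary observations n
    return -np.sum(n * np.log(p) - p)

rng = np.random.default_rng(1)
p = np.full(100_000, 1e-3)                  # small spiking probabilities lambda_k * Delta
n = (rng.random(p.size) < p).astype(float)  # conditionally independent Bernoulli draws

gap = abs(bernoulli_nll(n, p) - poisson_nll(n, p))
rel_gap = gap / bernoulli_nll(n, p)         # negligible for small lambda_k * Delta
```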
When the discrete process is viewed as an approximation to a continuous-time process, these log-likelihood functions are known as the Jacod log-likelihood approximations [66]. We will present our analysis for the negative log-likelihood given by (3.1), but our results can be extended to other statistics, including (3.2) (see the remarks in Section 3.3).
Throughout this chapter, $\{n_k\}_{k=1}^{n}$ will be considered as the observed spiking sequence, which will be used for estimation purposes. A popular class of models for $\{n_k\}_{k=1}^{n}$ is given by GLMs. In its general form, a GLM consists of two main components: an observation model and an equation expressing some (possibly nonlinear) function of the observation mean as a linear combination of the covariates. In neural systems, the covariates consist of external stimuli as well as the history of the process. Inspired by spontaneous neuronal activity, we consider fully self-exciting processes, in which the covariates are functions of the process history alone. As a canonical GLM inspired by the Hawkes process, we consider a process for which the spiking probability is a linear function of the process history:
\[
\lambda_k \Delta = \mu + \sum_{i=1}^{p} \theta_i\, n_{k-i} \tag{3.3}
\]
where $\mu$ is a positive constant representing the baseline rate, and