
# Recursive Sparse Point Process Regression with Application to Spectrotemporal Receptive Field Plasticity Analysis

Alireza Sheikhattar, Jonathan B. Fritz, Shihab A. Shamma, and Behtash Babadi. A. Sheikhattar is with the Department of Electrical and Computer Engineering (ECE), University of Maryland, College Park, MD; J. B. Fritz is with the Institute for Systems Research (ISR), University of Maryland, College Park, MD; S. A. Shamma and B. Babadi are with ECE and ISR, University of Maryland, College Park, MD (e-mails: arsha89@umd.edu; ripple@umd.edu; sas@umd.edu; behtash@umd.edu). Corresponding author: B. Babadi (e-mail: behtash@umd.edu).
###### Abstract

We consider the problem of estimating the sparse time-varying parameter vectors of a point process model in an online fashion, where the observations and inputs respectively consist of binary and continuous time series. We construct a novel objective function by incorporating a forgetting factor mechanism into the point process log-likelihood to enforce adaptivity and employ ℓ1-regularization to capture the sparsity. We provide a rigorous analysis of the maximizers of the objective function, which extends the guarantees of compressed sensing to our setting. We construct two recursive filters for online estimation of the parameter vectors based on proximal optimization techniques, as well as a novel filter for recursive computation of statistical confidence regions. Simulation studies reveal that our algorithms outperform several existing point process filters in terms of trackability, goodness-of-fit and mean square error. We finally apply our filtering algorithms to experimentally recorded spiking data from the ferret primary auditory cortex during attentive behavior in a click rate discrimination task. Our analysis provides new insights into the time-course of the spectrotemporal receptive field plasticity of the auditory neurons.

Adaptive filtering; point process models; compressed sensing; neural signal processing; receptive field plasticity.

## I Introduction

Analyses of spiking activity recorded from sensory neurons have revealed three main features: first, neuronal activity is stochastic and exhibits significant variability across trials; second, the spiking statistics often undergo rapid changes referred to as neuronal plasticity, in order to adapt to changing stimulus salience and behavioral context; and third, the tuning characteristics of sensory neurons to the stimuli exhibit a degree of sparsity. Examples include place cells in the hippocampus [1] and spectrotemporally tuned cells in the primary auditory cortex [2]. Hence, in order to gain insight into the functional mechanism of the underlying neural system, it is crucial to have a mathematical theory to simultaneously capture the stochasticity, dynamicity and sparsity of neuronal activity.

On one hand, the theory of point processes [3] has been recently adopted as a mathematical framework to model the stochasticity of neuronal data. Traditionally, these models have been used to predict the likelihood of self-exciting processes such as earthquake occurrences [4, 5], but have recently found significant applications in the analysis of neuronal data [6, 7, 8, 9, 10, 11, 12].

On the other hand, classic results in signal processing such as the Least Mean Squares (LMS) and Recursive Least Squares (RLS) algorithms [13] have created a framework to efficiently capture the dynamics of the parameters in linear observation models. Existing solutions in computational neuroscience have adopted this framework to estimate the dynamics of neuronal activity. For instance, in [7] an LMS-type point process filter was introduced to study plasticity in hippocampal neurons. In [14], more general adaptive filtering solutions based on approximations to the Chapman-Kolmogorov equation were introduced. Although quite powerful in analyzing neuronal data, these solutions do not account for the sparsity of the underlying parameters.

Finally, the theory of compressed sensing (CS) has provided a novel methodology for measuring and estimating statistical models governed by sparse underlying parameters [15, 16, 17]. In particular, for static linear and generalized linear models (GLM) with random covariates and sparsity of the parameters, the CS theory characterizes sharp trade-offs between the number of measurements, sparsity, and estimation accuracy [16, 18]. The sparse solutions of CS are typically achieved using batch-mode convex programs and greedy techniques. In online settings, sparse adaptive filters have only been introduced in the context of linear systems governed by sparse parameters such as communication channels [19, 20, 21].

Despite significant progress in all these research fronts, a unified framework to simultaneously capture the stochasticity, dynamicity and sparsity of neuronal data is lacking. In this paper, we close this gap by integrating techniques from point process theory, adaptive filtering, and compressed sensing. To this end, we consider the problem of estimating time-varying stimulus modulation coefficients (e.g., receptive fields) from a sequence of binary observations in an online fashion. We model the spiking activity by a conditional Bernoulli point process, where the conditional intensity is a logistic function of the stimulus and its time lags. We then design a novel objective function by incorporating the forgetting factor mechanism of RLS-type algorithms into the ℓ1-regularized maximum likelihood estimation of the point process parameters. We present theoretical guarantees that extend those of CS theory and characterize fundamental trade-offs between the number of measurements, forgetting factor, model compressibility, and estimation error of the underlying point processes in the non-asymptotic regime. We next develop two adaptive filters for recursive maximization of the objective function based on proximal gradient techniques, as well as a filter for recursive computation of statistical confidence regions.

In order to validate our algorithms, we provide simulation studies which reveal that the proposed adaptive filtering algorithms significantly outperform existing point process filters in terms of goodness-of-fit, mean square error and trackability. We finally apply our proposed filters to multi-unit spiking data from ferret primary auditory cortex (A1) during passive stimulus presentation and during performance of a click rate discrimination task [22] in order to characterize the spectrotemporal receptive field (STRF) plasticity of A1 neurons. Application of our algorithm to these data provides new insights into the time course of attention-driven STRF plasticity, with over 3 orders of magnitude increase in temporal resolution from minutes to centiseconds, while capturing the underlying sparsity in a robust fashion. Aside from their theoretical significance, our results are particularly important in light of the recent technological advances in neural prostheses, which require real-time robust neuronal system identification from limited data.

The rest of the paper is organized as follows: In Section II, we present our notational conventions, preliminaries and problem formulation. In Section III, we introduce the main theoretical results of this paper, including the construction and stability analysis of the objective function, recursive filter development, and computation of confidence regions. Section IV provides numerical simulations as well as application to real data, followed by our concluding remarks in Section V. Technical details of Section III are presented in Appendices A–C.

## II Preliminaries and Problem Definition

We first give a brief introduction to point process models (see [3] for a detailed treatment). We will use the following notation throughout the paper. Parameter vectors are denoted by bold-face Greek letters; for example, ω denotes an M-dimensional parameter vector, with ′ denoting the transpose operator.

Consider a stochastic process defined by a sequence of discrete events at random points in time t1 < t2 < ⋯ < tJ, and a counting measure given by

$$dN(t) = \sum_{k=1}^{J} \delta(t - t_k), \quad\text{and}\quad N(t) = \int_0^t dN(u),$$

where δ(·) is the Dirac measure. The Conditional Intensity Function (CIF) for this process, denoted by λ(t|Ht), is defined as

$$\lambda(t|H_t) := \lim_{\varepsilon \to 0} \frac{P\big(N(t+\varepsilon) - N(t) = 1 \,\big|\, H_t\big)}{\varepsilon}, \tag{1}$$

where Ht denotes the history of the process as well as the covariates up to time t. The CIF can be interpreted as the instantaneous spiking rate given the history of the process and the covariates. A point process with a CIF given by λ(t|Ht) is defined as:

1. Given Ht, the increments of N(t) over disjoint intervals are conditionally mutually independent.

2. For any interval (t1, t2], the increment N(t2) − N(t1) is a Poisson random variable with probability distribution

$$P\big(N(t_2) - N(t_1) = k\big) = \frac{\Big(\int_{t_1}^{t_2} \lambda(t|H_t)\, dt\Big)^{k}\, e^{-\int_{t_1}^{t_2} \lambda(t|H_t)\, dt}}{k!}.$$

A point process model is fully characterized by its CIF. For instance, λ(t|Ht) = λ corresponds to the homogeneous Poisson process with rate λ. A discretized version of this process can be obtained by binning N(t) into bins of length Δ within the observation interval, that is

$$n_t := N(t\Delta) - N\big((t-1)\Delta\big), \qquad t = 1, 2, \cdots, T, \tag{2}$$

where Δ denotes the bin length. Throughout this paper, {nt}Tt=1 will be considered as the observed spiking sequence, which will be used for estimation purposes. Also, by approximating Eq. (1) for small Δ, and defining λt := λ(tΔ|HtΔ), we have:

$$P(n_t = 0) = 1 - \lambda_t \Delta + o(\Delta), \quad P(n_t = 1) = \lambda_t \Delta + o(\Delta), \quad P(n_t \ge 2) = o(\Delta). \tag{3}$$

In discrete time, the orderliness of the process is equivalent to the requirement that, with high probability, no more than one event falls into any given bin. In practice, this can always be achieved by choosing Δ small enough. An immediate consequence of Eq. (3) is that {nt} can be approximated by a sequence of Bernoulli random variables with success probabilities {λtΔ}.
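
The Bernoulli approximation above can be simulated directly: draw each nt as a Bernoulli variable with success probability λtΔ. A minimal NumPy sketch follows; the rate profile and numerical values are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: T bins of length Delta with a slowly varying rate.
Delta = 0.001                               # 1 ms bins (illustrative value)
T = 2000
t = np.arange(T) * Delta
lam = 20.0 + 10.0 * np.sin(2 * np.pi * t)   # rate in spikes/s
p = lam * Delta                             # success probabilities lambda_t * Delta
n = rng.binomial(1, p)                      # Bernoulli approximation of the spike train
```

Choosing Δ small keeps λtΔ well below one, so the probability of two events in a bin is negligible, consistent with the orderliness requirement.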

A popular class of models for the CIF is given by Generalized Linear Models (GLM). In its general form, a GLM consists of two main components: an observation model (which is given by (3) in this paper) and an equation expressing some (possibly nonlinear) function of the observation mean as a linear combination of the covariates. In neuronal systems, the covariates consist of extrinsic covariates (e.g., neural stimuli) as well as intrinsic covariates (e.g., the history of the process). In this paper, we only consider GLM models with purely extrinsic covariates, although most of our results can be generalized to incorporate intrinsic covariates as well.

Let st denote the stimulus at time bin t, θ denote the vector of stimulus modulation parameters, and μ denote the baseline firing rate parameter. We adopt a logistic regression model for the CIF as follows:

$$\operatorname{logit}(\lambda_t \Delta) := \log\!\Big(\frac{\lambda_t \Delta}{1 - \lambda_t \Delta}\Big) = \mu + \sum_{i=0}^{M-2} \theta_i\, s_{t-i}. \tag{4}$$

By defining ω := [μ, θ0, θ1, ⋯, θM−2]′ and xt := [1, st, st−1, ⋯, st−M+2]′, we can equivalently write:

$$\lambda_t \Delta = \operatorname{logit}^{-1}(\omega' x_t) := \frac{\exp(\omega' x_t)}{1 + \exp(\omega' x_t)}. \tag{5}$$

The model above is also known as the logistic-link CIF model. Another popular model in the computational neuroscience literature is the log-link model, where λtΔ = exp(ω′xt). The significance of the logistic-link model is that it maps the real line to the unit probability interval [0, 1], making it a feasible model for describing the statistics of binary events independent of the scaling of the covariates and modulation parameters.

Despite capturing the stimulus dependence in quite a general form, the GLM model in (5) represents a static model. We therefore generalize this model to the dynamic setting by allowing temporal variability of the modulation parameters:

$$\lambda_t \Delta = \operatorname{logit}^{-1}(\omega_t' x_t) = \frac{\exp(\omega_t' x_t)}{1 + \exp(\omega_t' x_t)}, \tag{6}$$

where ωt represents the time-varying parameter vector at time t. Throughout the rest of the paper, we refer to xt and ωt as the covariate vector and the modulation parameter vector at time t, respectively.

In our applications of interest, the modulation parameter vector exhibits a degree of sparsity [23, 24]. That is, only certain components in the stimulus modulation have significant contribution in determining the statistics of the process. These components can be thought of as the preferred or intrinsic tuning features of the underlying neuron. To be more precise, for a sparsity level L, we denote by suppL(ω) the support of the L largest elements of ω in absolute value, and by ωL the best L-term approximation to ω. We also define

$$\sigma_L(\omega) := \|\omega - \omega_L\|_1 \tag{7}$$

to capture the compressibility of the parameter vector ω. Recall that for p ≥ 1, the ℓp-norm of ω is defined as ∥ω∥p := (Σi |ωi|^p)^{1/p}. When σL(ω) = 0, the parameter is called L-sparse, and when σL(ω) is small compared to ∥ω∥1, the parameter is called L-compressible [25].
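
The best L-term approximation and the compressibility measure of Eq. (7) are straightforward to compute. A short NumPy sketch (function names are illustrative):

```python
import numpy as np

def best_L_term(omega, L):
    """Best L-term approximation: keep the L largest-magnitude entries."""
    omega_L = np.zeros_like(omega)
    idx = np.argsort(np.abs(omega))[-L:]   # support of the L largest entries
    omega_L[idx] = omega[idx]
    return omega_L

def sigma_L(omega, L):
    """Compressibility sigma_L(omega) = ||omega - omega_L||_1, Eq. (7)."""
    return np.sum(np.abs(omega - best_L_term(omega, L)))

# An L-sparse vector gives sigma_L = 0; a small sigma_L relative to the
# l1-norm indicates an L-compressible vector.
omega = np.array([2.0, -0.1, 0.0, 1.5, 0.05])
```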

Finally, the main estimation problem of this paper can be stated as follows: given binary observations {nt}Tt=1 and covariates {xt}Tt=1 from a point process with a CIF given by Eq. (6), the goal is to estimate the M-dimensional L-compressible parameter vectors {ωt} in an online and stable fashion.

## III Main Results

In this section, we will first describe the construction of an appropriate objective function for addressing our main estimation problem. We will then present a rigorous analysis of the maximizers of the objective function, which extends the results of CS to our setting. Next, we will introduce two adaptive filters to recursively maximize the objective function based on proximal gradient techniques. Finally, we will outline how statistical confidence bounds can also be constructed in a recursive fashion for our estimates.

### III-A ℓ1-Regularized Exponentially Weighted Maximum Likelihood (ML)

Before proceeding with the construction of the objective function, we need to introduce more notational conventions. In order to have a framework allowing multi-timescale dynamics, we consider piece-wise constant dynamics for the parameter vector. That is, we assume that ωt remains constant over consecutive windows of arbitrary length W samples, for some integer W ≥ 1. By segmenting the corresponding spiking data into windows of length W samples each, we assume that the CIF for each time point within window k is governed by ωk, for k = 1, 2, ⋯, K. Note that the total number of spiking samples is assumed to be an integer multiple of the window size W, without loss of generality.

Invoking the Bernoulli approximation to the Poisson statistics for small Δ, the log-likelihood of the observation nt at time t can be expressed as:

$$\log p(n_t) \approx n_t \log(\lambda_t \Delta) + (1 - n_t) \log(1 - \lambda_t \Delta) = n_t\, (x_t' \omega_t) - \log\big(1 + \exp(x_t' \omega_t)\big). \tag{8}$$

Assuming conditional independence of the spiking events, the joint log-likelihood of the observations within window k is given by:

$$L_k(\omega_k) := \sum_{j=1}^{W} n_{(k-1)W+j}\, x_{(k-1)W+j}'\, \omega_k - \log\big(1 + \exp(x_{(k-1)W+j}'\, \omega_k)\big). \tag{9}$$

In order to explicitly enforce adaptivity in the log-likelihood function, we adopt the forgetting factor mechanism of the RLS algorithm, where the log-likelihood of each window is exponentially weighted, regressively in time, with a forgetting factor β. That is, the effective data log-likelihood up to and including window k is taken to be:

$$L_\beta(\omega_k) := \sum_{i=1}^{k} \beta^{\,k-i} L_i(\omega_k), \tag{10}$$

for some 0 < β ≤ 1. Note that for β = 1, Lβ(ωk) coincides with the natural data log-likelihood. Moreover, if we replace the Bernoulli log-likelihood with the Gaussian log-likelihood, then Lβ(ωk) coincides with the conventional RLS objective function.

Next, in order to explicitly enforce sparsity, we adopt the ℓ1-regularization mechanism of CS. That is, at window k, we seek an estimate of the form:

$$\hat{\omega}_k = \operatorname*{arg\,max}_{\omega_k} \Big\{ L_\beta(\omega_k) - \gamma \|\omega_k\|_1 \Big\}, \tag{11}$$

where γ > 0 is a regularization parameter controlling the trade-off between the log-likelihood fit and the sparsity of the estimated parameters. Our theoretical analysis in the next subsection reveals appropriate choices for γ, and the trade-offs therein.
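
To make the objective concrete, the exponentially weighted, ℓ1-regularized objective of Eqs. (9)–(11) can be evaluated in a few lines. A minimal NumPy sketch, assuming the data are already segmented into windows (function names are illustrative):

```python
import numpy as np

def window_loglik(X_k, n_k, omega):
    """Bernoulli log-likelihood of one window, Eq. (9):
    sum_j n_j * x_j' omega - log(1 + exp(x_j' omega))."""
    u = X_k @ omega
    return np.sum(n_k * u - np.logaddexp(0.0, u))   # logaddexp is stable

def objective(X_wins, n_wins, omega, beta, gamma):
    """Exponentially weighted, l1-regularized objective of Eq. (11):
    sum_i beta^{k-i} L_i(omega) - gamma * ||omega||_1."""
    k = len(X_wins)
    L_beta = sum(beta ** (k - 1 - i) * window_loglik(X_wins[i], n_wins[i], omega)
                 for i in range(k))
    return L_beta - gamma * np.sum(np.abs(omega))
```

A batch solver could maximize this function directly; the recursive filters of Section III-C avoid re-evaluating the full sum at every window.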

### III-B Stability Analysis of the Objective Function

In order to quantify the trade-offs involving our choice of the objective function in Eq. (11), we proceed in the tradition of the performance analysis results of the RLS algorithm [13] by characterizing the properties of the estimates in a stationary environment where ωk = ω for all k. Our analysis, however, is quite general and avoids ad hoc assumptions such as direct averaging or covariate independence which are usually invoked in the analysis of least squares problems.

We need to make the following technical assumptions for our analysis:

1) The stimulus sequence {st} consists of independent (but not necessarily identically distributed) random variables with variance σs², which are uniformly bounded by a constant in absolute value. Note that with this assumption, two successive covariate vectors, say xt and xt+1, are highly dependent, as they have M − 2 random variables in common. Hence, the independence assumption used in studying least squares problems is violated.

2) We further assume that for all times t, pmin ≤ λtΔ ≤ pmax, for some constants 0 < pmin ≤ pmax < 1; i.e., the probability of spiking does not reach its extremal values of 0 and 1, but can get arbitrarily close. This assumption can be realized due to the boundedness of the covariates and appropriate normalization of the stimulus modulation coefficients, and does not result in any practical loss of generality.

We have the following theoretical result regarding the stability of the maximizers of the objective function:

###### Theorem 1

Suppose that binary observations from a point process with a CIF given by Eq. (6) are given over windows of length W each. Consider a stationary environment with ωk = ω for all k, and suppose that ω is L-compressible. Then, under assumptions (1) and (2), for a fixed positive constant d, there exist constants C, βmin, and γ0 such that for β ≥ βmin, and a choice of γ = γ0 √((1−β) log M), any solution ω̂ to (11) satisfies the bound

$$\|\hat{\omega} - \omega\|_2 \le C \sqrt{(1-\beta)\, L \log M} + \sqrt{C\, \sigma_L(\omega)}\ \sqrt[4]{(1-\beta)\, L \log M},$$

with probability at least 1 − O(M^{−d}). The constants C, βmin, and γ0 are only functions of the model constants in assumptions (1) and (2) (e.g., σs², pmin, and pmax) and d, and are explicitly given in Appendix A.

*Proof:* The proof is given in Appendix A.

Remarks. The result of Theorem 1 has four major implications. First, the error bound scales with L, the sparsity level, as opposed to M, the ambient dimension of the parameter vector, which is consistent with results from CS, and results in the robustness of the estimate when the underlying parameter is sparse. Note that the bound holds for general non-sparse ω, but is sharpest when σL(ω) is negligible, i.e., when the parameter vector is nearly L-sparse.

Second, the theorem prescribes a lower bound on the forgetting factor β akin to the bounds obtained in CS theory for the total number of observations. For instance, the result of [26] for CS under Toeplitz sensing measurements for the linear model requires O(L² log M) measurements to achieve a similar scaling of the error bound. In our case, the role of the number of measurements is transferred to the forgetting factor by taking 1/(1−β) as the effective length of the measurements. In the absence of the forgetting factor (β = 1), by a careful limiting process, our results require O(L² log M) measurements. The latter case can be compared to the result of [18] for point process models with independent and identically distributed covariate vectors, which requires O(L log M) measurements for stability. The loss of a factor of L is incurred due to the shift structure, and hence the high dependence, of the covariate vectors in our case, as exemplified in assumption (1).

Third, the theorem reveals the scaling of the regularization parameter γ in terms of β and M. In particular, this scaling is significant as it reveals another role for the forgetting factor mechanism: not only does the forgetting factor allow for adaptivity of the estimates, it also controls the scaling of the ℓ1-regularization term with respect to the log-likelihood term. Fourth, unlike conventional results in the analysis of adaptive filters which concern the expectation of the error in the asymptotic regime, our result holds for a single realization with probability polynomially approaching 1, in the non-asymptotic regime.

Note that the objective function is clearly concave, and assuming that the matrix of the covariate vectors is full-rank, it will be strictly concave with a unique maximizer. However, the result of Theorem 1 does not require uniqueness of the maximizer and holds for any maximizer of the objective function. In the next section, we proceed with the development of recursive filters to track the maximizer of the objective function in the more general time-varying setting.

### III-C Algorithm Development

Several standard optimization techniques, such as interior point methods, can be used to find the maximizer of (11). However, most of these techniques operate in batch mode and do not meet the real-time requirements of the adaptive filtering setting where the observations arrive in a streaming fashion. In order to avoid the increasing runtime complexity and memory requirements of the batch-mode computation, we seek a recursive approach which can perform low-complexity updates in an online fashion upon the arrival of new data in order to form the estimates. To this end, we adopt the proximal gradient approach. A version of the proximal gradient algorithm is given in Appendix B. Each iteration of the algorithm moves the previous iterate along the gradient of the log-likelihood function, which will then pass through a shrinkage operator.

Before describing further details, we introduce a more compact notation for convenience. Let nk denote the vector of observed spikes within window k, for k = 1, 2, ⋯, K. Similarly, let λk denote the vector of CIFs within window k. By extending the domain of logit−1(·) to vectors in a component-wise fashion, we can express λkΔ as:

$$\lambda_k \Delta = \operatorname{logit}^{-1}(X_k\, \omega_k), \tag{12}$$

where Xk is the data matrix of size W × M with rows corresponding to the covariate vectors in window k. Suppose that at window k, we have an iterate denoted by ω̂k^(ℓ), for ℓ = 0, 1, ⋯, R − 1, with R being an integer denoting the total number of iterations. The gradient of Lβ evaluated at ω̂k^(ℓ) can be written as:

$$\nabla_{\omega} L_\beta\big(\hat{\omega}_k^{(\ell)}\big) = \sum_{i=1}^{k} \beta^{\,k-i} X_i'\, \varepsilon_i\big(\hat{\omega}_k^{(\ell)}\big) =: g_k\big(\hat{\omega}_k^{(\ell)}\big), \tag{13}$$

where εi(ω) := ni − logit−1(Xi ω) represents the innovation vector of the point process at window i. The innovation vector can be thought of as the counterpart of the conventional innovation vector in adaptive filtering of linear models. The proximal gradient iteration for the ℓ1-regularization can be written in the compact form:

$$\hat{\omega}_k^{(\ell+1)} = \mathcal{S}_{\gamma\alpha}\Big(\hat{\omega}_k^{(\ell)} + \alpha\, g_k\big(\hat{\omega}_k^{(\ell)}\big)\Big), \tag{14}$$

where Sγα(·) is the element-wise soft thresholding operator at level γα, given in Appendix B. The final estimate at window k is obtained following the Rth iteration, and is denoted by ω̂k := ω̂k^(R). In order to achieve a recursive updating rule for gk, we can rewrite Eq. (13) as:
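
The soft thresholding operator referenced above (defined formally in Appendix B) is the proximal operator of the ℓ1-norm; a one-line NumPy sketch:

```python
import numpy as np

def soft_threshold(v, tau):
    """Element-wise soft thresholding S_tau(v) = sign(v) * max(|v| - tau, 0),
    the proximal operator of tau * ||.||_1 used in the shrinkage step (14)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
```

One proximal gradient step then reads `omega_new = soft_threshold(omega + alpha * g, gamma * alpha)`, with step size α and regularization level γ.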

$$g_k\big(\hat{\omega}_k^{(\ell)}\big) = \beta\, g_{k-1}\big(\hat{\omega}_k^{(\ell)}\big) + X_k'\, \varepsilon_k\big(\hat{\omega}_k^{(\ell)}\big). \tag{15}$$

However, in an adaptive setting, we only have access to the values of the past gradients evaluated at the previous estimates ω̂1, ω̂2, ⋯, ω̂k−1! In order to turn Eq. (15) into a fully recursive updating rule, all the previous CIF vectors would have to be recalculated at the most recent iterate ω̂k^(ℓ). In order to overcome this computational burden, we exploit the smoothness of the logistic function and employ the Taylor series expansion of the CIF to approximate the required recursive update. In what follows, we consider the zeroth- and first-order expansions, which result in two distinct, yet fully recursive, updating rules for Eq. (15).

Zeroth Order Expansion: By retaining only the first term in the Taylor series expansion of the CIF around the previous estimates ω̂i, we get:

$$\varepsilon_i\big(\hat{\omega}_k^{(\ell)}\big) \approx \varepsilon_i(\hat{\omega}_i), \qquad i = 1, 2, \cdots, k-1, \tag{16}$$

where ω̂i denotes the final estimate at window i. Substituting this approximation in Eq. (13), we can express the zeroth-order approximation to the gradient at window k, denoted by g⁰k, as:

$$g_k^0\big(\hat{\omega}_k^{(\ell)}\big) = \sum_{i=1}^{k-1} \beta^{\,k-i} X_i'\, \varepsilon_i(\hat{\omega}_i) + X_k'\, \varepsilon_k\big(\hat{\omega}_k^{(\ell)}\big). \tag{17}$$

It is then straightforward to obtain a recursive form as:

$$g_k^0\big(\hat{\omega}_k^{(\ell)}\big) = \beta\, g_{k-1}^0(\hat{\omega}_{k-1}) + X_k'\, \varepsilon_k\big(\hat{\omega}_k^{(\ell)}\big). \tag{18}$$

The shrinkage step will then be given by:

$$\hat{\omega}_k^{(\ell+1)} = \mathcal{S}_{\gamma\alpha}\Big(\hat{\omega}_k^{(\ell)} + \alpha\, g_k^0\big(\hat{\omega}_k^{(\ell)}\big)\Big). \tag{19}$$

We refer to the resulting filter as the ℓ1-regularized Point Process Filter of the Zeroth Order (ℓ1-PPF0). A pseudo-code is given in Algorithm 1.
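
The zeroth-order recursion of Eqs. (17)–(19) can be sketched in a few lines. This is not the paper's Algorithm 1, but a minimal NumPy illustration of the idea under the stated approximation: the discounted gradient history is frozen at the previous windows' final estimates, and only the current window's innovation is re-evaluated at each iterate (all names and default values are illustrative):

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_ppf0(X_wins, n_wins, beta=0.98, gamma=0.1, alpha=0.1, R=1):
    """Sketch of the zeroth-order l1-regularized point process filter:
    g0_k = beta * g0_{k-1} + X_k' eps_k(omega), Eq. (18), followed by the
    shrinkage step of Eq. (19), with R iterations per window."""
    M = X_wins[0].shape[1]
    omega = np.zeros(M)
    g_hist = np.zeros(M)                 # beta-discounted gradient history
    estimates = []
    for X_k, n_k in zip(X_wins, n_wins):
        for _ in range(R):               # R proximal-gradient iterations
            eps_k = n_k - logistic(X_k @ omega)    # innovation vector
            g0 = beta * g_hist + X_k.T @ eps_k
            omega = soft_threshold(omega + alpha * g0, gamma * alpha)
        eps_k = n_k - logistic(X_k @ omega)
        g_hist = beta * g_hist + X_k.T @ eps_k     # freeze at final iterate
        estimates.append(omega.copy())
    return np.array(estimates)
```

Each window costs O(WM) operations, reflecting the linear-in-M complexity discussed in the Remark below Algorithm 2.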

First Order Expansion: If, instead, we retain the first two terms in the Taylor expansion, Eq. (16) is replaced by:

$$\varepsilon_i\big(\hat{\omega}_k^{(\ell)}\big) \approx \varepsilon_i(\hat{\omega}_i) - \Lambda_i(\hat{\omega}_i)\, X_i\, \big(\hat{\omega}_k^{(\ell)} - \hat{\omega}_i\big), \tag{20}$$

where Λi(ω) is a diagonal matrix whose jth diagonal element is given by the derivative of the logistic function, λ(i−1)W+jΔ (1 − λ(i−1)W+jΔ), evaluated at ω. Using the first-order approximation above, we can improve the resulting approximation to the gradient, denoted by g¹k, as:

$$g_k^1\big(\hat{\omega}_k^{(\ell)}\big) = \sum_{i=1}^{k} \beta^{\,k-i} X_i' \Big( \varepsilon_i(\hat{\omega}_i) - \Lambda_i(\hat{\omega}_i)\, X_i\, \big(\hat{\omega}_k^{(\ell)} - \hat{\omega}_i\big) \Big).$$

By defining:

$$u_k^{(\ell)} := \sum_{i=1}^{k} \beta^{\,k-i} X_i' \big( \varepsilon_i(\hat{\omega}_i) + \Lambda_i(\hat{\omega}_i)\, X_i\, \hat{\omega}_i \big), \tag{21}$$

$$B_k^{(\ell)} := \sum_{i=1}^{k} \beta^{\,k-i} X_i'\, \Lambda_i(\hat{\omega}_i)\, X_i, \tag{22}$$

we can express g¹k compactly as:

$$g_k^1\big(\hat{\omega}_k^{(\ell)}\big) = u_k^{(\ell)} - B_k^{(\ell)}\, \hat{\omega}_k^{(\ell)}. \tag{23}$$

It is then straightforward to check that both u_k^(ℓ) and B_k^(ℓ) can be updated recursively [19] as:

$$u_k^{(\ell)} = \beta\, u_{k-1}^{(R)} + X_k' \big( \varepsilon_k\big(\hat{\omega}_k^{(\ell)}\big) + \Lambda_k\big(\hat{\omega}_k^{(\ell)}\big)\, X_k\, \hat{\omega}_k^{(\ell)} \big),$$

$$B_k^{(\ell)} = \beta\, B_{k-1}^{(R)} + X_k'\, \Lambda_k\big(\hat{\omega}_k^{(\ell)}\big)\, X_k.$$

Note that the update rules for both u_k^(ℓ) and B_k^(ℓ) involve simple rank-W operations. The shrinkage step is then given by:

$$\hat{\omega}_k^{(\ell+1)} = \mathcal{S}_{\gamma\alpha}\Big(\hat{\omega}_k^{(\ell)} + \alpha\, g_k^1\big(\hat{\omega}_k^{(\ell)}\big)\Big). \tag{24}$$

We refer to the resulting filter as the ℓ1-regularized Point Process Filter of the First Order (ℓ1-PPF1). A pseudo-code is given in Algorithm 2.

Remark. The computational complexities of the ℓ1-PPF0 and ℓ1-PPF1 algorithms can be shown to be linear and quadratic in M per iteration, respectively. Our results in Section IV will reveal that both filters outperform existing filters of the same respective complexity. Furthermore, ℓ1-PPF1 exhibits superior performance over ℓ1-PPF0, as expected, although at the cost of a factor of M in computational complexity per iteration.
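
For illustration, one window of the first-order recursion of Eqs. (21)–(24) can be sketched as follows. This is a minimal NumPy illustration (not the paper's Algorithm 2; names are illustrative), showing that each window only requires rank-W updates of u and B:

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ppf1_window_update(u, B, X_k, n_k, omega, beta, gamma, alpha, R=1):
    """One window of the first-order filter: rank-W recursive updates of
    u_k and B_k, the gradient of Eq. (23), and the shrinkage step (24)."""
    for _ in range(R):
        lam = logistic(X_k @ omega)          # lambda * Delta within window k
        eps = n_k - lam                      # innovation vector
        Lam = np.diag(lam * (1.0 - lam))     # derivative of the logistic link
        u_k = beta * u + X_k.T @ (eps + Lam @ X_k @ omega)
        B_k = beta * B + X_k.T @ Lam @ X_k
        g1 = u_k - B_k @ omega               # first-order gradient, Eq. (23)
        omega = soft_threshold(omega + alpha * g1, gamma * alpha)
    return u_k, B_k, omega
```

Maintaining the M × M matrix B is what makes ℓ1-PPF1 quadratic in M per iteration, versus the linear cost of the zeroth-order filter.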

### III-D Constructing Confidence Intervals

Characterizing the statistical confidence bounds for the estimates is of utmost importance in neural data analysis, as it allows one to test the validity of various hypotheses. Although construction of confidence bounds for linear models in the absence of regularization is well understood and widely applied, regularized ML estimates are usually deemed point estimates for which the construction of statistical confidence regions is not straightforward. A series of remarkable results in high-dimensional statistics [27, 28, 29] have recently addressed this issue by providing techniques to construct confidence intervals for ℓ1-regularized ML estimates of linear models as well as GLMs. These approaches are based on a careful inspection of the Karush-Kuhn-Tucker (KKT) conditions for the regularized estimates, which admits a procedure to decompose the estimates into a bias term plus an asymptotically Gaussian term (referred to as 'de-sparsifying' in [28]), which can be computed using a nodewise regression [30] of the covariates.

In what follows, we give a brief description of how the methods of [28] apply to our setting, and leave the details to Appendix C. Using the result of [28], the estimate ω̂k as the maximizer of (11) can be decomposed as:

$$\hat{\omega}_k = \hat{\Theta}_k\, g_k(\hat{\omega}_k) + \hat{w}_k, \tag{25}$$

where Θ̂k is an approximate inverse to the Hessian of Lβ evaluated at ω̂k, gk(ω̂k) is the gradient of Lβ previously defined in Eq. (13), and ŵk is an unbiased and asymptotically Gaussian random vector with a covariance matrix of Θ̂k Gk(ω̂k) Θ̂k′, with

$$G_k(\hat{\omega}_k) := \sum_{i=1}^{k} \beta^{\,2(k-i)}\, X_i'\, \varepsilon_i(\hat{\omega}_k)\, \varepsilon_i(\hat{\omega}_k)'\, X_i. \tag{26}$$

The first term in Eq. (25) is a bias term which can be directly computed given Θ̂k. Given the covariance matrix, statistical confidence bounds for the second term at desired levels can be constructed in a standard way. The main technical issue in the aforementioned procedure in our setting is the computation of Θ̂k in a recursive fashion. Since the rows of Θ̂k are computed using ℓ1-regularized least squares, we use the SPARLS algorithm [19] as an efficient method to carry out the computation in a recursive fashion. Algorithm 3 summarizes the recursive computation of confidence intervals for the components of ω̂k.
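
Once Θ̂k, the gradient, and Gk are available, the interval construction itself is a standard Gaussian computation. A minimal NumPy sketch of the de-biasing step (following the de-sparsified-lasso recipe of [28] in its gradient-correction form; names and the sign convention of the correction are our illustrative assumptions, not the paper's Algorithm 3):

```python
import numpy as np

def confidence_intervals(omega_hat, Theta, g, G, z=1.96):
    """Component-wise approximate confidence intervals for a de-biased
    l1-regularized estimate: correct the shrinkage bias with Theta @ g,
    then use the sandwich covariance Theta G Theta' (z=1.96 for ~95%)."""
    debiased = omega_hat + Theta @ g            # de-sparsified estimate
    cov = Theta @ G @ Theta.T                   # asymptotic covariance
    half_width = z * np.sqrt(np.diag(cov))
    return debiased - half_width, debiased + half_width
```

In the recursive setting, Θ̂k and Gk are themselves updated online (via SPARLS and Eq. (26)), so each window only refreshes these inputs before reading off the intervals.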

## IV Applications

In this section, we will apply the proposed algorithms to simulated data as well as real spiking data from the ferret primary auditory cortex. In our simulation studies, we compare the performance of our proposed filters with two of the state-of-the-art point process filters, namely the steepest descent point process filter (SDPPF) [7] and the stochastic state point process filter (SSPPF) [14]. These adaptive filters are based on approximate solutions to the Chapman-Kolmogorov forward equation obtained by a steepest descent and a Gaussian approximation procedure, respectively.

### IV-A Simulation Study 1: MSE and Sparse Recovery Learning Curves

First, we consider a stationary environment where the parameter vector ω is constant over time. We fix the bin size Δ, the window size W, and the total observation window. For each realization, we draw a sparse parameter vector of fixed length M and sparsity L. The support and values of the nonzero components of ω are chosen randomly, and the values are normalized. The stimulus input sequence is drawn from an i.i.d. Gaussian distribution. The binary spike train is generated as a single realization of conditionally independent Bernoulli trials with success rate λtΔ. The stimulus variance is chosen small enough that the average spiking rate remains low, ensuring that the Bernoulli approximation is valid. All the simulations are run with a fixed number of iterations per time step. The step size α is chosen as described in Appendix B.

For a given pair of parameters, we select an optimal value for the regularization parameter by performing a two-fold even-odd cross validation procedure: first, the data are split into two sets of even and odd samples in an interleaved manner. Then, one set is used as the training set for estimation of the parameter vector and the other is used to assess the goodness-of-fit of the estimates with respect to the log-likelihood of the observations. We repeat the process switching the role of the two sets and take the average as the overall measure of fit.
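
The even-odd cross-validation scheme described above amounts to an interleaved two-fold split scored by held-out log-likelihood. A minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def even_odd_split(X, n):
    """Interleaved two-fold split: even-indexed and odd-indexed samples.
    Fit on one fold, score the held-out log-likelihood on the other,
    then swap the folds and average the two scores."""
    return (X[0::2], n[0::2]), (X[1::2], n[1::2])

def heldout_loglik(X, n, omega):
    """Bernoulli log-likelihood used as the goodness-of-fit score."""
    u = X @ omega
    return np.sum(n * u - np.logaddexp(0.0, u))
```

A candidate γ grid would then be scanned, keeping the γ with the highest averaged held-out score.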

Let E[·] denote the averaging operator with respect to realizations. We consider two performance metrics: the normalized mean squared error (MSE), defined as the averaged squared estimation error normalized by the energy of the true parameter vector, to evaluate MSE performance at each time step; and the out-of-support energy, defined as the averaged energy of the estimate outside the true support, to serve as a sparsity metric (SPM). Ideally, the SPM must be equal to zero at all times. The averaging is carried out over a sufficiently large number of runs.
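
For a single realization, the two metrics can be computed as follows (a minimal NumPy sketch of the standard definitions implied by the metric names; averaging these over realizations yields the learning curves):

```python
import numpy as np

def normalized_mse(omega_hat, omega_true):
    """Squared estimation error normalized by the true parameter energy."""
    return np.sum((omega_hat - omega_true) ** 2) / np.sum(omega_true ** 2)

def out_of_support_energy(omega_hat, omega_true):
    """Sparsity metric (SPM): energy of the estimate off the true support;
    zero for a perfect recovery of the support."""
    off_support = omega_true == 0
    return np.sum(omega_hat[off_support] ** 2)
```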

Figure 1 shows the corresponding learning curves for the four algorithms. According to Figure 1–A, ℓ1-PPF1 achieves the lowest stationary MSE, followed by ℓ1-PPF0. The SSPPF and SDPPF algorithms incur a substantially higher stationary MSE, revealing a significant gap with respect to our proposed filters. Note that this result is consistent with the prediction of Theorem 1 and highlights the MSE gain achieved by ℓ1-regularization, as opposed to plain ML, when the underlying parameter is sparse.

### IV-B Simulation Study 2: Tracking and Goodness-of-fit Performance

In the second simulation scenario, we consider a more realistic setting where ωt evolves in time. Furthermore, as in the case of real data applications, we assume that the support of ω is not available as a performance benchmark, and resort to statistical goodness-of-fit tests. These tests for point process models have been developed as an application of the time-rescaling theorem [31, 32] and consist of the Kolmogorov-Smirnov (KS) test for assessing the conditional intensity estimation accuracy, and the Autocovariance Function (ACF) test for assessing the conditional independence assumption. We skip the details in the interest of space, and refer the reader to the aforementioned references for a detailed treatment.
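
The core of the time-rescaling KS test can be sketched compactly: the estimated CIF is integrated between successive spikes, and the rescaled intervals should be i.i.d. Uniform(0,1) if the intensity estimate is accurate. A minimal NumPy illustration (the discrete-time integration and the midpoint form of the KS distance are simplifying assumptions; see [31, 32] for the exact procedure):

```python
import numpy as np

def ks_time_rescaling(n, lam_delta):
    """Sketch of the time-rescaling KS test: integrate the estimated CIF
    between successive spikes; z_j = 1 - exp(-tau_j) should be i.i.d.
    Uniform(0,1) under a correct CIF estimate."""
    spike_bins = np.flatnonzero(n)
    taus, prev = [], 0
    for s_idx in spike_bins:
        taus.append(np.sum(lam_delta[prev : s_idx + 1]))  # integrated intensity
        prev = s_idx + 1
    z = np.sort(1.0 - np.exp(-np.asarray(taus)))
    J = len(z)
    # KS distance to the Uniform(0,1) CDF (midpoint approximation)
    grid = (np.arange(1, J + 1) - 0.5) / J
    ks_stat = np.max(np.abs(z - grid))
    return ks_stat, 1.36 / np.sqrt(J)    # approximate 95% bound for large J
```

An estimate passes the test when its KS statistic stays below the confidence bound; the ACF test would additionally check that the rescaled intervals are serially uncorrelated.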

As in the previous case, we fix the bin size, window size, and total observation window. The stimulus is generated as in the previous case. For the parameter vector, we choose a fixed baseline rate μ to set the baseline spiking rate, and select a sparse modulation vector with a small support. Halfway through the test, the largest component drops rapidly and linearly to zero within a short window, and remains zero for the rest of the run.

Figure 2 shows the performance of all four algorithms in the aforementioned setting. Each row (A through D) shows the true time-varying parameter vector (dashed traces) as well as the filtered estimates (solid traces) in the left panel. In particular, the gray solid traces show the out-of-support components, which must ideally be equal to zero. The colored hulls show the confidence intervals (note that confidence intervals for SDPPF cannot be directly obtained and require averaging over multiple realizations). The middle and right panels show the KS and ACF test results, respectively. For the quadratic algorithms ℓ1-PPF1 and SSPPF, a common forgetting factor is chosen. The regularization parameter γ for ℓ1-PPF1 is obtained by the aforementioned two-fold even-odd cross-validation. For the first-order algorithm ℓ1-PPF0, a smaller forgetting factor is chosen to ensure stability, with γ again set by cross-validation. These settings ensure that all the algorithms are tuned at their optimal operating points for fairness of comparison.

Figures 2–A and 2–B reveal three striking performance gaps between the two second-order algorithms (which share the same computational complexity, quadratic in ): first, the out-of-support components (gray traces) of -PPF are significantly smaller than those of SSPPF; second, the confidence regions of -PPF are narrower than those of SSPPF; and third, -PPF fully passes the KS test, while SSPPF does so only marginally. Similarly, a comparison of the two first-order algorithms (which share the same computational complexity, linear in ) in Figures 2–C and 2–D reveals that -PPF significantly suppresses the out-of-support components as compared to SDPPF. Moreover, -PPF provides confidence bounds, which cannot be directly obtained for SDPPF. Finally, -PPF marginally fails the KS test, whereas SDPPF fails it significantly. Both algorithms fail the ACF test, which shows that the second-order corrections embedded in -PPF and SSPPF are necessary to achieve a better goodness-of-fit, at the price of higher computational complexity.

We also inspect the estimated firing probability for the four algorithms in Figure 3. In addition, we include the probability estimated by the normalized reverse correlation (NRC) method, which is commonly used in neural data analysis and fits the modulation parameters using a linear model. The top row of Figure 3 shows the true spiking probability (blue solid trace) and the resulting spikes (black vertical lines). In the subsequent rows (B through F), the true and estimated probabilities are shown by dashed blue and solid red traces, respectively. A comparison of all the rows reveals that -PPF and -PPF outperform SSPPF and SDPPF, respectively, in terms of estimating the true probability. The NRC method is inferior to the preceding four algorithms, and results in negative estimates of the probability due to its use of a linear (as opposed to logistic) model.
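The negative-probability artifact of linear fits is easy to reproduce. In the sketch below, an ordinary least-squares regression of spikes on the stimulus stands in for the linear-model aspect of NRC (the actual NRC additionally normalizes by the stimulus autocorrelation); the data are synthetic and all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
K, M = 2000, 5
w = np.array([1.2, 0.0, 0.0, -0.8, 0.0])   # sparse true modulation

# Spikes generated from a logistic model with a low baseline rate
X = rng.standard_normal((K, M))
p_true = 1.0 / (1.0 + np.exp(-(-2.0 + X @ w)))
n = rng.binomial(1, p_true)

# Linear (reverse-correlation-style) fit: least squares of spikes on stimulus
A = np.column_stack([np.ones(K), X])
coef, *_ = np.linalg.lstsq(A, n, rcond=None)
p_lin = A @ coef

# A logistic model keeps probabilities in (0, 1); the linear fit does not
print((p_lin < 0).mean())   # fraction of negative "probability" estimates
```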

### IV-C Application to Real Data: Dynamic Analysis of Spectrotemporal Receptive Field Plasticity

The responses of neurons in the primary auditory cortex (A1) can be characterized by their spectrotemporal receptive fields (STRFs), where each neuron is tuned to a specific region in the time-frequency plane, and spikes significantly only when the acoustic stimulus contains spectrotemporal content matching its tuning region [2] (see Figure 4, top row, leftmost panel). Several experimental studies have revealed that receptive fields undergo rapid changes in their characteristics during attentive behavior in order to capture salient stimulus modulations [22, 33, 34]. In [22], it is suggested that this rapid plasticity plays a significant role in the functional processes underlying active listening. However, most of the widely-used estimation techniques (e.g., normalized reverse correlation) provide static estimates of the receptive field with a temporal resolution on the order of minutes. Moreover, they do not systematically capture the inherent sparsity manifested in the receptive field characteristics.

In the context of our model, the STRF can be modeled as an -dimensional matrix, where and denote the number of time lags and frequency bands, respectively. By vectorizing this matrix, we obtain an -dimensional vector at window , where . Augmenting the baseline rate parameter , we can model the activity of the A1 neurons using the logistic CIF with a parameter . The stimulus vector at time , is given by the vectorized version of the spectrogram of the acoustic stimulus with frequency bands and lags. In order to capture the sparsity of the STRF in the time-frequency plane, we further represent over a Gabor time-frequency dictionary consisting of Gaussian windows centered around a regular subset of the time-frequency plane. That is, for , where is the dictionary matrix and is the sparse representation of the STRF. The estimation procedures of this paper can be applied to , by absorbing the dictionary matrix into the data matrix at window .
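The dictionary construction described above can be sketched as follows: each column of the dictionary matrix is a vectorized two-dimensional Gaussian atom centered on a regular subgrid of the lag-frequency plane, so that the STRF is represented as the dictionary times a sparse coefficient vector. The dimensions, spacing, and atom widths below are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def gabor_dictionary(T=20, F=25, spacing=4, sigma_scale=1.0):
    """Build a dictionary whose columns are vectorized 2-D Gaussian atoms
    on the T x F lag-frequency grid (illustrative; spacing and widths are
    placeholders)."""
    tt, ff = np.meshgrid(np.arange(T), np.arange(F), indexing="ij")
    centers = [(t0, f0) for t0 in range(0, T, spacing)
                        for f0 in range(0, F, spacing)]
    var = (sigma_scale * spacing) ** 2
    atoms = []
    for t0, f0 in centers:
        a = np.exp(-((tt - t0) ** 2 + (ff - f0) ** 2) / (2.0 * var))
        atoms.append((a / np.linalg.norm(a)).ravel())  # unit-norm column
    return np.column_stack(atoms)                      # shape (T*F, n_atoms)

# STRF omega = G @ theta for a sparse coefficient vector theta
G = gabor_dictionary()
theta = np.zeros(G.shape[1])
theta[3] = 1.0
omega = G @ theta    # a single smooth, localized STRF bump
```

Estimation is then carried out on the sparse coefficients, with the dictionary absorbed into the covariates, as described in the text.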

We apply our proposed adaptive filter -PPF to multi-unit spike recordings from the ferret's A1 during a series of passive listening conditions and active auditory task conditions. During each active task, the ferrets attended to the temporal dynamics of the sounds and discriminated the rate of acoustic clicks [33]. The STRFs were estimated from the passive conditions, in which the quiescent animal listened to a series of broadband noise-like acoustic stimuli known as Temporally Orthogonal Ripple Combinations (TORCs). The experiment consisted of 2 active and 11 passive blocks. Within each passive block, 30 TORCs were randomly repeated a total of 4-5 times each. In our analysis, we pool the spiking data corresponding to the same repeated TORC within each block. Therefore, the time axis corresponds to the experiment time modulo repetitions within each block. We discretize the resulting duration of into time bins of size , and segment the data into windows of size samples (). The STRF dimensions are , regularly spanning lags of 1 to and frequency bands of to (in logarithmic scale). The dictionary consists of Gabor atoms, evenly spaced within the STRF domain. Each atom is a two-dimensional Gaussian kernel with a variance of per dimension, where denotes the spacing between the atoms. We selected a forgetting factor of , a step size of , where is the average variance of the spectrogram components, iterations per sample, and a regularization parameter of via two-fold even-odd cross-validation.

Figure 4, top row, depicts five snapshots taken at , corresponding to the end-points of the th passive blocks. The bottom row shows the time-course of four selected points (marked as A through D in the leftmost panel of the top row) of the STRF during the experiment. The STRF snapshots at times and , taken shortly after the two active tasks, respectively, verify the sharpening of the excitatory region () due to the animal's attentive behavior following the active task, as reported in [22]. Moreover, the STRF snapshots at times and reveal the weakening of the excitatory region long after the active task and its return to the pre-active state, highlighting the plasticity of A1 neurons. Previous studies have revealed the STRF dynamics with a resolution on the order of minutes [34]. Our result in Figure 4 provides a temporal resolution on the order of centiseconds (a 3 orders of magnitude increase), while capturing the STRF sparsity in a robust fashion.

## V Concluding Remarks

In this paper, we considered recursive estimation of the time-varying parameter vectors in a logistic regression model for binary time series driven by continuous inputs. To this end, we integrated several techniques from compressed sensing, adaptive filtering, optimization and statistics. We constructed an objective function which enjoys the trackability features of RLS-type algorithms and the sparsifying features of -minimization, and which, unlike the rate-based linear models commonly used to analyze spiking data, takes into account the binary statistics of the observations. We analyzed the maximizers of the objective function in a rigorous fashion, revealing novel trade-offs between various model parameters. We constructed two adaptive filters, with respective linear and quadratic complexity requirements, for recursive maximization of the objective function in an online setting. Moreover, we characterized the statistical confidence regions for our estimates, and devised a recursive procedure to compute them efficiently.

Although we specialized our treatment to logistic statistics and -regularization, our approach to algorithm development has a plug-and-play feature: other GLM link functions (e.g., the log-link), possibly with history-dependent covariates, and other regularization schemes (e.g., re-weighted or group-sparse regularization) can be used instead, resulting in a large class of adaptive filters for sparse point process regression. We tested the performance of our algorithms on simulated as well as experimentally recorded spiking data. Our simulation studies revealed that the proposed filters outperform several existing point process filters. Application of our filters to real data from the ferret primary auditory cortex provided a high-resolution characterization of the time-course of spectrotemporal receptive field plasticity, with a 3 orders of magnitude increase in temporal resolution. Although we focused on auditory neurons, we expect similarly superior performance when our filters are applied to other sensory or motor neurons (e.g., cells in the primary or supplementary motor cortex [35]).

## Appendix A Proof of Theorem 1

The proof is mainly based on the treatment of Negahban et al. [18]. The major difficulty in our case lies in the high inter-dependence of the covariates, which form a Toeplitz structure due to the adaptive filtering setup. We address this issue by adopting techniques from Haupt et al. [26] for dealing with the underlying interdependence. In the process, we also employ concentration inequalities for dependent random variables due to van de Geer [36].

In order to proceed, we adopt the notion of Restricted Strong Convexity (RSC) introduced in [18]. For a log-likelihood that is twice differentiable with respect to , the RSC property of order implies the existence of a quadratic lower bound on the negative log-likelihood:

$$\mathcal{D}L(\Delta,\omega) := -L_{\beta}(\omega+\Delta) + L_{\beta}(\omega) + \Delta'\,\nabla L_{\beta}(\omega) \;\geq\; \kappa\,\|\Delta\|_2^2,$$

for a positive constant and all satisfying:

$$\|\Delta_{S^c}\|_1 \;\leq\; 3\,\|\Delta_S\|_1 + 4\,\sigma_S(\omega). \tag{27}$$

for any index set of cardinality . The following key lemma establishes the RSC for :

###### Lemma 1

Let denote a sequence of covariates and let denote the corresponding logistic parameters. Then, for a fixed positive constant , there exist constants and such that for and the negative log-likelihood satisfies the RSC of order with constant with probability greater than . The constants and are only functions of , , , , , , and are explicitly given in the proof.

{proof}

The proof is inspired by the treatment of Negahban et al. [18]. The major difficulty in our setting is the high interdependence of successive covariates due to the shift structure induced by the adaptive setting, whereas in [18] the matrix of covariates is composed of i.i.d. rows. Using Taylor's theorem, can be written as:

$$\mathcal{D}L(\Delta,\omega) = \sum_{i=1}^{K}\sum_{j=1}^{W} \beta^{K-i}\, \frac{\exp\!\big(\mathbf{x}'_{(i-1)W+j}\,\bar{\omega}\big)}{\big(1+\exp\!\big(\mathbf{x}'_{(i-1)W+j}\,\bar{\omega}\big)\big)^{2}}\, \big(\mathbf{x}'_{(i-1)W+j}\,\Delta\big)^{2}, \tag{28}$$

with $\bar{\omega} = \omega + \gamma\Delta$ for some $\gamma \in [0,1]$. Since by hypothesis , we have:

$$\frac{\exp\!\big(\mathbf{x}'_{(i-1)W+j}\,\bar{\omega}\big)}{\big(1+\exp\!\big(\mathbf{x}'_{(i-1)W+j}\,\bar{\omega}\big)\big)^{2}} \;\geq\; p_{\min}\,(1-p_{\max}). \tag{29}$$

We can therefore further lower bound by:

$$\mathcal{D}L(\Delta,\omega) \;\geq\; p_{\min}(1-p_{\max})\,\sigma^2_{N_\beta}\,\big\{\Delta'\,C_\beta\,\Delta\big\}, \tag{30}$$

where , and

$$C_\beta := \frac{1}{\sigma^2_{N_\beta}}\sum_{i=1}^{K}\sum_{j=1}^{W} \beta^{K-i}\, \mathbf{x}_{(i-1)W+j}\,\mathbf{x}'_{(i-1)W+j}. \tag{31}$$
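The chain of bounds above can be checked numerically. The sketch below evaluates the Taylor-remainder quantity directly for a forgetting-factor-weighted logistic log-likelihood and verifies a quadratic lower bound of this form; since the logistic curvature p(1-p) is unimodal in the logit, its minimum over each segment of logits is attained at an endpoint. The factor 1/2 below comes from the integral form of the Taylor remainder and may be absorbed into constants in the paper's normalization; all data are synthetic and names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def loglik(omega, X, n, w):
    """Forgetting-factor-weighted logistic log-likelihood."""
    u = X @ omega
    return np.sum(w * (n * u - np.log1p(np.exp(u))))

# Random logistic data with weights beta^(K-i)
K, M, beta = 200, 8, 0.99
X = rng.standard_normal((K, M))
omega = rng.normal(0.0, 0.3, M)
n = rng.binomial(1, sigmoid(X @ omega))
w = beta ** (K - 1 - np.arange(K))

# Taylor-remainder quantity DL(Delta, omega)
Delta = rng.normal(0.0, 0.2, M)
grad = X.T @ (w * (n - sigmoid(X @ omega)))   # gradient of the log-likelihood
DL = -loglik(omega + Delta, X, n, w) + loglik(omega, X, n, w) + Delta @ grad

# Quadratic lower bound: p(1-p) is unimodal in the logit, so its minimum
# over each segment [x'omega, x'(omega+Delta)] is attained at an endpoint
u0, u1 = X @ omega, X @ (omega + Delta)
c = np.minimum(sigmoid(u0) * (1 - sigmoid(u0)),
               sigmoid(u1) * (1 - sigmoid(u1)))
bound = 0.5 * np.sum(w * c * (X @ Delta) ** 2)

print(float(DL), float(bound))   # DL should dominate the quadratic bound
```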

Note that the matrix has highly inter-dependent elements due to the Toeplitz structure of the adaptive design. In order to establish the RSC condition, we show the stronger Restricted Eigenvalue (RE) property, which in turn implies RSC [37]. Let be fixed. To this end, we need to bound the eigenvalues of , the restriction of to the subset of rows and columns corresponding to indices in with , for some integer .

Without loss of generality, we assume that the first entry of the covariate vectors is replaced by instead of , for simplicity of presentation in what follows. For , we have:

 (Cβ)m,m′=1σ2NβK∑i=1W−1∑j=0βK