Convolutional dictionary learning based auto-encoders for natural exponential-family distributions

# Convolutional dictionary learning based auto-encoders for natural exponential-family distributions

## Abstract

We introduce a class of auto-encoder neural networks tailored to data from the natural exponential family (e.g., count data). The architectures are inspired by the problem of learning the filters in a convolutional generative model with sparsity constraints, often referred to as convolutional dictionary learning (CDL). Our work is the first to combine ideas from convolutional generative models and deep learning for data that are naturally modeled with a non-Gaussian distribution (e.g., binomial and Poisson). This perspective provides us with a scalable and flexible framework that can be re-purposed for a wide range of tasks and assumptions on the generative model. Specifically, the iterative optimization procedure for solving CDL, an unsupervised task, is mapped to an unfolded and constrained neural network, with iterative adjustments to the inputs to account for the generative distribution. We also show that the framework can easily be extended for discriminative training, appropriate for a supervised task. We demonstrate 1) that fitting the generative model to learn, in an unsupervised fashion, the latent stimulus that underlies neural spiking data leads to better goodness-of-fit compared to other baselines, 2) competitive performance compared to state-of-the-art algorithms for supervised Poisson image denoising, with significantly fewer parameters, and 3) gradient dynamics of shallow binomial auto-encoder.

## 1 Introduction

Learning shift-invariant patterns from a dataset has given rise to work in different communities, most notably in signal processing (SP) and deep learning. In the former, this problem is referred to as convolutional dictionary learning (CDL) [Garcia-Cardona and Wohlberg, 2018]. CDL imposes a linear generative model where the data are generated by a sparse linear combination of shifts of localized patterns. In the latter, convolutional neural networks (NNs) [LeCun et al., 2015] have excelled in identifying shift-invariant patterns.

Recently, the iterative nature of the optimization algorithms for performing CDL has inspired the utilization of NNs as an efficient and scalable alternative, starting with the seminal work of [Gregor and Lecun, 2010], and followed by [Tolooshams et al., 2018, Sreter and Giryes, 2018, Sulam et al., 2019]. Specifically, the iterative steps are expressed as a recurrent NN, and thus solving the optimization simply becomes passing the input through an unrolled NN [Hershey et al., 2014, Monga et al., 2019]. At one end of the spectrum, this perspective, through weight-tying, leads to architectures with significantly fewer parameters than a generic NN. At the other end, by untying the weights, it motivates new architectures that depart, and could not be arrived at, from the generative perspective [Gregor and Lecun, 2010, Sreter and Giryes, 2018].

The majority of the literature at the intersection of generative models and NNs assumes that the data are real-valued and therefore are not appropriate for binary or count-valued data, such as neural spiking data and photon-based images [Yang et al., 2011]. The exception is Poisson image denoising. Nevertheless, several works on Poisson image denoising, arguably the most popular application involving non real-valued data, can be found separately in both communities. In the SP community, the Poisson data likelihood is either explicitly minimized [Salmon et al., 2014, Giryes and Elad, 2014] or used as a penalty term added to the objective of an image denoising problem with Gaussian noise [Ma et al., 2013]. Being rooted in the dictionary learning formalism, these methods operate in an unsupervised manner. Although they yield good denoising performance, their main drawbacks are scalability and computational efficiency.

In the deep learning community, NNs tailored to image denoising [Zhang et al., 2017, Remez et al., 2018], which are reminiscent of residual learning, have shown great performance on Poisson image denoising. However, as these 1) are not designed from the generative model perspective and 2) are supervised learning frameworks, it is unclear how the architecture can be adapted to the classical CDL problem, where the task is unsupervised and the interpretability of the parameters is important. NNs with a generative flavor, namely variational auto-encoders (VAEs), have been extended to utilize non real-valued data [Nazábal et al., 2018, Liang et al., 2018]. However, these architectures cannot be adapted to solve the CDL task.

To address this gap, we make the following contributions:
Auto-encoder inspired by CDL for non real-valued data We introduce a flexible class of auto-encoder (AE) architectures for data from the natural exponential-family that combines the perspectives of generative models and NNs. We term this framework, depicted in Fig. 1, the deep convolutional exponential-family auto-encoder (DCEA).
Unsupervised learning of convolutional patterns We show through simulation that DCEA performs CDL and learns convolutional patterns from binomial observations. We also apply DCEA to real neural spiking data and show that it fits the data better than baselines.
Gradient dynamics of shallow exponential auto-encoder Given some assumptions on the binomial generative model with dense dictionary and “good” initializations, we prove in Theorem 4.1 that shallow exponential auto-encoder (SEA), when trained by gradient descent, recover the dictionary.
Supervised learning framework DCEA, when trained in a supervised manner, achieves similar performance to state-of-the-art algorithms for Poisson image denoising with orders of magnitude fewer parameters compared to baselines, owing to its design based on a generative model.

## 2 Problem formulation

Natural exponential-family distribution For a given observation vector , with mean , we define the log-likelihood of the natural exponential family [McCullagh and Nelder, 1989] as

 logp(y|μ)=f(μ)Ty+g(y)−B(μ), (1)

where we have assumed that, conditioned on , the elements of are independent. The natural exponential family includes a broad family of probability distributions such as the Gaussian, binomial, and Poisson. The functions , , as well as the invertible link function , all depend on the choice of distribution.

Convolutional generative model We assume that is the sum of scaled and time-shifted copies of finite-length filters (dictionary) , each localized, i.e., . We can express in a convolutional form: , where is the convolution operation, and is a train of scaled impulses which we refer to as code vector. Using linear-algebraic notation, , where is a matrix that is the concatenation of Toeplitz (i.e., banded circulant) matrices , , and .

We refer to the input/output domain of as the data and dictionary domains, respectively. We interpret as a time-series and the non-zero elements of as the times when each of the filters are active. When is two-dimensional (2D), i.e., an image, encodes the spatial locations where the filters contribute to its mean .

Exponential Convolutional Dictionary Learning (ECDL) Given observations , we estimate and that minimize the negative log likelihood under the convolutional generative model, subject to sparsity constraints on . We enforce sparsity using the norm, which leads to the non-convex optimization problem

 min{hc}Cc=1{xj}Jj=1J∑j=1l(xj)−(Hxj)Tyj+B(f−1(Hxj))+λ∥xj∥1, (2)

where the regularizer controls the degree of sparsity. A popular approach to deal with the non-convexity is to minimize the objective over one set of variables, while the others are fixed, in an alternating manner, until convergence [Agarwal et al., 2016]. When is being optimized with fixed , we refer to the problem as convolutional sparse coding (CSC). When is being optimized with fixed, we refer to the problem as convolutional dictionary update (CDU).

## 3 Deep convolutional exponential-family auto-encoder

We propose a class of auto-encoder architectures to solve the ECDL problem, which we term deep convolutional exponential-family auto-encoder (DCEA). Specifically, we make a one-to-one connection between the CSC/CDU steps and the encoder/decoder of DCEA depicted in Fig. 1. We focus only on CSC for a single , as the CSC step can be parallelized across examples.

### 3.1 The architecture

Encoder The forward pass of the encoder maps inputs into sparse codes . Given the filters , the encoder solves the -regularized optimization problem from Eq. (2)

 minxjl(xj)+λ∥xj∥1 (3)

in an iterative manner by unfolding iterations of the proximal gradient algorithm [Parikh and Boyd, 2014]. For Gaussian observations, Eq. (3) becomes an -regularized least squares problem, for which several works have unfolded the proximal iteration into a recurrent network [Gregor and Lecun, 2010, Wang et al., 2015, Sreter and Giryes, 2018, Tolooshams et al., 2018].

We use to denote a proximal operator with bias . We consider three operators,

 ReLUb(z) =(z−b)⋅\mathbbm1z≥b (4) HTb(z) =z⋅\mathbbm1|z|≥b Shrinkageb(z) =(z−b)⋅\mathbbm1z≥b+(z+b)⋅\mathbbm1z≤−b,

where is an indicator function. If we constrain the entries of to be non-negative, we use . Otherwise, we use or hard-threshold (HT), . A single iteration of the proximal gradient step is given by

 xjt (5) =Pb(xjt−1+αHT(y−f−1(Hxjt−1)˜yjt)),

where denotes the sparse code after iterations of unfolding, and is the step size of the gradient update. The term is referred to as working observation. The choice of , which we explore next, depends on the generative distribution. We also note that there is a one-to-one mapping between the regularization parameter and the bias of . We treat , and therefore , as hyperparameters that we tune to the desired sparsity level. The matrix effectively computes the correlation between and . Assuming that we unfold times, the output of the encoder is .

The architecture consists of two nonlinear activation functions: to enforce sparsity, and , the inverse of the link function. For Gaussian observations is linear with slope . For other distributions in the natural exponential family, the encoder uses , a mapping from the dictionary domain to the data domain, to transform the input at each iteration into a working observation .

Decoder & Training We apply the decoder to to obtain the linear predictor . This decoder completes the forward pass of the DCEA, also summarized in Algorithm 2 of the Appendix. We use the negative log-likelihood as the loss function applied to the decoder output for updating the dictionary. We train the weights of DCEA, fully specified by the filters , by gradient descent through backpropagation. Note that the penalty is not a function of and is not in the loss function.

### 3.2 Binomial and Poisson generative models

We focus on two representative distributions for the natural exponential family: binomial and Poisson. For the binomial distribution, assumes integer values from to . For the Poisson distribution, can in principle be any non-negative integer values, although this is rare due to the exponential decay of the likelihood for higher-valued observations. Table 1 summarizes the relevant parameters for these distributions.

The fact that binomial and Poisson observations are integer-valued and have limited range, whereas the underlying is real-valued, makes the ECDL challenging. This is compounded by the nonlinearity of , which distorts the error in the data domain, when mapped to the dictionary domain. In comparison, in Gaussian CDL 1) the observations are real-valued and 2) is linear.

This implies that, for successful ECDL, needs to assume a diverse set of integer values. For the binomial distribution, this suggests that should be large. For Poisson, as well as binomial, the maximum of should also be large. This explains why the performance is generally lower in Poisson image denoising for a lower peak, where the peak is defined as the maximum value of  [Giryes and Elad, 2014].

Practical design considerations for architecture As the encoder of DCEA performs iterative proximal gradient steps, we need to ensure that converges. Convergence analysis of ISTA [Beck and Teboulle, 2009] shows that if is convex and has –Lipschitz continuous gradient, which loosely means the Hessian is upper-bounded everywhere by , the choice of guarantees convergence. For the Gaussian distribution, is the largest singular value of  [Daubechies et al., 2004], denoted , and therefore . For the Binomial distribution, , and therefore .

The gradient of the Poisson likelihood is not Lipschitz continuous. Therefore, we cannot set a priori. In practice, the step size at every iteration is determined through back-tracking line search [Boyd and Vandenberghe, 2004], a process which is not trivial to replicate in the forward pass of a NN. To circumvent this problem, instead of adjusting , we enforce an upper bound on to stabilize the forward pass of the encoder. Indeed, we can see that

 ∥HT˜yjt∥2≤∥H∥2∥˜yjt∥2=σ2max(H)∥˜yjt∥2. (6)

Hence, upper bounding is sufficient. In this regard, given , the term can become unbounded from below, so we lower bound it by passing it through an exponential linear unit (Elu) [Clevert et al., 2016].

## 4 Connection to unsupervised/supervised paradigm

We now analyze the connection between DCEA and ECDL. We first examine how the convolutional generative model places constraints on DCEA. Next, we prove that under some assumptions, when training the shallow exponential-family auto-encoder by approximate gradient descent, the network recovers the dictionary corresponding to the binomial generative model. Finally, we explain how constrained DCEA can be modified for a supervised task.

### 4.1 Constrained structure of DCEA

We discuss key points that allow DCEA to perform ECDL.

- Linear 1-layer decoder In ECDL, the only sensible decoder is a one layer decoder comprising and a linear activation. In contrast, the decoder of a typical AE consists of multiple layers along with nonlinear activations.
- Tied weights across encoder and decoder Although our encoder is deep, the same weights ( and ) are repeated across the layers.
- Alternating minimization The forward pass through the encoder performs the CSC step and the backward pass, via backpropagation, performs the CDU step.

### 4.2 Theory for unsupervised CDL task

Consider a SEA, where the encoder is unfolded only once. Recent work [Nguyen et al., 2019] has shown that, using observations from a sparse linear model with a dense dictionary, the Gaussian SEA trained by approximate gradient descent recovers the dictionary. We prove a similar result for the binomial SEA trained with binomial observations with mean , i.e., a sparse nonlinear generative model with a dense dictionary . We assume the following (see Appendix for (A1) - (A12)):

• is -sparse and where each element of S is chosen uniformly at random without replacement.

• where , and .

• has symmetric probability density function, and and where .

• From the forward pass of the encoder, with high probability.

###### Theorem 4.1.

Suppose the generative model satisfies (A1) - (A12). Given infinitely many examples (i.e., ), the binomial SEA with trained by approximate gradient descent followed by normalization using the learning rate of (i.e., ) recovers . More formally, there exists such that at every iteration , .

Theorem 4.1 shows recovery of for the cases when grows faster than and the amplitude of the codes are bounded on the order of . We refer the reader to the Appendix for the implications of this theorem and its interpretation.

### 4.3 Supervised framework

For the supervised paradigm, given the desired output (i.e., clean image, , in the case of image denoising), we relax the DCEA architecture and untie the weights [Sreter and Giryes, 2018, Simon and Elad, 2019] as follows

 xjt=Pb(xjt−1+α(We)T(y−f−1(Wdxjt−1))), (7)

where we still use as the decoder. We use and to denote the filters that associated with and , respectively. Moreover, we train the bias , in contrast to the unsupervised setting. Compared to the DCEA for ECDL, the number of parameters to learn has increased roughly three-fold.

Although the introduction of additional parameters implies the framework is no longer exactly optimizing the parameters of the convolutional generative model, DCEA still maintains the core principles of the convolutional generative model. First, DCEA performs CSC, as are convolutional matrices and ensures sparsity of . Second, the encoder uses , as specified by natural exponential family distributions. Therefore, we allow only a moderate departure from the generative model, to balance the problem formulation and the problem solving mechanism. Indeed, as we show in the Poisson image denoising of Section 5, the denoising performance for DCEA with untied weights is superior to that of DCEA with tied weights.

We note that the constraints can be relaxed further. For instance, 1) the proximal operator can be replaced by a deep NN [Mardani et al., 2018], 2) the inverse link function can be replaced by a NN [Gao et al., 2016], and 3) , , and can be untied across different iterations [Hershey et al., 2014]. These schemes would increase the number of parameters while allowing for more expressivity and improved performance. Nevertheless, as our goal is to maintain the connection to sparsity and the natural exponential family, while keeping the number of parameters small, we do not explore these possibilities in this work.

## 5 Experiments

We apply our framework in three different settings.

- Poisson image denoising (supervised) We evaluate the performance of supervised DCEA in Poisson image denoising and compare it to state-of-the-art algorithms.
- ECDL for simulation (unsupervised) We use simulations to examine how the unsupervised DCEA performs ECDL for binomial data. With access to ground-truth data, we evaluate the accuracy of the learned dictionary.
- ECDL for neural spiking data (unsupervised) Using neural spiking data collected from mice [Temereanca et al., 2008], we perform unsupervised ECDL using DCEA. As is common in the analysis of neural data [Truccolo et al., 2005], we assume a binomial generative model.

### 5.1 Denoising Poisson images

We evaluated the performance of DCEA on Poisson image denoising for various peaks. We used the peak signal-to-noise-ratio (PSNR) as the performance metric. DCEA is trained in a supervised manner on the PASCAL VOC image set [Everingham et al., 2012] containing training images. We used in the encoder, and minimized the loss between the clean image, , and its reconstruction, , such that . We used two test datasets: 1) Set12 (12 images) and 2) BSD68 (68 images) [Martin et al., 2001].

Methods We trained two versions of DCEA to assess whether relaxing the generative model, thus increasing the number of parameters, helps improve the performance: 1) DCEA constrained (DCEA-C), which uses as the convolutional filters and 2) DCEA unconstrained (DCEA-UC), which uses , , and , as suggested in Eq. (7). We used filters of size , where we used convolutions with strides of 7 and followed a similar approach to [Simon and Elad, 2019] to account for all shifts of the image when reconstructing. In terms of the number of parameters, DCEA-C has and DCEA-UC has , where the last terms refer to the bias . We also absorbed into (constrained case) or (unconstrained case).

We unfolded the encoder for iterations. We initialized the filters using draws from a standard Gaussian distribution and used the ADAM optimizer with an initial learning rate of , which is decreased by a factor of 0.8 every 25 epochs, and trained the network for 400 epochs. At every iteration, a random patch is cropped from and normalized to have the maximum value as the desired peak, which is then used as a for generating count-valued Poisson images.

We compared DCEA against the following baselines. For fair comparison, we do not use the binning strategy [Salmon et al., 2014] of these methods, as a pre-processing step.

- Sparse Poisson Dictionary Algorithm (SPDA) This is a patch-based dictionary learning framework [Giryes and Elad, 2014], using the Poisson generative model with the pseudo-norm to learn the dictionary in an unsupervised manner, for a given noisy image. SPDA uses 400 filters of length 400, which results in parameters.
- BM3D + VST BM3D is an image denoising algorithm based on a sparse representation in a transform domain, originally designed for Gaussian noise. This algorithm applies a variance-stabilizing transform (VST) to the Poisson images to make them closer to Gaussian-perturbed images [Makitalo and Foi, 2013].
- Class-agnostic denoising network (CA) This is a denoising residual NN for both Gaussian and Poisson images [Remez et al., 2018], trained in a supervised manner.

Results Table 2 shows that DCEA outperforms SPDA and BM3D + VST, and shows competitive performance against CA, with one order of magnitude fewer parameters. We summarize a few additional points from this experiment.
- SPDA vs. DCEA-UC DCEA-UC is significantly more efficient computationally compared to SPDA. SPDA takes several hours, or days in some cases, to denoise a single Poisson noisy image whereas, upon training, DCEA performs denoising in less than a second.
- CA vs. DCEA-UC DCEA-UC achieves competitive performance against CA, despite an order of magnitude difference in the number of parameters (K for CA vs. K for DCEA). We conjecture that given the same number of parameters, the performance of DCEA-UC would improve and outperform CA.
- DCEA-C vs. DCEA-UC We also observe that DCEA-UC achieves better performance than DCEA-C. As discussed in Section 4.3, the relaxation of the generative model, which allows for a three-fold increase in the number of parameters, helps improve the performance.

### 5.2 Application to simulated neural spiking data

#### Accuracy of ECDL for DCEA

We simulated time-series of neural spiking activity from neurons according to the binomial generative model. We used templates of length and, for each example , generated , where each filter appears twice uniformly random in time. Fig. 2(a) shows an example of two different means, , for . Given , we simulated two sets of binary time-series, each with , , one of which is used for training and the other for validation.

Methods For DCEA, we initialized the filters using draws from a standard Gaussian, tuned the regularization parameter (equivalently for ) manually, and trained using the unsupervised loss. We place non-negativity constraints on and thus use . For baseline, we developed and implemented a method which we refer to as binomial convolutional orthogonal matching pursuit (BCOMP). At present, there does not exist an optimization-based framework for ECDL. Existing dictionary learning methods for non-Gaussian data are patch-based [Lee et al., 2009, Giryes and Elad, 2014]. BCOMP combines efficient convolutional greedy pursuit [Mailhé et al., 2011] and binomial greedy pursuit [Lozano et al., 2011]. BCOMP solves Eq. (2), but uses instead of . For more details, we refer the reader to the Appendix.

Results Fig. 2(b) demonstrates that DCEA (green) is able to learn accurately. Letting denote the estimates, we quantify the error between a filter and its estimate using the standard measure [Agarwal et al., 2016] , for . Fig. 2(c) shows the error between the true and learned filters by DCEA, as a function of epochs. The fact that the learned and the true filters match demonstrates that DCEA is indeed performing ECDL. Finally, Fig. 2(d) shows the runtime for both DCEA (on GPU) and BCOMP (on CPU) on CSC task, as a function of number of groups , where DCEA is much faster. This shows that DCEA, due to 1) its simple implementation as an unrolled NN and 2) the ease with which the framework can be deployed to GPU, is an efficient/scalable alternative to optimization-based BCOMP.

#### Generative model relaxation for ECDL

Here, we examine whether DCEA with untied weights, which implies a departure from the original convolutional generative model, can still perform ECDL accurately. To this end, we repeat the experiment from Section 5.2.1 with DCEA-UC, whose parameters are , , and . Fig. 3 shows the learned filters, , , and for , along with the true filters. For visual clarity, we only show the learned filters for which the distance to the true filters are the closest, among , , and . We observe that none of them match the true filters. In fact, the error between the learned and the true filters are bigger than the initial error.

This is in sharp contrast to the results of DCEA-C (Fig. 2(b)), where . This shows that, to accurately perform ECDL, the NN architecture needs to be strictly constrained such that it optimizes the objective formulated from convolutional generative model.

### 5.3 Neural spiking data from barrel cortex

We now apply DCEA to neural spiking data from the barrel cortex of mice recorded in response to periodic whisker deflections [Temereanca et al., 2008]. The objective is to learn the features of whisker motion that modulate neural spiking strongly. In the experiment, a piezoelectrode controls whisker position using an ideal position waveform. As the interpretability of the learned filters is important, we constrain the weights of encoder and decoder to be . DECA lets us learn, in an unsupervised fashion, the features that best explains the data.

The dataset consists of neural spiking activity from neurons in response to periodic whisker deflections. Each example consists of trials lasting ms, i.e., . Fig. 4(a) depicts a segment of data from a neuron. Each trial begins/ends with a baseline period of ms. During the middle ms, a periodic deflection with period ms is applied to a whisker by the piezoelectrode. There are total deflections, five of which are shown in Fig. 4(b). The stimulus represents ideal whisker position. The blue curve in Fig. 4(c) depicts the whisker velocity obtained as the first difference of the stimulus.

Methods We compare DCEA to -based ECDL using BCOMP (introduced in the previous section), and a generalized linear model (GLM) [McCullagh and Nelder, 1989] with whisker-velocity covariate [Ba et al., 2014]. For all three methods, we let and , initialized using the whisker velocity (Fig. 4(c), blue). We set for DCEA and set the sparsity level of BCOMP to . As in the simulation, we used to ensure non-negativity of the codes. We used trials from each neuron to learn and the remaining trials as a test set to assess goodness-of-fit. We describe additional parameters used for DCEA and the post-processing steps in the Appendix.

Results The orange and green curves from Fig. 4(c) depict the estimates of whisker velocity computed from the neural spiking data using BCOMP and DCEA, respectively. The figure demonstrates that, within one period of whisker motion, whisker velocity, and therefore position, is similar to but deviates from the ideal motion programmed into the piezoelectric device. In this experiment, the desired motion was to move the whisker in one direction and then in the opposite direction. The presence of peaks around , and ms suggest that the whisker moved upwards multiple times, likely due to whisker motion with respect to the piezoelectrode. Fig. 4(d) depicts the sparse codes that accurately capture the onset of stimulus in each of the deflection periods. The heterogeneity of amplitudes estimated by DCEA and BCOMP is indicative of the variability of the stimulus component of neural response across deflections. This is in sharp contrast to the GLM–detailed in Appendix–which uses the ideal whisker velocity (Fig. 4(c), blue) as a covariate, and assumes that neural response to whisker deflections is constant across deflections.

In Fig. 4(e), we use the Kolmogorov-Smirnov (KS) test to compare how well the DCEA, BCOMP, and the GLM fit the data for a representative neuron in the dataset [Brown et al., 2002]. KS plots are a visualization of the KS test for assessing the Goodness-of-fit of models to point-process data, such as neural spiking data (see Appendix for details). The figure shows that the DCEA and BCOMP are a much better fit to the data than the GLM.

We emphasize that 1) the similarity of the learned and 2) the similar goodness-of-fit of DCEA and BCOMP to the data shows that DCEA performs ECDL. In addition, this analysis shows the power of the ECDL as an unsupervised and data-driven approach for data analysis, and a superior alternative to GLMs, where the features are hand-crafted.

## 6 Conclusion

We introduced a class of neural networks based on a generative model for convolutional dictionary learning (CDL) using data from the natural exponential-family, such as count-valued and binary data. The proposed class of networks, which we termed deep convolutional exponential auto-encoder (DCEA), is competitive compared to state-of-the-art supervised Poisson image denoising algorithms, with an order of magnitude fewer trainable parameters.

We analyzed gradient dynamics of shallow exponential-family auto-encoder (i.e., unfold the encoder once) for binomial distribution and proved that when trained with gradient descent, the network recovers the dictionary corresponding to the binomial generative model.

We also showed using binomial data simulated according to the convolutional exponential-family generative model that DCEA performs dictionary learning, in an unsupervised fashion, when the parameters of the encoder/decoder are constrained. The application of DCEA to neural spike data suggests that DCEA is superior to GLM analysis, which relies on hand-crafted covariates.

## Appendix A Gradient dynamics of shallow exponential auto-encoder (SEA)

###### Theorem A.1.

(informal). Given a “good” initial estimate of the dictionary from the binomial dictionary learning problem, and infinitely many examples, the binomial SEA, when trained by gradient descent through backpropagation, learns the dictionary.

###### Theorem A.2.

Suppose the generative model satisfies (A1) - (A12). Given infinitely many examples (i.e., ), the binomial SEA with trained by approximate gradient descent followed by normalization using the learning rate of (i.e., ) recovers . More formally, there exists such that at every iteration , .

In proof of the above theorem, our approach is similar to [Nguyen et al., 2019].

### a.1 Generative model and architecture

We have binomial observations where can be seen as sum of independent Bernoulli random variables (i.e., ). We can express , where is the inverse of the corresponding link function (sigmoid), is a matrix dictionary, and is a sparse vector. Hence, we have

 E[yj]=μ=eAx∗1+eAx∗=σ(Ax∗). (8)

In this analysis, we assume that there are infinitely many examples (i.e., ), hence, we use the expectation of the gradient for backpropagation at every iteration. We also assume that there are infinite number of Bernoulli observation for each binomial observation (i.e., ). Hence, from the Law of Large Numbers, we have the following convergence in probability

 limMj→∞1Mjyj=limMj→∞1MjMj∑m=1\mathbbm1jm=μ=σ(Ax∗), (9)

We drop for ease of notation. Algorithm 1 shows the architecture when the code is initialized to . are the weights of the auto-encoder. The encoder is unfolded only once and the step size of the proximal mapping is set to (i.e., assuming the maximum singular value of is , then is the largest step size to ensure convergence of the encoder as the first derivative of sigmoid is bounded by .

where , and at the first layer of the encoder is sigmoid of the initial code estimate (i.e., ). From the definition of the proximal operator, we can see that where is an indicator function.

### a.2 Assumptions and definitions

Given the following definition and notations,

• We say is -close to where we define it as if .

• We say is -near to if is -close to and .

• We say a unit-norm columns matrix is -incoherent if for every pair of columns, .

• We define that column of as .

• We say is -correlated to if . Hence, .

• From the binomial likelihood, the loss would be .

• We denote the expectation of the gradient of the loss defined in (D6) with respect to to be .

• denotes the matrix with column removed, and denotes excluding .

• denotes (i.e., the element fo the vector ).

• denotes the set , and denotes excluding .

• For , indicates a matrix with columns from the set . Similarly, for , indicates a vector containing only the elements with indices from .

we assume the generative model satisfies the following assumptions:

• Let the code be -sparse and have support S (i.e., ) where each element of is chosen uniformly at random without replacement from the set . Hence, and .

• Each code is bounded (i.e., ) where , and . Then .

• Given the support, we assume is i.i.d, zero-mean, and has symmetric probability density function. Hence, and where .

• From the forward pass of the encoder, with high probability. We call this code consistency, a similar definition from [Nguyen et al., 2019]. Hence, .

• We assume .

• Given , we have and .

• is -near ; thus, .

• is -incoherent.

• is -correlated to .

• Given (A7) and (A8), for any , we have .

• We assume the network is trained by approximate gradient descent followed by normalization using the learning rate of . Hence, the gradient update for column at iteration is . At the normalization step, , we enforce . Lemma 5 in [Nguyen et al., 2019] shows that descent property can also be achieved with the normalization step.

• We use the Taylor series of around . Hence, , where and denotes Hessian.

### a.3 Gradient derivation

First, we derive . In this derivation, by dominated convergence theorem, we interchange the limit and derivative. We also compute the limit inside as it is a continuous function.

 limM→∞∂LW∂wi=limM→∞∂c1∂wi∂LW∂c1+∂c2∂wi∂LW∂c2=∂c1∂wi∂x∂c1∂c2∂x LW∂c2+∂c2∂wi∂LW∂c2 (10)

We further expand the gradient, by replacing with its Taylor expansion. We have

 σ(Ax∗)=12+14Ax∗+ϵ, (11)

where , , and . Similarly,

 σ(4WSWTS(μ−12))=12+WSWTS(μ−12)+~ϵ, (12)

where , , and . Again, replacing with Taylor expansion of , we get

 σ(WSWTS(μ−12))=12+WSWTS(14Ax∗+ϵ)+~ϵ. (13)

By symmetry, . The expectation of gradient would be

 gi=E[(\mathbbm1xi≠0(Ax∗+4ϵ)wTi+\mathbbm1xi≠0wTi(Ax∗+4ϵ)I)(14(WSWTS−I)(Ax∗)+(WSWT% S−I)ϵ+~ϵ)]. (14)

### a.4 Gradient dynamics

Given the code consistency from the forward pass of the encoder, we replace with and denote the error by as below which is small for large  [Nguyen et al., 2019].

 γ=E[(\mathbbm1x∗i≠0−\mathbbm1xi≠0)((Ax∗+4ϵ)wTi+wTi(Ax∗+4ϵ)I)(14(WSWTS−I)(Ax∗)+(WSWT% S−I)ϵ+~ϵ)]. (15)

Now, we write as

 gi=E[\mathbbm1x∗i≠0((Ax∗+4ϵ)wTi+wTi(Ax∗+4ϵ)I)(14(WSWTS−I)(Ax∗)+(WSWTS−I)ϵ+~ϵ)]+γ. (16)

We can see that if then hence, . Thus, in our analysis, we only consider the case . We decompose as below.

 gi=g(1)i+g(2)i+g(3)i+γ, (17)

where

 g(1)i =E[14wTi(Ax∗+4ϵ)(WSWTS−I)Ax∗]. (18)
 g(2)i =E[14(Ax∗+4ϵ)wTi(WSWTS−I)Ax∗]. (19)
 g(3)i (20)

We define

 g(1)i,S =E[14wTi(Ax∗+4ϵ)(WSWTS−I)Ax∗∣S]. (21)
 g(2)i,S =E[14(Ax∗+4ϵ)wTi(WSWTS−I)Ax∗∣S]. (22)
 g(3)i,S (23)

Hence, , , and where the expectations are with respect to the support S. We compute next.

 g(1)i,S =E[14wTi(Ax∗+4ϵ)(WSWTS−I)Ax