Sparsely Activated Networks
Abstract
Previous literature on unsupervised learning focused on designing structural priors with the aim of learning meaningful features. However, this was done without considering the description length of the learned representations, which is a direct and unbiased measure of model complexity. In this paper, we first introduce the $\varphi$ metric, which evaluates unsupervised models based on their reconstruction accuracy and the degree of compression of their internal representations. We then define two activation functions (Identity, ReLU) as a base of reference and three sparse activation functions (top-k absolutes, Extrema-Pool indices, Extrema) as candidate structures that minimize the previously defined $\varphi$. Lastly, we present Sparsely Activated Networks (SANs), which consist of kernels with shared weights that, during encoding, are convolved with the input and then passed through a sparse activation function. During decoding, the same weights are convolved with the sparse activation map, and the partial reconstructions from each weight are then summed to reconstruct the input. We compare SANs using the five previously defined activation functions on a variety of datasets (Physionet, UCI-epilepsy, MNIST, FMNIST) and show that models selected using $\varphi$ have a small description representation length and consist of interpretable kernels.
I Introduction
Deep Neural Networks (DNNs) [25] use multiple stacked layers containing weights and activation functions that transform the input to intermediate representations during the feed-forward pass. Using backpropagation [35], the gradient of the error of the output w.r.t. each weight is efficiently calculated and passed to an optimization function such as Stochastic Gradient Descent or Adam [21], which updates the weights so that the output of the network converges to the desired output. DNNs were successful in utilizing big data and powerful parallel processing units and achieved state-of-the-art performance in problems such as image [23] and speech recognition [15]. However, these breakthroughs have come at the expense of increased description length of the learned representations, which in sparsely represented DNNs is proportional to the number of:

weights of the model and

non-zero activations.
The use of a large number of weights as a design choice in architectures such as Inception [41], VGGnet [37] and ResNet [17] (usually by increasing the depth) was followed by research that highlighted the weight redundancy of DNNs. It was demonstrated that DNNs easily fit random labelings of the data [45] and that in any DNN there exists a subnetwork that can solve the given problem with the same accuracy as the original one [11].
Moreover, DNNs with a large number of weights have higher storage requirements and are slower during inference. Previous literature addressing this problem has focused on weight pruning from trained DNNs [1] and weight pruning during training [27]. Pruning reduces the model capacity for use in environments with low computational capabilities or low inference time requirements, and helps reduce co-adaptation between neurons, a problem that was also addressed by the use of Dropout [40]. Pruning strategies, however, only take into consideration the number of weights of the model.
The other element that affects the description length of the representation of DNNs is the number of non-zero activations in the intermediate representations, which is related to the concept of activity sparseness. In neural networks, sparseness can be applied to the connections between neurons or to the activation maps [24]. Although sparseness in the activation maps is usually enforced in the loss function by adding a regularization or Kullback-Leibler divergence term [22], we could also achieve sparsity in the activation maps with the use of an appropriate activation function.
Initially, bounded functions such as sigmoid and tanh were used; however, besides producing dense activation maps, they also present the vanishing gradients problem [5]. Rectified Linear Units (ReLUs) were later proposed [12, 29] as an activation function that solves the vanishing gradients problem and increases the sparsity of the activation maps. Although ReLU creates exact zeros (unlike its predecessors sigmoid and tanh), its activation map consists of sparsely separated but still dense areas (Fig. 1, ReLU) instead of sparse spikes. The same applies to generalizations of ReLU, such as Parametric ReLU [16] and Maxout [14]. Recently, in Sparse Autoencoders [28] the authors used an activation function that applies thresholding until the $k$ most active activations remain; however, this non-linearity covers a limited area of the activation map by creating sparsely disconnected dense areas (Fig. 1, top-k absolutes), similar to the ReLU case.
Moreover, activation functions that produce continuous-valued activation maps (such as ReLU) are less biologically plausible, because biological neurons are rarely in their maximum saturation regime [9] and use spikes to communicate instead of continuous values [4]. Previous literature has also demonstrated the increased biological plausibility of sparseness in artificial neural networks [33]. Spike-like sparsity in activation maps has been thoroughly researched in the more biologically plausible rate-based network models [18], but it has not been thoroughly explored as a design option for activation functions combined with convolutional filters.
The increased number of weights and non-zero activations makes DNNs more complex, and thus more difficult to use in problems that require tracing the causality of the output to a specific set of neurons. The majority of domains where machine learning is applied, including critical areas such as healthcare [6], require models to be interpretable and explainable before considering them as a solution. Although these properties can be increased using sensitivity analysis [36], deconvolution methods [43], Layer-wise Relevance Propagation [3] and Local Interpretable Model-agnostic Explanations [34], it would be preferable to have self-interpretable models.
Moreover, considering that DNNs learn to represent data using the combined set of trainable weights and non-zero activations during the feed-forward pass, an interesting question arises:
What are the implications of trading off the reconstruction error of the representations with their compression ratio w.r.t. the original data?
Previous work by Blier et al. [7] demonstrated the ability of DNNs to losslessly compress the input data and the weights, but without considering the number of non-zero activations. In this work we relax the lossless requirement and also consider neural networks purely as function approximators instead of probabilistic models. The contributions of this paper are the following proposals:

The $\varphi$ metric that evaluates unsupervised models based on how compressed their learned representations are w.r.t. the original data and how accurate their reconstruction is.
In section II we define the $\varphi$ metric, then in section III we define the five tested activation functions along with the architecture and training procedure of SANs, and in section IV we evaluate SANs on the Physionet [13], UCI-epilepsy [2], MNIST [26] and FMNIST [42] databases and provide visualizations of the intermediate representations and results. In section V we discuss the findings of the experiments and the limitations of SANs, and finally in section VI we present the concluding remarks and future work.
II Metric
Let $\mathcal{M}$ be a model with $q$ kernels, each one with $m$ samples, and a reconstruction loss function $\mathcal{L}$, for which we have:
(1) $\hat{x} = \mathcal{M}(x)$
, where $x \in \mathbb{R}^n$ is the input vector and $\hat{x}$ is a reconstruction of $x$. For the definition of the $\varphi$ metric we use a neural network model that consists of convolutional filters; however, this can be easily generalized to other architectures.
The $\varphi$ metric evaluates a model based on two concepts: its verbosity and its accuracy. Verbosity in neural networks can be perceived as inversely proportional to the compression ratio of the representations. We calculate the number of weights $W$ of $\mathcal{M}$ as follows:
(2) $W = q \, m$
We also calculate the number of non-zero activations $A_x$ of $\mathcal{M}$ for input $x$ as:
(3) $A_x = \sum_{i=1}^{q} \| \alpha^{(i)} \|_0$
, where $\| \cdot \|_0$ denotes the $\ell_0$ pseudo-norm and $\alpha^{(i)}$ is the activation map of the $i^{th}$ kernel. Then using Equations 2 and 3 we define the compression ratio $CR$ of $\mathcal{M}$ w.r.t. $x$ as:
(4) $CR = \frac{W + (\dim(x) + 1) A_x}{n}$
, where $\dim$ denotes dimensionality and $n$ was previously defined as the cardinality of $x$. The number of activations is multiplied by $\dim(x) + 1$ because, to reconstruct $x$, we need the spatial position of each non-zero activation ($\dim(x)$ coordinates) in addition to its amplitude. Moreover, using that definition of $CR$, there is a desirable trade-off between using a larger kernel with fewer instances and a smaller kernel with more instances, based on which kernel size minimizes $CR$. This definition of $CR$ allows us to set a base of reference of $CR = 1$ for models whose representational capacity is equal to the number of samples of the input.
Regarding the accuracy, we define the normalized reconstruction loss as follows:
(5) $\tilde{\mathcal{L}}(\hat{x}, x) = \frac{\mathcal{L}(\hat{x}, x)}{\mathcal{L}(0, x)}$
This definition of $\tilde{\mathcal{L}}$ allows us to set a base of reference of $\tilde{\mathcal{L}} = 1$ for cases in which the reconstruction the model performs is equivalent to that of a model that performs constant reconstructions independent of the input.
Finally, using Equations 4 and 5 we define the $\varphi$ metric of $\mathcal{M}$ w.r.t. $x$ as:
(6) $\varphi = \| (\tilde{\mathcal{L}}(\hat{x}, x), CR) \|_2$
, where $\| \cdot \|_2$ denotes the Euclidean norm. The rationale behind defining $\varphi$ is to satisfy the need for a unified metric that takes into consideration both the ‘verbosity’ of a model and its ‘accuracy’.
Regarding hyperparameter selection, we also define the mean $\varphi$ of a dataset or a mini-batch w.r.t. $\mathcal{M}$ as:
(7) $\bar{\varphi} = \frac{1}{l} \sum_{j=1}^{l} \varphi_j$
, where $\varphi_j$ is the value of $\varphi$ for the $j^{th}$ observation and $l$ is the number of observations in the dataset or the batch size.
$\varphi$ is non-differentiable due to the presence of the $\ell_0$ pseudo-norm in Eq. 3. A way to overcome this is to use $\mathcal{L}$ as the differentiable optimization function during training and $\bar{\varphi}$ as the metric for model selection during validation, on which hyperparameter value decisions (such as kernel size) are made.
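The metric of this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the compression ratio is taken to be (W + (dim + 1) A) / n, the normalized loss divides the loss by that of an all-zero reconstruction, and the metric is the Euclidean norm of the pair. The function names (`compression_ratio`, `normalized_loss`, `flithos`) are hypothetical and not part of the original work.

```python
import math


def mae(x, x_hat):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)


def compression_ratio(num_weights, num_nonzero, n, dim=1):
    """(W + (dim + 1) * A) / n: each activation stores dim coordinates plus an amplitude."""
    return (num_weights + (dim + 1) * num_nonzero) / n


def normalized_loss(x, x_hat, loss=mae):
    """Loss relative to an all-zero reconstruction (assumes x is not all zeros)."""
    return loss(x, x_hat) / loss(x, [0.0] * len(x))


def flithos(x, x_hat, activation_maps, kernels, dim=1):
    """Euclidean norm of (normalized loss, compression ratio) for one 1D example."""
    num_weights = sum(len(w) for w in kernels)
    num_nonzero = sum(1 for a in activation_maps for v in a if v != 0)
    cr = compression_ratio(num_weights, num_nonzero, len(x), dim)
    return math.hypot(normalized_loss(x, x_hat), cr)


# Identity-like case: perfect reconstruction, but every sample is an activation.
x = [1.0, -2.0, 3.0, -4.0]
print(flithos(x, x, [x], [[1.0]]))  # CR = (1 + 2 * 4) / 4 = 2.25, loss = 0
```

In this toy case the model reconstructs perfectly yet is penalized for verbosity, which is the behaviour the metric is designed to expose.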
III Sparsely Activated Networks
III-A Sparse Activation Functions
In this subsection we define five activation functions and their corresponding sparsity density parameter for which we have:
(8) 
We choose the values of the sparsity density parameter for each activation function in such a way as to have approximately the same number of activations, for a fair comparison of the sparse activation functions.
Identity
$\mathrm{Identity}(s) = s$. The Identity activation function serves as a baseline and does not change its input, as shown in Fig. 1 (Identity). For this case the sparsity density parameter does not apply.
ReLU
$\mathrm{ReLU}(s) = \max(0, s)$. The ReLU activation function produces sparsely disconnected but internally dense areas, as shown in Fig. 1 (ReLU), instead of sparse spikes. For this case the sparsity density parameter does not apply.
top-k absolutes
The top-k absolutes function (defined in Algorithm 1) keeps the indices of the $k$ activations with the largest absolute value and zeros out the rest, where . We set , where . Top-k absolutes is sparser than ReLU, but some extrema are overactivated relative to others that are not activated at all, as shown in Fig. 1 (top-k absolutes).
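A minimal plain-Python sketch of the top-k absolutes idea for a 1D similarity vector follows. It is not the authors' Algorithm 1; the function name `topk_absolutes` and the tie-breaking rule (earlier index wins) are assumptions made for illustration.

```python
def topk_absolutes(s, k):
    """Keep the k entries of s with the largest absolute value and zero the rest."""
    if k <= 0:
        return [0.0] * len(s)
    # Indices ordered by descending absolute value; ties are broken by position.
    order = sorted(range(len(s)), key=lambda i: (-abs(s[i]), i))
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(s)]


print(topk_absolutes([0.1, -3.0, 2.0, 0.5, -2.5], 2))
# [0.0, -3.0, 0.0, 0.0, -2.5]
```

Because the selection is global, several of the kept entries can sit on the same peak while other peaks receive none, which is the over-activation effect described above.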
Extrema-Pool indices
The Extrema-Pool indices activation function (defined in Algorithm 2) keeps only the index of the activation with the maximum absolute amplitude from each region outlined by a grid as granular as the kernel size, and zeros out the rest. It consists of a max-pooling layer followed by a max-unpooling layer with the same parameters, while the sparsity parameter in this case is set . This activation function creates sparser activation maps than top-k absolutes; however, in cases where the pool grid is near a peak or valley, this region is activated twice (as shown in Fig. 1 (Extrema-Pool indices)).
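The pooling-then-unpooling behaviour can be imitated in plain Python as below. This is a sketch, not the authors' Algorithm 2: the name `extrema_pool_indices` is hypothetical, windows are non-overlapping, and a trailing partial window is also processed, which a pooling layer in a deep learning framework may instead drop.

```python
def extrema_pool_indices(s, pool_size):
    """In each non-overlapping window keep the entry with the largest |value|."""
    out = [0.0] * len(s)
    for start in range(0, len(s), pool_size):
        window = range(start, min(start + pool_size, len(s)))
        best = max(window, key=lambda i: abs(s[i]))  # first index wins on ties
        out[best] = s[best]
    return out


print(extrema_pool_indices([1.0, -4.0, 2.0, 3.0, 5.0, -0.5], 3))
# [0.0, -4.0, 0.0, 0.0, 5.0, 0.0]

# A single peak that straddles the grid boundary is activated twice.
print(extrema_pool_indices([0.0, 1.0, 3.0, 4.0, 2.0, 0.0], 3))
# [0.0, 0.0, 3.0, 4.0, 0.0, 0.0]
```

The second call illustrates the double activation near a pool boundary mentioned in the text.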
Extrema
The Extrema activation function (defined in Algorithm 3) detects candidate extrema using zero crossings of the first derivative, then sorts them in descending order and gradually eliminates those extrema that have a smaller absolute amplitude than a neighboring extremum within a predefined minimum extrema distance. Imposing a minimum extrema distance on the extrema detection algorithm makes the activation map sparser than in the previous cases and solves the problem of double extrema activations that Extrema-Pool indices have (as shown in Fig. 1 (Extrema)). The sparsity parameter in this case is set , where is the minimum extrema distance. We set for a fair comparison between the sparse activation functions. Specifically for the Extrema activation function, we introduce a ‘border tolerance’ parameter to allow neuron activation within another neuron's activated area.
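One way to realize this description is a greedy suppression over candidate extrema, sketched below in plain Python. It is an assumption-laden illustration rather than the authors' Algorithm 3: the name `extrema` is hypothetical, candidates are interior points where the first difference changes sign (plateaus are ignored), and the ‘border tolerance’ parameter is not modelled.

```python
def extrema(s, min_distance):
    """Keep extrema of s that are at least min_distance apart, strongest first."""
    n = len(s)
    # Candidate extrema: interior points where the first difference changes sign.
    candidates = [
        i for i in range(1, n - 1)
        if (s[i] - s[i - 1]) * (s[i + 1] - s[i]) < 0
    ]
    # Visit candidates from the largest to the smallest absolute amplitude.
    candidates.sort(key=lambda i: (-abs(s[i]), i))
    kept = []
    for i in candidates:
        # Drop a candidate that lies within min_distance of a stronger kept one.
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    out = [0.0] * n
    for i in kept:
        out[i] = s[i]
    return out


signal = [0.0, 1.0, 3.0, 4.0, 2.0, 0.0, -1.0, -5.0, -1.0, 0.0, 2.0, 1.0]
print(extrema(signal, 4))
# [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, -5.0, 0.0, 0.0, 0.0, 0.0]
```

Here the weak peak at index 10 is suppressed because it lies within four samples of the stronger valley at index 7, while a single peak is never activated twice.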
III-B SAN Architecture/Training
Let $x \in \mathbb{R}^n$ be a single input datum; the following can be trivially generalized to batch inputs with different cardinalities. Let $w^{(i)} \in \mathbb{R}^m$ be the weight matrix of the $i^{th}$ kernel, initialized using a normal distribution with mean $\mu$ and standard deviation $\sigma$:
(9) $w^{(i)} \sim \mathcal{N}(\mu, \sigma)$
, where $i = 1, \dots, q$ and $q$ is the number of kernels.
First we calculate the similarity matrices $s^{(i)}$:
(10) $s^{(i)} = x * w^{(i)}$
, where $*$ is the convolution operation.
We then pass $s^{(i)}$ and a sparsity parameter $d^{(i)}$ to the sparse activation function $\phi$, resulting in the activation map $\alpha^{(i)}$:
(11) $\alpha^{(i)} = \phi(s^{(i)}, d^{(i)})$
, where $\alpha^{(i)}$ is a sparse matrix whose non-zero elements denote the spatial positions of the instances of the kernel. The exact form of $\phi$ and $d^{(i)}$ depends on the choice of the sparse activation function, as presented in section III-A.
We convolve each $\alpha^{(i)}$ with its corresponding $w^{(i)}$, resulting in the set of $r^{(i)}$, which are partial reconstructions of the input:
(12) $r^{(i)} = \alpha^{(i)} * w^{(i)}$
, consisting of sparsely reoccurring patterns of $w^{(i)}$ with varying amplitude. Finally, we reconstruct the input as the sum of the partial reconstructions:
(13) $\hat{x} = \sum_{i=1}^{q} r^{(i)}$
The Mean Absolute Error (MAE) of the input $x$ and the prediction $\hat{x}$ is then calculated:
(14) $\mathcal{L}(\hat{x}, x) = \frac{1}{n} \sum_{t=1}^{n} | \hat{x}_t - x_t |$
, where the index $t$ denotes the $t^{th}$ sample. The choice of MAE is based on the need to handle outliers in the data with the same weight as normal values. However, SANs are not restricted to MAE, and other loss functions, such as Mean Square Error, could be used.
Using backpropagation, the gradients of the loss $\mathcal{L}$ w.r.t. the $w^{(i)}$ are calculated:
(15) $\nabla \mathcal{L} = \left( \frac{\partial \mathcal{L}}{\partial w^{(1)}}, \dots, \frac{\partial \mathcal{L}}{\partial w^{(q)}} \right)$
Lastly, the $w^{(i)}$ are updated using the following learning rule:
(16) $\Delta w^{(i)} = -\lambda \frac{\partial \mathcal{L}}{\partial w^{(i)}}$
, where $\lambda$ is the learning rate.
After training, we consider $\alpha^{(i)}$ (which is calculated during the feed-forward pass from Eq. 11) and $w^{(i)}$ (which is calculated using backpropagation from Eq. 16) the compressed representation of $x$, which can be reconstructed using Equations 12 and 13:
(17) $\hat{x} = \sum_{i=1}^{q} \left( \alpha^{(i)} * w^{(i)} \right)$
Regarding the $\varphi$ metric and considering Eq. 17, our target is to estimate as accurate a representation of $x$ as possible through $\alpha^{(i)}$ and $w^{(i)}$, with the least number of non-zero activations and weights.
The general training procedure of SANs for multiple epochs using batches (instead of one example as previously shown) is presented in Algorithm 4.
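The encode/decode path described in this subsection (similarity, sparse activation, partial reconstructions, sum, MAE) can be sketched for a single 1D input as follows. This is an illustrative forward pass only, under assumptions: zero-padded ‘same’ convolution, a toy top-1 activation, and hypothetical names (`conv_same`, `top1`, `san_forward`). It does not include the gradient update and is not the authors' Algorithm 4.

```python
def conv_same(x, w):
    """1D convolution of x with kernel w, zero-padded to the length of x."""
    n, m = len(x), len(w)
    offset = (m - 1) // 2
    out = []
    for t in range(n):
        acc = 0.0
        for k in range(m):
            j = t + offset - k
            if 0 <= j < n:
                acc += x[j] * w[k]
        out.append(acc)
    return out


def top1(s):
    """Toy sparse activation: keep only the entry with the largest |value|."""
    best = max(range(len(s)), key=lambda i: abs(s[i]))
    return [v if i == best else 0.0 for i, v in enumerate(s)]


def mae(x, x_hat):
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)


def san_forward(x, kernels, activation):
    """Return the reconstruction and the activation maps for one input."""
    x_hat = [0.0] * len(x)
    activation_maps = []
    for w in kernels:
        s = conv_same(x, w)        # similarity with the kernel
        a = activation(s)          # sparse activation map
        r = conv_same(a, w)        # partial reconstruction from this kernel
        x_hat = [p + q for p, q in zip(x_hat, r)]
        activation_maps.append(a)
    return x_hat, activation_maps


x = [0.0, 0.0, 1.0, 0.0, 0.0]
x_hat, maps = san_forward(x, [[0.5, 1.0, 0.5]], top1)
print(maps[0])        # [0.0, 0.0, 1.0, 0.0, 0.0]
print(x_hat)          # [0.0, 0.5, 1.0, 0.5, 0.0]
print(mae(x, x_hat))  # 0.2
```

The compressed representation in this toy case is the three kernel weights plus one activation (position and amplitude); training would adjust the kernel so that the reconstruction error shrinks.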
IV Experiments
For all experiments the weights of the SAN kernels are initialized using the normal distribution with and . We used Adam [21] as the optimizer with learning rate , , , epsilon without weight decay. For implementing and training SANs we used PyTorch [32] with an NVIDIA Titan X Pascal GPU (12GB RAM) and a 12 Core Intel i7-8700 CPU @ 3.20GHz on a Linux operating system.
IV-A Comparison of $\varphi$ for sparse activation functions and various kernel sizes in Physionet
We study the effect on $\varphi$ of the combined choice of the kernel size and the sparse activation functions that were defined in section III-A.
Datasets
We use one signal from each of 15 signal datasets from Physionet listed in the first column of Table I. Each signal consists of samples which in turn is split in signals of samples each, to create the training ( signals), validation ( signals) and test datasets ( signals). The only preprocessing that is done is mean subtraction and division by one standard deviation on the samples signals.
Experiment Setup
We train five SANs (one for each sparse activation function) for each of the Physionet datasets, for epochs with a batch size of with one kernel of varying size in the range . During validation we selected the models with the kernel size that achieved the best $\bar{\varphi}$ out of all epochs. During testing we feed the test data into the selected model and calculate $CR$, $\tilde{\mathcal{L}}$ and $\bar{\varphi}$ for this set of hyperparameters, as shown in Table I. For the Extrema activation we set a ‘border tolerance’ of three samples.
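The selection step in this setup amounts to picking the kernel size with the lowest mean metric on the validation set. A small sketch, assuming per-example metric values have already been computed for each candidate kernel size (the dictionary layout, the function names and the numbers are hypothetical):

```python
def mean_metric(values):
    """Mean of per-example metric values over a dataset or mini-batch."""
    return sum(values) / len(values)


def select_kernel_size(validation_values):
    """validation_values maps kernel size -> list of per-example metric values."""
    scores = {m: mean_metric(v) for m, v in validation_values.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Made-up validation values for three candidate kernel sizes.
candidates = {5: [0.9, 0.7], 10: [0.4, 0.5], 20: [0.6, 0.6]}
print(select_kernel_size(candidates))
```

With these made-up values the kernel size 10 is selected, since it has the lowest mean.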
Results
The three separate clusters depicted in Fig. 3, and the aggregated density plot in Fig. 4 (density plot of $CR$ against reconstruction loss), between the Identity activation function, the ReLU and the rest show the effect of a sparser activation function on the representation. The sparser an activation function is, the more it compresses, sometimes at the expense of reconstruction error. However, visual inspection of Fig. 5 suggests that the learned kernels of the SANs with sparser activation maps (Extrema-Pool indices and Extrema) correspond to the reoccurring patterns in the datasets, and are thus highly interpretable. These results suggest that reconstruction error by itself is not a sufficient metric for decomposing data into interpretable components. Trying solely to achieve lower reconstruction error (as in the case of the Identity activation function) produces noisy learned kernels, while using the combined measure of reconstruction error and compression ratio (smaller $\bar{\varphi}$) results in interpretable kernels. Comparing the differences in $\bar{\varphi}$ between the Identity, the ReLU and the remaining sparse activation functions in Fig. 4 ($\bar{\varphi}$ against kernel size $m$), we notice that the latter produce a minimum region in which we observe interpretable kernels.
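Identity  ReLU  top-k absolutes  Extrema-Pool idx  Extrema  
Datasets  $m$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $m$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $m$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $m$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $m$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$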
apnea-ecg  1  2.00  0.03  2.00  19  0.70  0.53  0.87  74  0.10  0.37  0.39  51  0.09  0.47  0.48  72  0.10  0.31  0.32 
bidmc  1  2.00  0.04  2.00  4  0.82  0.50  0.96  5  0.41  0.64  0.76  10  0.21  0.24  0.32  113  0.13  0.30  0.32 
bpssrat  1  2.00  0.02  2.00  1  0.85  0.51  0.99  10  0.21  0.63  0.67  8  0.26  0.45  0.52  8  0.24  0.30  0.38 
cebsdb  1  2.00  0.01  2.00  3  0.95  0.51  1.07  5  0.41  0.62  0.74  12  0.18  0.21  0.28  71  0.09  0.45  0.46 
ctu-uhb-ctgdb  1  2.00  0.01  2.00  1  0.48  0.51  0.71  7  0.29  0.60  0.66  9  0.23  0.44  0.49  45  0.07  0.57  0.57 
drivedb  1  2.00  0.04  2.00  20  0.51  0.54  0.74  20  0.12  0.67  0.68  13  0.17  0.69  0.71  19  0.10  0.72  0.73 
emgdb  1  2.00  0.04  2.00  1  0.94  0.50  1.07  7  0.29  0.62  0.68  9  0.23  0.48  0.53  7  0.15  0.51  0.53 
mitdb  1  2.00  0.03  2.00  61  0.78  0.49  0.92  7  0.29  0.52  0.59  10  0.21  0.44  0.49  229  0.24  0.38  0.45 
noneeg  1  2.00  0.01  2.00  6  0.91  0.57  1.08  4  0.50  0.59  0.77  37  0.09  0.49  0.50  15  0.12  0.36  0.38 
prcp  1  2.00  0.03  2.00  1  1.00  0.51  1.12  5  0.41  0.59  0.71  23  0.11  0.41  0.42  105  0.12  0.42  0.44 
shhpsgdb  1  2.00  0.02  2.00  4  0.85  0.60  1.05  6  0.34  0.69  0.77  7  0.29  0.42  0.51  15  0.10  0.53  0.54 
slpdb  1  2.00  0.03  2.00  7  0.72  0.53  0.90  7  0.29  0.52  0.60  232  0.24  0.29  0.37  218  0.23  0.36  0.43 
sufhsdb  1  2.00  0.03  2.00  38  1.02  0.24  1.05  5  0.41  0.55  0.68  18  0.13  0.36  0.39  17  0.12  0.26  0.28 
voiced  1  2.00  0.01  2.00  41  0.95  0.26  0.98  36  0.09  0.56  0.57  70  0.10  0.41  0.43  67  0.10  0.41  0.43 
wrist  1  2.00  0.04  2.00  56  0.74  0.62  0.96  5  0.41  0.49  0.63  9  0.23  0.43  0.49  173  0.18  0.46  0.50 
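As a quick consistency check on Table I, assume that each group of four columns lists the kernel size, $CR$, $\tilde{\mathcal{L}}$ and the metric, in that order. The bidmc entry for Extrema-Pool indices can then be approximately recomputed from Eq. 6. The match is only approximate, because the tabulated values are rounded and a mean of norms need not equal the norm of the means:

```latex
\bar{\varphi} \approx \sqrt{CR^2 + \tilde{\mathcal{L}}^2}
             = \sqrt{0.21^2 + 0.24^2}
             = \sqrt{0.0441 + 0.0576}
             = \sqrt{0.1017}
             \approx 0.32
```

This agrees with the value 0.32 reported in that row.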
IV-B Evaluation of the reconstruction of SANs using a Supervised CNN on UCI-epilepsy
We study the quality of the reconstructions of SANs by training a supervised 1D Convolutional Neural Network (CNN) on the output of each SAN. We also study the effect that the kernel size $m$ has on $\bar{\varphi}$ and the accuracy of the classifier for the five sparse activation functions.
Dataset
We use the UCI-epilepsy recognition dataset that consists of signals each one with samples (23.5 seconds).
The dataset is annotated into five classes with signals for each class.
For the purposes of this paper we use a variation of the database
Experiment Setup
First, we merge the tumor classes ( and ) and the eyes classes ( and ) resulting in a modified dataset of three classes (tumor, eyes, epilepsy). We then split the signals into , and ( signals) as training, validation and test data respectively and normalize in the range using the global max and min. For the SANs, we used two kernels with a varying size in the range of and trained for epochs with a batch size of . After training, we choose the model that achieved the lowest $\bar{\varphi}$ out of all epochs.
During supervised learning the weights of the kernels are frozen and a CNN is stacked on top of the reconstruction output of the SANs. The CNN feature extractor consists of two convolutional layers with and filters and kernel size , each one followed by a ReLU and a MaxPool with pool size . The classifier consists of three fully connected layers with , and units. The first two fully connected layers are followed by a ReLU while the last one produces the predictions. The CNN is trained for an additional epochs with the same batch size and model selection procedure as with SANs and categorical cross-entropy as the loss function. For the Extrema activation we set a ‘border tolerance’ of two samples.
Results
As shown in Table II, although we use a significantly reduced representation size, the classification accuracy differences ($A_{\pm\%}$) are retained, which suggests that SANs choose the most important features to represent the data. For example, for the Extrema activation function, there is an increase in accuracy of (the baseline CNN on the original data achieved ), although a reduced representation is used with only of the size w.r.t. the original data.
Identity  ReLU  top-k absolutes  Extrema-Pool idx  Extrema  
$m$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$
8  4.09  0.04  4.09  +5.6  2.09  0.02  2.09  +4.3  0.58  0.73  0.93  -0.4  0.58  0.41  0.71  -7.8  0.37  0.45  0.59  +2.8
9  4.10  0.03  4.10  +1.4  2.10  0.03  2.10  -44.3  0.53  0.74  0.91  -1.2  0.53  0.41  0.67  -2.6  0.35  0.42  0.56  +0.8
10  4.11  0.02  4.11  +8.0  3.61  0.11  3.62  -26.6  0.49  0.76  0.91  -13.4  0.49  0.43  0.65  +1.7  0.35  0.42  0.56  +2.4
11  4.12  0.02  4.12  -44.3  2.12  0.03  2.13  +6.2  0.48  0.76  0.90  -12.2  0.48  0.41  0.63  -1.2  0.34  0.41  0.54  +1.0
12  4.13  0.02  4.13  +6.6  2.13  0.03  2.13  +2.1  0.45  0.77  0.89  -5.7  0.45  0.44  0.63  -1.9  0.34  0.42  0.55  +3.0
13  4.15  0.06  4.15  +4.1  2.15  0.03  2.15  +6.5  0.44  0.78  0.89  -13.5  0.44  0.44  0.62  -9.3  0.34  0.42  0.55  +0.8
14  4.16  0.06  4.16  +6.2  3.66  0.11  3.66  -44.3  0.43  0.78  0.89  -10.4  0.43  0.46  0.63  -2.1  0.34  0.43  0.55  -1.4
15  4.17  0.03  4.17  +9.1  2.17  0.03  2.17  -8.4  0.42  0.78  0.89  -3.5  0.42  0.46  0.62  +1.8  0.34  0.43  0.55  -2.0
IV-C Evaluation of the reconstruction of SANs using a Supervised FNN on MNIST and FMNIST
Dataset
For the same task as the previous one but for 2D, we use MNIST which consists of a training dataset of greyscale images with handwritten digits and a test dataset of images each one having size of . The exact same procedure is used for FMNIST [42].
Experiment Setup
The models consist of two kernels with a varying size in the range of . We use images from the training dataset as a validation dataset and train on the rest for epochs with a batch size of . We do not apply any preprocessing on the images.
During supervised learning the weights of the kernels are frozen and a one-layer fully connected network (FNN) is stacked on top of the reconstruction output of the SANs. The FNN is trained for an additional epochs with the same batch size and model selection procedure as with SANs and categorical cross-entropy as the loss function. For the Extrema activation we set a ‘border tolerance’ of two samples.
Results
As shown in Table III, the accuracies achieved by the reconstructions of some SANs do not drop proportionally compared to those of an FNN trained on the original data (), although they have been heavily compressed. It is interesting to note that in some cases SAN reconstructions, such as those for the Extrema-Pool indices, performed even better than the original data. This suggests an overwhelming presence of redundant information in the raw pixels of the original data and further indicates that SANs extract the most representative features of the data.
MNIST  Identity  ReLU  top-k absolutes  Extrema-Pool idx  Extrema  
$m$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$
1  1.16  0.00  1.16  -0.1  1.16  0.01  1.16  -1.6  1.16  0.00  1.16  -0.3  1.16  0.00  1.16  -0.1  0.09  0.89  0.90  -11.3
2  1.55  0.01  1.55  -0.3  0.78  0.01  0.78  -0.0  1.37  0.02  1.37  -0.6  0.48  0.61  0.79  +1.3  0.09  0.83  0.83  -7.6
3  1.93  0.00  1.93  -0.5  1.16  0.03  1.16  -0.8  0.63  0.25  0.68  -1.6  0.30  0.51  0.59  +0.0  0.08  0.50  0.51  -7.3
4  2.30  0.03  2.30  -0.5  1.53  0.02  1.53  -0.6  0.39  0.40  0.55  -3.7  0.22  0.59  0.63  -0.6  0.07  0.56  0.57  -10.3
5  2.66  0.09  2.66  -0.8  0.73  0.03  0.73  -0.2  0.20  0.55  0.59  -5.3  0.16  0.60  0.62  -0.7  0.07  0.57  0.57  -8.8
6  3.02  0.06  3.02  -0.6  0.60  0.01  0.60  -0.5  0.14  0.61  0.63  -8.8  0.12  0.63  0.65  -1.7  0.05  0.60  0.61  -11.4
FMNIST  Identity  ReLU  top-k absolutes  Extrema-Pool idx  Extrema  
$m$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$  $CR$  $\tilde{\mathcal{L}}$  $\bar{\varphi}$  $A_{\pm\%}$
1  3.00  0.01  3.00  -1.7  1.50  0.00  1.50  +1.0  3.00  0.01  3.00  -0.5  3.00  0.01  3.00  -3.8  0.35  0.86  0.93  -4.0
2  3.58  0.01  3.58  +2.0  3.00  0.05  3.00  +1.3  1.50  0.31  1.55  -5.0  0.99  0.58  1.16  -0.6  0.23  0.81  0.84  -6.6
3  3.94  0.05  3.94  +2.0  2.01  0.02  2.01  +1.0  0.63  0.63  0.89  -7.0  0.48  0.52  0.72  -0.7  0.16  0.65  0.68  -9.3
4  4.22  0.06  4.22  -0.1  0.01  1.00  1.00  -71.5  0.39  0.69  0.79  -9.4  0.32  0.61  0.70  -1.9  0.09  0.70  0.71  -9.6
5  4.45  0.04  4.45  +1.8  1.92  0.02  1.92  +1.3  0.20  0.75  0.78  -15.0  0.19  0.60  0.63  -5.0  0.07  0.64  0.64  -12.2
6  4.70  0.04  4.70  +0.8  2.80  0.05  2.80  +2.2  0.14  0.77  0.78  -19.6  0.13  0.66  0.67  -4.3  0.05  0.69  0.70  -13.7
V Discussion
SANs combined with the $\bar{\varphi}$ metric compress the description of the data in the way a minimum description length framework would, by encoding them into the kernel weights $w^{(i)}$ and the activation maps $\alpha^{(i)}$. The experiments in Section IV show that the use of Identity, ReLU and (to a lesser degree) top-k absolutes produces noisy features, while Extrema-Pool indices and Extrema produce more robust features and can be configured with parameters (kernel size and ) whose values can be derived by simple visual inspection of the data.
From the point of view of Sparse Dictionary Learning, SAN kernels could be seen as the atoms of a learned dictionary specializing in interpretable pattern matching (e.g. for Electrocardiogram (ECG) input the kernels of SANs are ECG beats) and the sparse activation map as the representation. The fact that SANs are wide with fewer, larger kernels instead of deep with smaller kernels makes them more interpretable than DNNs, in some cases without sacrificing significant accuracy.
An advantage of SANs compared to Sparse Autoencoders [30] is that the constraint of activation proximity can be applied individually to each example instead of requiring the computation of a forward pass of all examples. Additionally, SANs create exact zeros instead of near-zeros, which reduces co-adaptation between instances of the neuron activation.
$\varphi$ could be seen as an alternative formalization of Occam’s razor [38] to Solomonoff’s theory of inductive inference [39], but with a deterministic interpretation instead of a probabilistic one. The cost of the description of the data could be seen as proportional to the number of weights and the number of non-zero activations, while the quality of the description is proportional to the reconstruction loss. The $\varphi$ metric is also related to rate-distortion theory [8], in which the maximum distortion is defined according to human perception, which however inevitably introduces a bias. There is also a relation with the field of Compressed Sensing [10], in which the sparsity of the data is exploited, allowing us to reconstruct it with fewer samples than the Nyquist-Shannon theorem requires, and with the field of Robust Feature Extraction [20], where robust features are generated with the aim of characterizing the data. Olshausen et al. [31] presented an objective function that considers subjective measures of sparseness of the activation maps; in this work, however, we use the direct measure of compression ratio. Previous work [44] has used a weighted combination of the number of neurons, percentage root-mean-squared difference and a correlation coefficient as the optimization function of an FNN, but without taking into consideration the number of non-zero activations.
A limitation of SANs is the use of varying amplitude-only kernels, which are not sufficient for more complex data and also do not fully utilize the compressibility of the data. A possible solution would be to use a grid sampler [19] on the kernel, allowing it to learn more general transformations (such as scale) than simple amplitude variability. However, additional kernel properties should be chosen in correspondence with the $\varphi$ metric, i.e. the model should compress more with decreased reconstruction loss.
VI Conclusions and future work
In this paper we first defined the $\varphi$ metric to evaluate how well models trade off reconstruction loss with compression. We then defined SANs, which have minimal structure and, with the use of sparse activation functions, learn to compress data without losing important information. Using Physionet datasets and MNIST we demonstrated that SANs are able to create high quality representations with interpretable kernels.
The minimal structure of SANs makes them easy to use for feature extraction, clustering and time-series forecasting. Other future work on SANs includes:

Applying cosine annealing to the extrema distance in order to increase the degree of freedom of the kernels.

Imposing the minimum extrema distance along all similarity matrices for multiple kernels, thus making kernels compete for territory.

Applying dropout to the activations in order to correct weights that have overshot, especially when they are initialized with high values. However, the effect of dropout on SANs would generally be negative, since SANs have far fewer weights than DNNs and thus need less regularization.

Using SANs with dynamically created kernels that might be able to learn multimodal data from variable sources (e.g. from ECG to respiratory) without destroying previously learned weights.
Acknowledgment
This work was supported by the European Union’s Horizon 2020 research and innovation programme under Grant agreement 769574. We gratefully acknowledge the support of NVIDIA with the donation of the Titan X Pascal GPU used for this research.
Footnotes
 The use of the symbol $\varphi$ comes from the first character of the combined Greek word φλύθος = φλύ+θος (flithos: pronounced as fleethos). It consists of the first part of the word φλύαρος (meaning verbose) and the second part of λάθος (meaning wrong). Φλύθος is literally defined as: Giving inaccurate information using many words; the state of being wrong and wordy at the same time.
 Previous literature refers to it as the ‘hidden variable’ but we use a more direct naming that suits the context of this paper.
 We use convolution instead of cross-correlation only as a matter of compatibility with previous literature and computational frameworks. Using cross-correlation would produce the same results and would not require flipping the kernels during visualization.
 https://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition
References
 (2017) Net-trim: convex pruning of deep neural networks with performance guarantee. In Advances in Neural Information Processing Systems, pp. 3177–3186. Cited by: §I.
 (2001) Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: dependence on recording region and brain state. Physical Review E 64 (6), pp. 061907. Cited by: §I.
 (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10 (7), pp. e0130140. Cited by: §I.
 (2015) Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156. Cited by: §I.
 (1994) Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5 (2), pp. 157–166. Cited by: §I.
 (2019) Deep learning in cardiology. IEEE reviews in biomedical engineering 12, pp. 168–193. Cited by: §I.
 (2018) The description length of deep learning models. In Advances in Neural Information Processing Systems, pp. 2216–2226. Cited by: §I.
 (1971) Rate distortion theory. Prentice-Hall. Cited by: §V.
 (1996) Inhibition synchronizes sparsely connected cortical neurons within and between columns in realistic network models. Journal of computational neuroscience 3 (2), pp. 91–110. Cited by: §I.
 (2006) Compressed sensing. IEEE Transactions on information theory 52 (4), pp. 1289–1306. Cited by: §V.
 (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations, Cited by: §I.
 (2011) Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 315–323. Cited by: §I.
 (2000) PhysioBank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. Circulation 101 (23), pp. e215–e220. Cited by: §I.
 (2013) Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, Vol. 28, pp. 1319–1327. Cited by: §I.
 (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649. Cited by: §I.
 (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §I.
 (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I.
 (2018) Firing-rate models for neurons with a broad repertoire of spiking behaviors. Journal of computational neuroscience 45 (2), pp. 103–132. Cited by: §I.
 (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §V.
 (2013) Deep learning for robust feature generation in audiovisual emotion recognition. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 3687–3691. Cited by: §V.
 (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §I, §IV.
 (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §I.
 (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §I.
 (2003) Communication in neuronal networks. Science 301 (5641), pp. 1870–1874. Cited by: §I.
 (2015) Deep learning. Nature 521 (7553), pp. 436. Cited by: §I.
 (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §I.
 (2017) Runtime neural pruning. In Advances in Neural Information Processing Systems, pp. 2181–2191. Cited by: §I.
 (2013) k-Sparse autoencoders. arXiv preprint arXiv:1312.5663. Cited by: §I.
 (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814. Cited by: §I.
 (2011) Sparse autoencoder. CS294A Lecture notes 72 (2011), pp. 1–19. Cited by: §V.
 (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381 (6583), pp. 607. Cited by: §V.
 (2017) Automatic differentiation in pytorch. Cited by: §IV.
 (2007) A network that uses few active neurones to code visual input predicts the diverse shapes of cortical receptive fields. Journal of computational neuroscience 22 (2), pp. 135–146. Cited by: §I.
 (2016) Why should I trust you?: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §I.
 (1986) Learning representations by back-propagating errors. Nature 323 (6088), pp. 533. Cited by: §I.
 (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §I.
 (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §I.
 (2002) Occam’s razor as a formal basis for a physical theory. Foundations of Physics Letters 15 (2), pp. 107–135. Cited by: §V.
 (1964) A formal theory of inductive inference. part i. Information and control 7 (1), pp. 1–22. Cited by: §V.
 (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §I.
 (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §I.
 (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §I, §IVC1.
 (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: §I.
 (2017) ECG data compression using a neural network model based on multi-objective optimization. PLoS ONE 12 (10), pp. e0182500. Cited by: §V.
 (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §I.